Language Detection: How it Works

Our Language augmentation employs uni-grams and quad-grams to automatically detect the language in which a post is written.

Each language has a statistical 'fingerprint' which we can detect and use for identification purposes. For example, if we take 1,000 words of English and 1,000 words of Spanish, the number of instances of each letter will differ. In Spanish, the letter 'k' is rare but the digraph 'll' is very common. The uni-grams give us this simple but important character profile.
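As a rough illustration of the uni-gram idea, the Python sketch below counts how often each letter occurs in a sample and converts the counts to relative frequencies. The function name unigram_profile and the two sample sentences are our own assumptions for the example, not part of the augmentation itself, and samples this short are far too small to be a reliable fingerprint.

```python
from collections import Counter

def unigram_profile(text):
    """Return the relative frequency of each letter in `text`.

    Hypothetical helper for illustration only; the real augmentation
    may normalise or weight characters differently.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {ch: n / total for ch, n in counts.items()}

# Tiny illustrative samples (a real profile needs far more text).
english = unigram_profile("I do not like green eggs and ham")
spanish = unigram_profile("No me gustan los huevos verdes con jamón")

# 'k' appears in the English sample but not in the Spanish one.
print(english.get("k", 0.0), spanish.get("k", 0.0))
```

Comparing the two dictionaries letter by letter gives the simple character profile described above.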

By looking at the statistics of 4-character combinations (quad-grams), we can greatly improve our accuracy. To identify a language, we take each word and 'chunk' it by adding leading and trailing underscores. Let's look at a sentence such as "green eggs and ham". It becomes:

  • _green_
  • _eggs_
  • _and_
  • _ham_

When the augmentation generates the quad-grams we get:

  • _gre
  • gree
  • reen
  • een_
  • _egg
  • eggs
  • ggs_
  • ...and so on
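The chunking and quad-gram steps can be sketched in a few lines of Python. This is a minimal sketch that assumes each word is lowercased, padded with one leading and one trailing underscore, and then sliced on its own, so no quad-gram spans two words. The names chunk and quadgrams are hypothetical.

```python
def chunk(word):
    """Pad a word with a leading and a trailing underscore."""
    return f"_{word.lower()}_"

def quadgrams(text, n=4):
    """Return every n-character window of each chunked word, in order.

    Assumption: words are split on whitespace and processed one at a time.
    Words whose chunked form is shorter than n characters yield nothing.
    """
    grams = []
    for word in text.split():
        padded = chunk(word)
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

print(quadgrams("green eggs and ham"))
# ['_gre', 'gree', 'reen', 'een_', '_egg', 'eggs', 'ggs_',
#  '_and', 'and_', '_ham', 'ham_']
```

The underscores let the quad-grams record where a word starts and ends, which is why '_gre' and 'een_' are distinct from any mid-word combination.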

It is a simple matter to count how many times each combination appears and compare those figures against a database of quad-gram statistics.
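To make the counting and comparison step concrete, here is a small Python sketch. It rests on several assumptions of ours: the REFERENCE dictionary is a toy stand-in for the real database, built from one short sentence per language, and cosine similarity is just one plausible way to compare counts. The scoring method the augmentation actually uses is not specified here.

```python
import math
from collections import Counter

def quadgrams(text, n=4):
    """Quad-grams of each underscore-padded, lowercased word."""
    grams = []
    for word in text.lower().split():
        padded = f"_{word}_"
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

def cosine(a, b):
    """Cosine similarity between two quad-gram count tables."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Toy reference profiles (assumption): a real database would be built
# from a large corpus for every supported language.
REFERENCE = {
    "en": Counter(quadgrams("the green eggs and the ham are on the green table")),
    "es": Counter(quadgrams("los huevos verdes y el jamón están en la mesa verde")),
}

def detect(text):
    """Return the best-scoring language code and all scores."""
    profile = Counter(quadgrams(text))
    scores = {lang: cosine(profile, ref) for lang, ref in REFERENCE.items()}
    return max(scores, key=scores.get), scores

best, scores = detect("green eggs and ham")
print(best, scores)
```

With these toy profiles the sample sentence shares quad-grams only with the English reference, so 'en' scores highest. A production system would need much larger reference profiles and a way to report low-confidence results.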

Longer text samples, of course, provide a more reliable fingerprint.