Language Detection v2.0

chrisg | 26th June 2012

Hello, my name is Christopher Gilbert, and I am a senior member of the DataSift engineering team. Today, I am pleased to announce the release of Language Detector v2.0, a major revision of the language augmentation service. The new language detector features an improved algorithm, which can detect more than 140 languages with a higher degree of accuracy than was previously possible.


Based on customer feedback, we saw an opportunity to improve the language augmentation service. Language detection is a difficult engineering problem given our target data set: a large number of Tweets contain abbreviations or even ASCII art, are relatively short, and frequently contain only one or two words. We started by comparing our language detector to other similar language detection services, and we saw that there was still room to improve on what we had.

When making a major change, a considerable amount of engineering effort goes into researching, developing, and testing various aspects of the new system. Besides the usual stability and load testing, we have meticulously profiled the effects of the new service, and gathered metrics which show an expected 30% improvement in overall accuracy.

New CSDL Target

As well as supporting more languages, we are introducing a new CSDL target called language.confidence. This is a percentage measure of how strongly the text sample correlates with the matched language. For example, if one piece of text contains both Spanish and French, then both languages will be detected, but only the top-scoring result will be matched. The language.confidence target allows users who require greater accuracy to restrict their filters to interactions that more closely resemble the language they wish to target.

For example, the following CSDL will filter for English interactions and exclude those for which the detector’s confidence level is low (60% or below).

language.tag == "en" and language.confidence > 60
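
To make the behaviour of that filter concrete, here is a minimal Python sketch that applies the same two conditions to a handful of made-up interactions. The nested "language" record layout, the sample texts, and the helper name matches are assumptions for illustration only; this is not how the DataSift platform evaluates CSDL internally.

```python
# Hypothetical interaction records. The nested "language" layout and the
# confidence values are assumptions made for this sketch.
interactions = [
    {"text": "Good morning everyone", "language": {"tag": "en", "confidence": 92}},
    {"text": "lol ok", "language": {"tag": "en", "confidence": 41}},
    {"text": "Buenos días a todos", "language": {"tag": "es", "confidence": 88}},
]


def matches(interaction, tag="en", min_confidence=60):
    """Mirror the CSDL: language.tag == "en" and language.confidence > 60."""
    language = interaction.get("language", {})
    return (
        language.get("tag") == tag
        and language.get("confidence", 0) > min_confidence
    )


# Keep only the interactions that satisfy both conditions.
kept = [i["text"] for i in interactions if matches(i)]
print(kept)
```

Only the first interaction passes: the second is tagged English but falls below the confidence threshold, and the third is confidently detected but as Spanish. Raising the threshold trades volume for precision, which is the choice the language.confidence target exposes.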

New Language Support

Arabic, Turkish, and several other languages that were not available in the previous version of the language detector are now supported, and can be targeted via their associated language codes. For a complete list of all supported languages and their corresponding language codes, please see the Developer Documentation.

Comparing v1.0 to v2.0

In the following chart, we can see how the new language detector compares to the old one when tested with a sample set consisting of 30,000 interactions, recorded over a 24-hour period.


This next chart shows the relative proportion of languages detected by the new language detector in the same sample set. Here we can see that English is still an important language, accounting for just over half of the data, with Japanese taking the next biggest slice of the pie with 10.3%, and Spanish coming a close third with 9.1%.



With this upgrade to the language detector augmentation, we are reinforcing our commitment to delivering a high-quality, cost-efficient service to our customers. The greater spread of supported languages, combined with the new language.confidence target, will enable customers to more easily filter interactions in their target demographic.
