Filtering By Language

A common filtering task is to find content written in a particular language. CSDL supports this directly.

As content passes through the platform, it is assigned a content language along with a confidence score for that assignment. To find content by language, use both the language.tag and language.confidence targets.

Several factors can affect the confidence score, such as:

  • Multiple languages within the content - for example content in Spanish may contain an English quote
  • Use of unfamiliar slang, or informal shortened words - such as 'm8' for 'mate'
  • A high proportion of links or hashtags - for example a Tweet may consist of only a link and a hashtag

Depending on your use case, you might want to test different confidence levels. In general, a confidence level of 80 (out of 100) or above fits most use cases.
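
As a rough sketch of what testing different levels could look like, the same filter can be run with a stricter and a looser threshold and the results compared. These are two separate filters, and the values 90 and 60 are illustrative assumptions, not recommended settings:

// Stricter threshold: likely fewer results
language.tag == "en"
and language.confidence >= 90

// Looser threshold: likely more results, including short or link-heavy content
language.tag == "en"
and language.confidence >= 60

A higher threshold generally trades volume for precision, which may matter when the content you are filtering is short or informal.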

Simple Language Tags

For simple language codes, such as "en" for English or "fr" for French, use the language.tag and language.confidence targets:

language.tag == "en"
and language.confidence >= 80

Or, for multiple languages:

language.tag in "en,fr,de"
and language.confidence >= 80

Extended Language Tags

For extended language codes, such as Simplified Chinese, use the language.tag_extended target instead:

language.tag_extended == "zh-cn"
and language.confidence >= 80
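
If a single filter needs to cover both simple and extended codes, one possible approach is to group the two tag conditions in parentheses and apply the confidence threshold to the whole group. This is a sketch that assumes the two targets can be combined with the or operator and parentheses; the languages chosen are illustrative:

(
  language.tag in "en,fr"
  or language.tag_extended == "zh-cn"
)
and language.confidence >= 80

The parentheses make the grouping explicit, so the confidence condition applies to every language in the filter and not only to the last tag condition.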