A common filtering task is to find content by language. This is very easy with CSDL.
As content passes through the platform, it is assigned a content language along with a confidence score. To find content by language, use both the language.tag and language.confidence targets.
There are a number of factors that might impact the confidence score, such as:
- Multiple languages within the content: for example, content in Spanish may contain an English quote
- Use of unfamiliar slang or informal shortened words, such as 'm8' for 'mate'
- A high proportion of links or hashtags: for example, a Tweet may be nothing more than a link and a hashtag
Depending on your use case, you might want to test different confidence levels. In general, a confidence level of 80 (out of 100) or above fits most use cases.
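One way to get a feel for the threshold is to count how many items would survive at several confidence levels. The Python sketch below does this offline over a small sample. The sample values and the count_matches helper are invented purely for illustration and are not platform output or a platform API.

```python
# Hypothetical (language tag, confidence) pairs. These values are made up
# for illustration only; they are not real platform data.
SAMPLE = [("en", 95), ("en", 82), ("en", 61), ("es", 88), ("en", 40)]


def count_matches(records, tag, min_confidence):
    """Count records that a filter of the form
    language.tag == "<tag>" and language.confidence >= <min_confidence>
    would keep."""
    return sum(1 for t, c in records if t == tag and c >= min_confidence)


if __name__ == "__main__":
    # Compare how many "en" items pass at a few candidate thresholds.
    for threshold in (50, 80, 90):
        print(threshold, count_matches(SAMPLE, "en", threshold))
```

A lower threshold keeps more content but admits more uncertain classifications. A higher one is stricter and drops more borderline items.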
Simple Language Tags
language.tag == "en" and language.confidence >= 80
Or, for multiple languages:
language.tag in "en,fr,de" and language.confidence >= 80
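If you generate filters programmatically, a small helper can produce either form. This is a minimal Python sketch that only assembles the CSDL text shown above. The function name build_language_filter is hypothetical and it does not call any platform API.

```python
def build_language_filter(tags, min_confidence=80):
    """Return a CSDL filter string for one or more simple language tags.

    Hypothetical helper: it only formats text in the two forms shown above.
    """
    if not tags:
        raise ValueError("at least one language tag is required")
    if len(tags) == 1:
        # Single language: equality comparison.
        clause = 'language.tag == "%s"' % tags[0]
    else:
        # Several languages: comma-separated list with the "in" operator.
        clause = 'language.tag in "%s"' % ",".join(tags)
    return "%s and language.confidence >= %d" % (clause, min_confidence)


if __name__ == "__main__":
    print(build_language_filter(["en"]))
    print(build_language_filter(["en", "fr", "de"]))
```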
Extended Language Tags
For extended language codes, such as Simplified Chinese, you will need to use the language.tag_extended target.
language.tag_extended == "zh-cn" and language.confidence >= 80
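When filters are built in code, the choice between the two targets can be made from the language code itself. The Python sketch below assumes that an extended code is one containing a hyphen, as in "zh-cn". That rule is a guess for illustration, not documented platform behavior, and the function name filter_for_code is hypothetical.

```python
def filter_for_code(code, min_confidence=80):
    """Build a CSDL filter string, picking the target from the code's shape.

    Assumption (not confirmed by the platform documentation): codes with a
    hyphen, such as "zh-cn", are extended codes and need
    language.tag_extended; all other codes use language.tag.
    """
    target = "language.tag_extended" if "-" in code else "language.tag"
    return '%s == "%s" and language.confidence >= %d' % (
        target, code, min_confidence)


if __name__ == "__main__":
    print(filter_for_code("en"))
    print(filter_for_code("zh-cn"))
```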