Mandarin Tokenization

DataSift can automatically inject whitespace into Mandarin Chinese text to divide it into chunks before filtering. This pre-processing step allows our filtering engine to perform word or phrase matching correctly. Without this feature, you would be restricted to simple substring searches.


You can use Mandarin Chinese chunking with the contains and contains_any operators by adding an optional switch called "language". For example:

interaction.content contains [language(zh)] "微博"

If you want to skip automatic chunking, use:

interaction.content contains [language(none)] "微博"

By the way, "微博" (romanized as "weibo") is Mandarin Chinese for "microblog".


  1. Filter for a string. Here, we'll use the name of a Chinese TV show:

    interaction.content contains [language(zh)] "甄嬛传"

    In this example, DataSift adds whitespace between the second and third characters because the title of the show "甄嬛传" includes the name of a person in the first two characters (甄嬛).

    The three-character argument is the name of a new TV show, something that DataSift has not encountered before. Nevertheless, the chunking algorithm recognizes that it needs to add the space, creating a token from the first two characters and a token from the final character. The result is that all three of these tags match.

    tag "case1" {interaction.content any [language(zh)] "甄嬛传"}      #tv show
    tag "case2" {interaction.content any [language(zh)] "甄嬛"}       #甄嬛 detector
    tag "case3" {interaction.content any [language(zh)] "传"}         #传 detector

  2. Filter for comments from the youth demographic by recognizing commonly encountered Internet slang:

    interaction.content contains [language(zh)] "酷"

    The string "酷" means "cool" in Mandarin Chinese. This keyword is often used by young people and it frequently modifies another term.

    The word “残酷” (which is Mandarin Chinese for "cruel") happens to contain the string “酷”, but the chunking algorithm is intelligent enough to leave “残酷” as a single chunk. As a result, the filter above does not match content in which “酷” appears only as part of “残酷”.

  3. Combining these filters and adding tags, we have:

    tag "tv" {interaction.content contains [language(zh)] "甄嬛传"}
    tag "slang" {interaction.content contains [language(zh)] "酷"}
    return {
        interaction.type == "twitter" and (
            interaction.content contains [language(zh)] "甄嬛传" and
            interaction.content contains [language(zh)] "酷"
        )
    }
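
To make the difference between chunk-based matching and plain substring search concrete, here is a minimal Python sketch. It is not DataSift's implementation: the chunking algorithm itself is not described here, so the chunked strings are written out by hand, and the sample sentences and the function names substring_match and token_match are assumptions made up for illustration.

```python
from typing import Sequence


def substring_match(text: str, keyword: str) -> bool:
    """Plain substring search, the fallback when no chunking is applied."""
    return keyword in text


def token_match(chunked_text: str, keyword_tokens: Sequence[str]) -> bool:
    """Return True if the keyword's chunks appear as consecutive
    whitespace-delimited chunks of chunked_text."""
    wanted = list(keyword_tokens)
    if not wanted:
        return False
    tokens = chunked_text.split()
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))


if __name__ == "__main__":
    # Hand-chunked sample: the show title is split into "甄嬛" and "传",
    # as described in example 1.
    show = "我 喜欢 甄嬛 传"
    print(token_match(show, ["甄嬛", "传"]))  # True
    print(token_match(show, ["甄嬛"]))        # True
    print(token_match(show, ["传"]))          # True

    # Hand-chunked sample: "残酷" stays a single chunk, as in example 2.
    print(substring_match("现实很残酷", "酷"))   # True: substring search over-matches
    print(token_match("现实 很 残酷", ["酷"]))   # False: "酷" is not a chunk here
    print(token_match("这 首 歌 很 酷", ["酷"]))  # True: "酷" stands alone
```

The sketch only models the matching step. Deciding where the whitespace goes is the hard part, and that is the work the language(zh) switch asks DataSift to do for you.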


The Query Builder does not support the language() modifier for Mandarin tokenization.

If you need to use this modifier, convert your rules into CSDL and edit them in the CSDL Code Editor.