Japanese Tokenization

Japanese is written without spaces between words, so DataSift can automatically inject whitespace into Japanese text to divide it into tokens before filtering. This pre-processing step allows the filtering engine to match whole words and phrases correctly. Without it, you would be restricted to simple substring searches.


You can use Japanese tokenization with the contains and contains_any operators by adding an optional switch called "language". For example:

   interaction.content contains [language(ja)] "データシフト"

If you want to skip automatic tokenization, use:

   interaction.content contains [language(none)] "データシフト"

In these examples, "データシフト" is "DataSift" written in Japanese.
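
The language switch is placed the same way with contains_any. The following is a minimal sketch only: it assumes that contains_any takes its usual comma-separated list of terms in a single string, and it borrows the two search terms from the examples on this page:

   interaction.content contains_any [language(ja)] "半沢直樹, まじ"

Under those assumptions, this filter would pass an interaction whose tokenized content contains either term. It is an illustration and has not been run against a live stream.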


  1. Filter for a string. Here, we'll use the name of a Japanese TV show:

    interaction.content contains [language(ja)] "半沢直樹"

    In this example, DataSift adds whitespace between the second and third characters because the title of the show "半沢直樹" is also the name of a person. The first two characters "半沢" are the last name and the remaining characters "直樹" are the first name.

    Because the show is new, its title is a string that DataSift has not encountered before. Nevertheless, the tokenization algorithm handles this new "word" correctly.

  2. Filter for comments from the youth demographic by recognizing commonly encountered Internet slang:

    interaction.content contains [language(ja)] "まじ"

    The string "まじ" means "seriously" in Japanese. This keyword is often used by young people, and it frequently modifies another term. For example, "seriously interesting" is "まじ面白い".

    The word "まじない" (Japanese for "magic spell") happens to contain the string "まじ", but the tokenization algorithm leaves "まじない" as a single token, so this filter does not match it.

  3. Combining these filters and adding tags, we have:

    tag "tv" {interaction.content contains [language(ja)] "半沢直樹"}
    tag "slang" {interaction.content contains [language(ja)] "まじ"}
    return {
        interaction.type == "twitter" and (
            interaction.content contains [language(ja)] "半沢直樹" and
            interaction.content contains [language(ja)] "まじ"
        )
    }

The Query Builder does not support the language() modifier for Japanese tokenization.

If you need to use this modifier, convert your rules into CSDL and edit them in the CSDL Code Editor.