Using Japanese Tokenization To Generate More Accurate Insight

Hiroaki Watanabe | 12th March 2014

At the heart of DataSift’s social data platform is a filtering engine that allows companies to target the text, content and conversations that they want to extract for analysis. We are proud to announce that we have expanded our platform to include Japanese, one of the fastest-growing international markets for Twitter.

Principles Of Tokenization

Supporting Japanese presents new challenges for how we can accurately filter to identify and extract relevant content and conversations. The main challenge to overcome is that Japanese, unlike Western languages, is written without word boundaries (i.e. whitespace).

Imagine tackling this challenge in English: try to recover a meaningful sentence from the following run of characters, taken from the first sentence of Lewis Carroll’s "Alice's Adventures in Wonderland".

Alicewasbeginningtogetverytiredofsittingbyhersisteronthebank,andofhavingnothingtodo:onceortwiceshehadpeepedintothebookhersisterwasreading,butithadnopicturesorconversationsinit,'andwhatistheuseofabook,'thoughtAlice'withoutpicturesorconversation?'

You may find it easy to complete this task, but the exercise involves two essential aspects of Natural Language Processing (NLP). From an algorithmic point of view:

  • Once there are multiple options for where a word boundary sits (Ali? Alice? Alicew?), the number of possible segmentations can grow exponentially in the worst case, and
  • a numerical score can help rank the possible outcomes.

Let us see how these two points apply to Japanese Tweets. The following five characters form a common sentence that can be tokenized into two meaningful blocks of characters:

まじ面白い == (tokenization) ==> まじ 面白い

in which a white space is inserted between “じ” and “面”. In NLP, this process is called “tokenization” or "word chunking".

The meaning of this sentence is “seriously (まじ) interesting (面白い)”. The first two characters, まじ, are popular slang often attached to sentiment-bearing words. Although “まじ” is a good indicator of sentiment, the same two characters also appear inside other common words (e.g. おまじない [good luck charm], すさまじい [terrible]) where the meaning of “まじ” (seriously) is no longer present.

This simple Japanese case study highlights that:

  • You cannot rely on a simple string-searching algorithm for keywords (i.e. searching for the substring まじ within the text), as it easily introduces false positives, and
  • the decision whether or not to insert a token boundary is affected by the surrounding characters.

Approaches For Japanese Tokenization

In industry, there are two main approaches to this tokenization problem: (a) morphological analysis and (b) N-grams. The N-gram approach systematically generates blocks of characters from training examples "without" considering their meaning, and produces numerical scores by counting the frequency of each block. Because of this brute-force strategy, processing can be slow and memory usage high; however, it handles new “unknown words” well, since no dictionary is required.

On DataSift's platform, we implemented the morphological approach for Japanese tokenization, since it has advantages in terms of “speed” and “robustness to noise”. One drawback of the standard morphological approach is its difficulty in handling unknown “new words”. Imagine encountering an unknown sequence of characters in the ‘Alice’ example above.

Our software engineers have solved this “new words” issue by extending the standard morphological approach. Thanks to this new algorithm, we can accurately tokenize noisy Japanese Tweets without dictionary updates.

Putting It Into Practice: Tips For Japanese CSDL

If you are familiar with our filtering language (CSDL), you can apply our new Japanese tokenizer by simply adding a new modifier, [language(ja)], as follows:

interaction.content contains [language(ja)] "まじ" and
interaction.content contains [language(ja)] "欲しい"

Note that “欲しい” means “want” in English.

You can mix Japanese and other languages as well:

interaction.content contains [language(ja)] "ソニー" or
interaction.content contains "SONY"

Note that the keyword “ソニー” is analyzed using our Japanese tokenizer, whereas our standard tokenizer is applied to the keyword “SONY” in this example.

Tagging (our rules-based classifier) also works for Japanese:

tag "positive" { interaction.content contains_any [language(ja)] "うれしい, 楽しい"}
tag "negative" { interaction.content contains_any [language(ja)] "悲しい, 楽しくない"}
tag "ソニー" { interaction.content contains [language(ja)] "ソニー"} 

return{
   interaction.content contains [language(ja)] "まじ"
}

Note that the first two lines contain sentiment keywords: “うれしい” (happy), “楽しい” (fun), “悲しい” (sad) and “楽しくない” (not fun).

Currently we support two main operators, “contains” and “contains_any”, with the [language(ja)] modifier. Our “substr” operator also works for Japanese, although it may introduce noise, as explained above:

interaction.content substr "まじ"

Advanced Filtering - Stemming

An advanced tip for increasing the number of filtering results is to take the “inflection” of the Japanese language into account. Since Japanese is an agglutinative language, word stems appear frequently in Tweets, and our morphological approach allows a “stem” to be used as a keyword.

For example, the following CSDL could find tweets containing “欲しい”, “欲しすぎて” or “欲しー”:

interaction.content contains [language(ja)] "欲し"

It’s worth mentioning that there is currently no perfect solution for tokenization; the N-gram approach is weak against noise, whereas the morphological approach may not recognize some new words. If you find that a filter produces no output, you may try our “substr” operator, which is our implementation of a string-search algorithm.

The above tagging example can be converted into a version that uses “substr” as follows:

tag "positive" { interaction.content substr "うれしい" or interaction.content substr "楽しい"}
tag "negative" { interaction.content substr "悲しい" or interaction.content substr "楽しくない"}
tag "ソニー" { interaction.content substr "ソニー"} 

return{
   interaction.content substr "まじ"
}

Working Example For Japanese Geo-Extraction

Extracting users’ geographical information is an interesting application. The following CSDL allows you to tag your filtered results with geo information for Tokyo (東京).

tag.location "東京" {
   twitter.user.description contains_any [language(ja)] "東京, tokyo" or
   twitter.user.location contains_any [language(ja)] "東京, tokyo" or
   twitter.retweet.user.description contains_any [language(ja)] "東京, tokyo" or
   twitter.retweet.user.location contains_any [language(ja)] "東京, tokyo"
}

return{
   interaction.content contains [language(ja)] "まじ"
}

Note that “まじ” is used as a keyword for filtering in this example.

In Summary

  • Tokenization is an important technique to extract correct signals from East Asian languages.
  • N-gram and Morphological analysis are the two main techniques available.
  • Datasift has implemented a noise-tolerant Morphological approach for Japanese with some extensions to handle new words accurately.
  • By adding our new modifier [language(ja)] in CSDL, you can activate our Japanese tokenization engine in our distributed system.
  • We can mix Japanese and other languages within a CSDL filter to realize unified and centralized data analysis.
