Chinese Tokenization - Generate Accurate Insight From Chinese Sources Including Sina Weibo

Richard Caudle | 26th March 2014

We all know that China is a vitally important market for any international brand. Until recently it has been difficult to access conversations from Chinese networks, and tooling support for East Asian languages has been limited. This is why at DataSift we're proud not only to offer access to Sina Weibo but, just as importantly, to have greatly improved our handling of Chinese text so that you can get the most from this market.

The Challenges Of East Asian Social Data

Until now it has been difficult to access social insights from markets such as China, for two reasons:

  • Access to data: Almost all conversations take place on local social networks, rather than Twitter and Facebook. The ecosystem around these local networks has been less mature, and therefore gaining access has been more challenging.
  • Inadequate tooling: Even if you could gain access to these sources, the vast majority of tools are heavily biased towards European languages, trained on spacing and punctuation which simply don't exist in East Asian text. Inadequate tooling leads to poor and incomplete insights.

Happily our platform now solves both of these challenges for you. Firstly we now give you access to Sina Weibo. Secondly, we have greatly improved our handling of Chinese text, to give you exactly the same powers you'd expect when processing European languages. Specifically we support Mandarin, simplified Chinese text.

Incidentally, we also tokenize Japanese content which is a different challenge to Chinese text. The methods of tokenization are quite different but equally important to the accuracy of your filters. Read a detailed post here from our Data Science team.

Moving Beyond Substring Matching

In the past our customers have been able to filter Chinese content by using the substr operator. This can give you inaccurate results, though, because the same sequence of Chinese characters can have different meanings depending on where the word boundaries fall.

Take for example the brand Samsung, which is written as follows:

三星

These characters are also present in the phrase "three weeks" and in many place names. So a simple substr filter like the following could give you unwanted data:

interaction.content substr "三星"

It would match both of these sentences:

我爱我新的三星电视! (I love my new Samsung TV!)

我已经等我的包裹三星期了! (I've been waiting three weeks for my parcel to arrive!)

By tokenizing the text into words, we allow you to filter using operators such as contains, which match whole words rather than raw character sequences, so you receive more accurately filtered data.
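The difference can be illustrated outside the platform with a few lines of Python. This is only a sketch: the token lists are segmented by hand for illustration (an assumption, not output from the DataSift tokenizer), and `substr_match` and `contains_match` are hypothetical helpers that mimic the idea behind the two operators rather than how the platform implements them.

```python
from typing import List

# Hypothetical helpers that mimic the idea behind substr and contains.
# They are not DataSift's implementation.

def substr_match(text: str, needle: str) -> bool:
    """Match if the characters appear anywhere, ignoring word boundaries."""
    return needle in text

def contains_match(tokens: List[str], word: str) -> bool:
    """Match only if the word appears as a whole token."""
    return word in tokens

samsung_tv = "我爱我新的三星电视!"
three_weeks = "我已经等我的包裹三星期了!"

# Segmentations written by hand for illustration only.
samsung_tv_tokens = ["我", "爱", "我", "新的", "三星", "电视", "!"]
three_weeks_tokens = ["我", "已经", "等", "我的", "包裹", "三", "星期", "了", "!"]

for text, tokens in [(samsung_tv, samsung_tv_tokens),
                     (three_weeks, three_weeks_tokens)]:
    # Substring matching hits both sentences; token matching hits only the first.
    print(text, substr_match(text, "三星"), contains_match(tokens, "三星"))
```

Because "three weeks" is segmented as 三 followed by 星期, the whole-word check no longer confuses it with the brand name.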

Tokenization 101

The key to handling Chinese text accurately is intelligent tokenization. Through tokenization we can provide you with our full range of text matching operators, rather than simple substring matching.

I realise this might not be immediately obvious, so I'll explain using some examples.

Let's start with English. You probably already know that you can use CSDL (our filtering language) to look for mentions of words like so:

interaction.content contains_near "recommend,tv:4"

This will match content where the words 'recommend' and 'tv' are close together, such as:

Can anyone recommend a good TV?

This works because our tokenization engine internally breaks the content into words for matching, using spaces as word boundaries:

Can anyone recommend a good TV ?

With this tokenization in place we can run operators such as contains and contains_near.
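To make the idea concrete, here is a minimal Python sketch of space-based tokenization followed by a proximity check. It is not the platform's engine: `tokenize_english` and `contains_near` are hypothetical helpers, lowercasing is an assumption made so that 'TV' and 'tv' compare equal, and the distance rule (token positions no more than N apart) is an assumed reading of the `:4` in the filter above.

```python
import re
from typing import List

def tokenize_english(text: str) -> List[str]:
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def contains_near(tokens: List[str], first: str, second: str, distance: int) -> bool:
    """True if both words occur at positions no more than `distance` apart.

    The exact distance rule is an assumption made for this sketch.
    """
    first_positions = [i for i, token in enumerate(tokens) if token == first]
    second_positions = [i for i, token in enumerate(tokens) if token == second]
    return any(abs(i - j) <= distance
               for i in first_positions
               for j in second_positions)

tokens = tokenize_english("Can anyone recommend a good TV?")
print(tokens)
print(contains_near(tokens, "recommend", "tv", 4))
```

Once the content is a list of words, a proximity operator becomes a comparison of positions, which is exactly what cannot be done on an unsegmented stream of characters.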

However, with Chinese text there are no spaces between words. In fact Chinese text contains long streams of characters, with no hard and fast rules for word boundaries that can be simply implemented.

Chinese Tokenization

The translation of 'Can anyone recommend a good TV?' is:

你能推荐一个好的电视吗?

With the new Chinese tokenization support, internally the platform breaks the content into words as follows:

你 能 推荐 一个 好的 电视 吗

You can recommend a good television ?

The DataSift tokenizer uses a machine-learned model to select the most appropriate tokenization and gives highly accurate results. This model has been extensively trained and is constantly updated.
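To see why a model has to select a tokenization at all, the following Python sketch lists every way a short string can be split using a toy vocabulary. The vocabulary and the `all_segmentations` helper are made up for illustration and have nothing to do with how the DataSift model works internally. The sketch only shows that several candidate splits exist for the same characters.

```python
from typing import List, Set

def all_segmentations(text: str, vocabulary: Set[str]) -> List[List[str]]:
    """Return every way to split `text` into words found in `vocabulary`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in vocabulary:
            # Keep this word and segment the remainder recursively.
            for rest in all_segmentations(text[end:], vocabulary):
                results.append([word] + rest)
    return results

# Toy vocabulary, chosen by hand for this example.
vocabulary = {"三", "星", "期", "三星", "星期"}

for candidate in all_segmentations("三星期", vocabulary):
    print(" ".join(candidate))
```

Among the candidates are "三 星期" (three weeks) and "三星 期" (Samsung followed by a stray character). A tokenizer has to pick the reading that fits the surrounding sentence, which is the job the learned model performs.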

Our CSDL to match this would be:

interaction.content contains_near [language(zh)] "推荐,电视:4"

The syntax [language(zh)] tells the engine that you would like to tokenize content using Chinese tokenization rules.

Best Practice

To ensure the accuracy of the filter, we recommend you add further keywords or conditions. For example, the following filters for content containing both Samsung and TV:

interaction.content contains [language(zh)] "三星"

AND interaction.content contains [language(zh)] "电视"

This may seem like we're cheating(!), but in fact a native Chinese speaker would also rely on other surrounding text to decide that it is indeed Samsung the brand being discussed.
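The effect of the extra condition can be sketched in Python. In this illustration the second sentence is deliberately given an imperfect, hand-written segmentation (三星 followed by 期) to show what happens if a tokenizer were to pick the wrong split. The `contains_all` helper and the token lists are assumptions made for the example, not platform behaviour.

```python
from typing import List

def contains_all(tokens: List[str], words: List[str]) -> bool:
    """True only if every required word appears as a whole token."""
    return all(word in tokens for word in words)

# Hand-written segmentations. The second one is intentionally wrong
# ("三星" + "期" instead of "三" + "星期") to simulate a bad split.
posts = {
    "我爱我新的三星电视!": ["我", "爱", "我", "新的", "三星", "电视", "!"],
    "我已经等我的包裹三星期了!": ["我", "已经", "等", "我的", "包裹", "三星", "期", "了", "!"],
}

brand_only = [text for text, tokens in posts.items()
              if contains_all(tokens, ["三星"])]
brand_and_tv = [text for text, tokens in posts.items()
                if contains_all(tokens, ["三星", "电视"])]

print(len(brand_only), len(brand_and_tv))
```

With the brand keyword alone, the badly segmented parcel sentence slips through. Requiring 电视 as well leaves only the sentence that is really about a Samsung TV.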

Try It For Yourself

So in summary, not only do we now provide access to Chinese social networks but, just as importantly, our platform takes you beyond simple substring matching to give you much greater accuracy in your results.

If you don't have access to the Sina Weibo source you can start playing with Chinese tokenization immediately via Twitter. The examples above will work nicely because they work across all sources.

For a full reference on the new sources, please see our technical documentation.

To stay in touch with all the latest developer news please subscribe to our RSS feed.
