By Richard Caudle

Announcing Tencent Weibo - Broaden Your Coverage Of Chinese Conversation

In a previous post I discussed how we're broadening our reach to help you get the best out of East Asian sources, for example with our Chinese tokenization engine.
To build on this momentum, I'm excited to announce a new data source for Tencent Weibo, another huge Chinese network you'll be eager to get your hands on. Now you can build more comprehensive solutions for the Chinese market with ease.

Tencent Weibo - A Key Piece In Your Chinese Social Jigsaw

China has the most active social network community in the world. With over 600 million Internet users on average spending 40% of their online time using social networks, there's an awful lot of conversation out there which no doubt you'd love to harness.
There are a wide variety of social networks used in China; one of the largest is Tencent Weibo. Tencent Weibo gives great coverage of 3rd and 4th tier cities, essentially emerging markets which already have large populations and are experiencing massive growth. To generate full insight, and maximum opportunity, from Chinese markets, it is essential that you listen to these conversations.

Understanding Tencent Weibo Interactions

Tencent Weibo is modelled largely on Twitter. Just like Twitter, users can write posts of up to 140 characters and share photos and videos. As a result Tencent Weibo lends itself to use cases similar to those you may already have set up with Twitter.
We expose as much data as possible to you through targets. A full list of the Tencent Weibo targets can be found in our documentation. Here are a few highlights to get you started though.

Types of Interaction

Tencent also has its own types of activity, which are very similar to Twitter's. A 'post' is the equivalent of a tweet, and a 'repost' is the equivalent of a retweet.
A reply is slightly different, however. If you reply on Twitter, you mention the user you are replying to. On Tencent Weibo, when you reply you are continuing a specific thread and do not need to mention the user you are replying to.
To distinguish between these types you can use the tencentweibo.type target.

Thread ID

As I mentioned above Tencent Weibo runs a threaded conversation model. You can filter to certain conversations by using the thread ID, exposed by the tencentweibo.thread_id target.
This is very useful because you can, for example, pick up a first post discussing a topic you're interested in, make a note of the thread ID, and then track any replies which follow.
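As a minimal sketch, a filter for one conversation might look like this (the thread ID value here is a placeholder, not a real ID):

```csdl
// Placeholder thread ID, captured from an earlier matching post
tencentweibo.thread_id == "1234567890"
```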

Author's Influence

Frequently you'll want to know a little more about the content's author. Three useful pieces of metadata you can work with are:
  • The number of followers a user has
  • The number of users the user follows
  • The number of posts the user has created
Commonly we use similar features to identify spam on Twitter. For example we might filter out content from users who follow a high number of users, but themselves have few followers, as this is a common signature for a bot.

Tencent In Action

Ok, so you've decided that you want to tap into the world of Tencent Weibo conversation. How does this work in practice? Let's look at a quick example.
A common use of the new data source will be brand monitoring, so let's write some CSDL that picks out well-known brands from Tencent Weibo chatter. For this example I'm going to use the targets I discussed above to filter down to influential authors who are posting original content; this will give us the most pertinent data for our use case.
To filter to influential users I can apply a threshold to the author's followers count target: >= 10000
To filter to original posts (so exclude replies and reposts) I can use the tencentweibo.type target: 
tencentweibo.type == "post"
To filter to a list of brands I'm interested in (Coca-Cola, Walmart, etc.): 
tencentweibo.text contains_any [language(zh)] "可口可乐, 谷歌, 沃尔玛, 吉列, 亚马逊, 麦当劳, 联合利华, 葛兰素史克, 路虎, 维珍航空"
Trust me for now on the translations! Things will get clearer in a minute.
The expression I've used here uses the tencentweibo.text target, which exposes the text content of the post. Following this I make use of Chinese tokenization, using the [language(zh)] switch as explained in my previous post to ensure accurate matching of my brand names.
My finished filter becomes: 
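A sketch of the combined conditions follows. Note that the followers count target name used here is an assumption for illustration; check the Tencent Weibo target reference for the exact name:

```csdl
// Assumed followers target name; see the Tencent Weibo target documentation
tencentweibo.author.followers_count >= 10000
and tencentweibo.type == "post"
and tencentweibo.text contains_any [language(zh)] "可口可乐, 谷歌, 沃尔玛, 吉列, 亚马逊, 麦当劳, 联合利华, 葛兰素史克, 路虎, 维珍航空"
```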
So now I have a stream of original content from influential authors discussing my collection of brands. In just a few minutes I've written an extremely powerful definition.

A Helping Hand From VEDO

Honestly, I struggle when working with Chinese data, because I can't speak a word of Mandarin or Cantonese. (I did once spend a month in China and picked up my Chinese nickname of 'silver dragon', but unfortunately I got no further.) Fortunately I can make use of VEDO tagging to help me understand the data.
I can write a simple tag to pick out each brand mention, for example "Coca-Cola", as follows:
tag.brand "Coca-Cola" { tencentweibo.text contains [language(zh)] "可口可乐" }
Notice that tag.brand is part of VEDO tagging; it declares a namespace for the "Coca-Cola" tag which follows. The braces that follow the tag contain an expression which, if matched for an interaction, will cause the tag to be applied to that interaction. When the data arrives at my application it is tagged with the brand name in English, which makes it much easier for me to work with.
Remember that VEDO tags are applied to data that has first been filtered by the expression wrapped in the return clause. In my final definition I'll add a line for each brand.
For a refresher on VEDO, please take a look at my earlier posts.

Putting It All Together

I can put my filter together with my tags by wrapping the filter in a return clause. My completed CSDL is as follows:
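A sketch of what the completed definition might look like, with one tag.brand line per brand (the followers count target name is again an assumption for illustration):

```csdl
tag.brand "Coca-Cola" { tencentweibo.text contains [language(zh)] "可口可乐" }
tag.brand "Google"    { tencentweibo.text contains [language(zh)] "谷歌" }
tag.brand "Walmart"   { tencentweibo.text contains [language(zh)] "沃尔玛" }
// ...and so on for each remaining brand...
return {
  // Assumed followers target name; see the Tencent Weibo target documentation
  tencentweibo.author.followers_count >= 10000
  and tencentweibo.type == "post"
  and tencentweibo.text contains_any [language(zh)] "可口可乐, 谷歌, 沃尔玛, 吉列, 亚马逊, 麦当劳, 联合利华, 葛兰素史克, 路虎, 维珍航空"
}
```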
Running this stream in preview you can see that conversation on Tencent Weibo is being nicely categorised so it can be much more easily understood.

Over To You...

This concludes my whirlwind introduction to Tencent Weibo. Technology aside, it's definitely worth emphasising again that Tencent Weibo is a vital source if you want to maximise opportunities in Chinese marketplaces. 
For a full reference on Tencent Weibo targets, please see our technical documentation.
To stay in touch with all the latest developer news please subscribe to our RSS feed.
By Richard Caudle

Chinese Tokenization - Generate Accurate Insight From Chinese Sources Including Sina Weibo

We all know that China is a vitally important market for any international brand. Until recently it has been difficult to access conversation from Chinese networks, and tooling support for East Asian languages has been limited. This is why at DataSift we're proud not only to now offer access to Sina Weibo, but, equally important, to have greatly improved our handling of Chinese text to allow you to get the most from this market.

The Challenges Of East Asian Social Data

Until now it has been difficult to access social insights from markets such as China, for two reasons:
  • Access to data: Almost all conversations take place on local social networks, rather than Twitter and Facebook. The ecosystem around these local networks has been less mature, and therefore gaining access has been more challenging.
  • Inadequate tooling: Even if you could gain access to these sources, the vast majority of tools are heavily biased towards European languages, trained on spacing and punctuation which simply don't exist in East Asian text. Inadequate tooling leads to poor and incomplete insights.
Happily our platform now solves both of these challenges for you. Firstly, we now give you access to Sina Weibo. Secondly, we have greatly improved our handling of Chinese text to give you exactly the same powers you'd expect when processing European languages. Specifically, we support simplified Chinese (Mandarin) text.
Incidentally, we also tokenize Japanese content, which presents a different challenge from Chinese text: the methods of tokenization are quite different but equally important to the accuracy of your filters. Read a detailed post here from our Data Science team.

Moving Beyond Substring Matching

In the past our customers have been able to filter Chinese content using the substr operator. This can give you inaccurate results, though, because the same sequence of Chinese characters can have different meanings.
Take for example the brand Samsung, which is written 三星.
These characters are also present in the phrase "three weeks" and in many place names. So a simple filter using substr like so could give you unwanted data:
interaction.content substr "三星"
It would match both of these sentences:
我爱我新的三星电视!  (I love my new Samsung TV!)
我已经等我的包裹三星期了!  (I've been waiting three weeks for my parcel to arrive!)
By tokenizing the text into words, and allowing our customers to filter using operators such as contains, our customers can now receive more accurately filtered data.

Tokenization 101

The key to handling Chinese text accurately is through intelligent tokenization. Through tokenization we can provide you with our full range of text matching operators, rather than simple substring matching. 
I realise this might not be immediately obvious, so I'll explain using some examples.
Let's start with English. You probably already know you can use CSDL (our filtering language) to look for mentions of words like so:
interaction.content contains_near "recommend,tv:4"
This will match content where the words 'recommend' and 'tv' are close together, such as: 
Can anyone recommend a good TV?
This works because our tokenization engine internally breaks the content into words for matching, using spaces as word boundaries:
Can anyone recommend a good TV ?
With this tokenization in place we can run operators such as contains and contains_near.
However, with Chinese text there are no spaces between words. In fact Chinese text contains long streams of characters, with no hard and fast rules for word boundaries that can be simply implemented.

Chinese Tokenization

The translation of 'Can anyone recommend a good TV?' is:
推荐一个好的电视？
With the new Chinese tokenization support, internally the platform breaks the content into words as follows:
推荐 一个 好的 电视 ？
(recommend | a | good | TV | ?)
The DataSift tokenizer uses a machine-learned model to select the most appropriate tokenization and gives highly accurate results. This learned model has been extensively trained and is constantly updated.
Our CSDL to match this would be:
interaction.content contains_near [language(zh)] "推荐,电视:4"
The syntax [language(zh)] tells the engine that you would like to tokenize content using Chinese tokenization rules.

Best Practice

To ensure the accuracy of the filter, we recommend you add further keywords or conditions. For example, the following filters for content containing both Samsung and TV:
interaction.content contains [language(zh)] "三星"
AND interaction.content contains [language(zh)] "电视"
This may seem like we're cheating(!), but in fact a native Chinese speaker would also rely on other surrounding text to decide that it is indeed Samsung the brand being discussed.

Try It For Yourself

So in summary, not only do we now provide access to Chinese social networks, but just as important our platform takes you beyond simple substring matching to give you much greater accuracy in your results.
If you don't have access to the Sina Weibo source you can start playing with Chinese tokenization immediately via Twitter. The examples above will work nicely because CSDL filters work across all sources.
For a full reference on the new sources, please see our technical documentation.
To stay in touch with all the latest developer news please subscribe to our RSS feed.
By Hiroaki Watanabe

Using Japanese Tokenization To Generate More Accurate Insight

At the heart of DataSift’s social data platform is a filtering engine that allows companies to target the text, content and conversations that they want to extract for analysis. We are proud to announce that we have expanded our platform to include Japanese, one of the fastest growing international markets for Twitter.

Principles Of Tokenization

Japanese provides new challenges for how we can accurately filter, identify and extract relevant content and conversations. The main challenge to overcome is that Japanese, unlike Western languages, is written without word boundaries (i.e. whitespace).
Imagine tackling this challenge in English: try to recover a meaningful sentence from this sequence of characters, taken from the opening of Lewis Carroll's "Alice's Adventures in Wonderland":
asreading,butithadnopicturesorconversationsinit,'andwhatistheuseof
You may find it easy to complete this task, but two important essences of Natural Language Processing (NLP) are involved in this exercise. From an algorithmic point of view:
  • Once we have options for where word boundaries sit (Ali? Alice? Alicew?), the number of possibilities could increase exponentially in the worst case, and
  • A numerical score may help to rank the possible outcomes.
Let us see how these two points are relevant to Japanese Tweets. The following five characters form a popular sentence that can be tokenized into the two blocks of characters with meaning:
まじ面白い    == (tokenization) ==>     まじ  面白い
in which a white space is inserted between “じ” and “面”. In NLP, this process is called “tokenization” or "word chunking".
The meaning of this sentence is “seriously (まじ) interesting (面白い)". The first two characters, まじ, are popular slang often attached to sentiment words. Although “まじ” is a good indicator of sentiment, we can find the same characters in other common words (e.g. おまじない [good luck charm], すさまじい [terrible]) where the meaning of “まじ” (seriously) is no longer present.
This simple Japanese case study highlights that:
  • You cannot apply a simple string-searching algorithm for keywords (i.e. searching for the substring まじ within the text), as it can easily introduce errors
  • The decision whether or not to tokenize can be affected by surrounding characters.


Approaches For Japanese Tokenization

In industry, there are two main approaches to solving this tokenization problem: (a) Morphological analysis and (b) N-gram. The N-gram approach generates blocks of characters systematically from training examples, "without" considering their meanings, and generates numerical scores by counting the frequency of each block. Because of this brute-force approach, processing can be slow with large memory usage; however, it is strong at handling unknown “new words”, since no dictionary is needed.
In DataSift's platform, we implemented the Morphological approach for Japanese tokenization, since it has advantages in terms of “speed” and “robustness to noise”. One drawback of the standard Morphological approach is its difficulty handling unknown “new words”. Imagine the case where you see an unknown sequence of characters in the ‘Alice’ example.
Our software engineers have provided a great solution to this “new words” issue by extending the standard Morphological approach. Thanks to our new algorithm, we can accurately tokenize noisy Japanese Tweets without dictionary updates.

Putting It Into Practice: Tips For Japanese CSDL

If you are familiar with our filtering language (CSDL), you can apply our new Japanese tokenizer by simply adding a new modifier, [language(ja)], as follows:
interaction.content contains [language(ja)] "まじ" and
interaction.content contains [language(ja)] "欲しい"
Note that “欲しい” is “want” in English.
You can mix Japanese and other languages as well:
interaction.content contains [language(ja)] "ソニー" or
interaction.content contains "SONY"
Note that the keyword “ソニー” is analyzed using our Japanese tokenizer whereas our standard tokenizer is applied for the keyword “SONY” in this example.
Tagging (our rules-based classifier) also works for Japanese:
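The original example isn't reproduced here, but based on the sentiment words it used, a sketch might look like this (the tag names and the return condition are illustrative):

```csdl
tag "positive" { interaction.content contains_any [language(ja)] "うれしい,楽しい" }
tag "negative" { interaction.content contains_any [language(ja)] "悲しい,楽しくない" }
return {
  interaction.content contains [language(ja)] "欲しい"
}
```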
Note that the first two lines contain the sentiment words “うれしい” (happy), “楽しい” (fun), “悲しい” (sad) and “楽しくない” (not fun).
Currently we support two main operators, “contains” and “contains_any”, for the [language(ja)] modifier. Our “substr” operator also works for Japanese, although it may introduce noise, as I explained above:
interaction.content substr "まじ"

Advanced Filtering - Stemming

An advanced tip to increase the number of filtering results is to consider the “inflection” of the Japanese language. Since Japanese is an agglutinative language, stems of words appear more often in Tweets. Our Morphological approach allows us to use “stem” as a keyword.
For example, the following CSDL could find tweets containing “欲しい”, “欲しすぎて”, or “欲しー”:
interaction.content contains [language(ja)] "欲し"
It’s worth mentioning that there is no perfect solution for tokenization at the moment; the N-gram approach is weak against noise, whereas the Morphological approach may not understand some new words. If you find that a filter produces no output, you may try our “substr” operator, which is our implementation of a string-search algorithm.
The above tagging example can be converted into a version that uses “substr” as follows:
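A sketch of a substr-based version of the sentiment tagging (tag names and the return condition are illustrative):

```csdl
tag "positive" { interaction.content substr "うれしい" or interaction.content substr "楽しい" }
tag "negative" { interaction.content substr "悲しい" or interaction.content substr "楽しくない" }
return {
  interaction.content substr "欲しい"
}
```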

Working Example For Japanese Geo-Extraction

Extracting users’ geographic information is an interesting application. The following CSDL allows you to tag your filtered results with geo information, in this case Tokyo (東京).
Note that “まじ” is used as a keyword for filtering in this example.
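The original example isn't reproduced here, but a minimal sketch might look like the following; the location target used for the geo tag is an assumption, so check the target reference for the source you are filtering:

```csdl
// twitter.user.location is an assumed target, used for illustration only
tag "Tokyo" { twitter.user.location contains "東京" }
return {
  interaction.content contains [language(ja)] "まじ"
}
```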

In Summary

  • Tokenization is an important technique to extract correct signals from East Asian languages.
  • N-gram and Morphological analysis are the two main techniques available.
  • Datasift has implemented a noise-tolerant Morphological approach for Japanese with some extensions to handle new words accurately.
  • By adding our new modifier [language(ja)] in CSDL, you can activate our Japanese tokenization engine in our distributed system.
  • We can mix Japanese and other languages within a CSDL filter to realize unified and centralized data analysis. 
By Richard Caudle

Announcing LexisNexis - Monitor Reputation, Threats & Opportunities Through Global News Coverage

At DataSift we are chiefly known for our social data coverage, but increasingly you will see us broadening our net. LexisNexis provides news content from more than 20,000 media outlets worldwide, including content from newspapers, consumer magazines, trade journals, key blogs and TV transcripts. As such it provides an unrivalled source for reputation management, opportunity identification and risk management.

The LexisNexis Source

LexisNexis is a long-established, highly regarded provider of news coverage which is already relied upon by a wide range of organisations worldwide. The LexisNexis source, now available on our platform, gives you a compliant source for fully licensed, full text articles. The breadth of LexisNexis's coverage is truly impressive, and when put alongside our social data sources opens up a whole new range of possibilities to you.

How Could You Use It?

Social data, although rich with opinion and potential insight, is only one part of the picture. In many cases, to get a full picture you will want to see how a topic is being covered in the published media.
Some use cases that spring to mind include:
  • Reputation management: Spot important trends, new opportunities and potential threats and act on them before anyone else. By monitoring news content you can proactively monitor negative opinions, adverse developments and identify risks. Alongside LexisNexis you could add social data sources, so you can monitor reputation across both social networks and published media.
  • Opportunity identification: By staying on top of the latest news stories, companies can anticipate customers' emerging needs and stay one step ahead of their competition. LexisNexis covers newspapers, press releases, specialist trade journals and regional publications so you can stay on top of breaking news.
  • Risk monitoring: There are many factors that can impact business performance, including the state of local economies, political upheaval and legislative change. Using LexisNexis news and legal coverage, keep abreast of issues that impact your suppliers and clients, and changes in local markets that could harm your business around the globe.

An Example Filter

To make things a little more concrete, let's consider the example of reputation management. 
Let's imagine I work for a large corporation and I want to monitor what is being said about my corporation in my local market across magazines, newspapers and by broadcasters. I can listen for mentions and alert my PR team, who can take steps to redress or amplify the coverage as necessary.
A simple example filter could be:
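A minimal sketch, assuming index target names along the lines of those below (check the LexisNexis target reference for the exact names and operators):

```csdl
// Index target names are assumptions for illustration
lexisnexis.indexing.company contains "Apple"
and lexisnexis.indexing.country contains "USA"
```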
Using a DataSift destination I could integrate this data into my existing tools and systems as it arrives, and inform my PR team.

LexisNexis SmartIndexing Technology™

As a quick aside, this seems a good time to discuss indexing and categorisation. LexisNexis, through its SmartIndexing Technology, provides comprehensive indexing of content. This indexing identifies subjects, industries, companies, organizations, people and places, and is exposed through the platform under the lexisnexis.indexing property. LexisNexis's advanced indexing operates beyond explicit keywords, identifying topics that are implied through context and previous experience.
This indexing feature greatly simplifies your queries and gives the content far richer context and meaning which you can take advantage of. This of course adds to the augmentations and custom categorisation features of the DataSift platform.
You can see in the example above that I've used the company and country indexes to filter to Apple plus the USA. Filtering for 'Apple' using keywords alone would give ambiguous results, so the indexing feature is extremely valuable here and gives much more accurate results.

LexisNexis + VEDO

Taking the example above one step further, I can also take advantage of VEDO tagging & scoring.
For instance, I can use scoring to give a notion of priority to the mentions so I can inform my PR team which are the most important mentions to act upon. As an illustrative example:
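A hedged sketch of what such scoring rules might look like; the keywords, scores and index target names here are all illustrative, so consult the VEDO scoring documentation for the exact syntax:

```csdl
// Higher scores flag mentions the PR team should act on first (illustrative)
tag.priority +10 { interaction.content contains "recall" }
tag.priority +5  { interaction.content contains "lawsuit" }
return {
  // Assumed index target names, for illustration only
  lexisnexis.indexing.company contains "Apple"
  and lexisnexis.indexing.country contains "USA"
}
```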
When the data is received by my PR team they can now easily prioritise their actions based on the scoring rules.

Can The LexisNexis Source Help You?

The addition of LexisNexis to the DataSift source family is an exciting step as use cases such as reputation and risk management are now so vital to organisations. Watch this space for further announcements on new sources as we continue to expand from our social roots.
For a full reference on the new source, please see our technical documentation.
To stay in touch with all the latest developer news please subscribe to our RSS feed and keep an eye on our Twitter account @DataSiftDev.
By Richard Caudle

Introducing The MySQL Destination - Integrate Data Effortlessly Into Your Enterprise Solution

One key challenge for developers creating a solution is integrating data sources, often many of them. DataSift destinations take away this headache, especially the recently released MySQL destination.

The MySQL destination allows you to map and flatten unstructured data to your database schema, avoids the need for custom integration code, and handles realtime delivery challenges such as dropped connections so you don't have to.

Relieving Integration Challenges

The DataSift platform offers many awesome places to push data, but, let's face it, we all like to see data in a good old-fashioned database. Relational databases such as MySQL are still the backbone of enterprise solutions.

Receiving a stream of unstructured data, structuring it, then pushing the data into a relational database can cause a number of headaches. The new MySQL destination makes the job straightforward so that you can concentrate on getting maximum value out of your data. It provides the following features:

  • Guaranteed delivery - Data delivery is buffered and caters for dropped connections and delivery failure
  • Delivery control - Data delivery can be paused and resumed as you require under your control
  • Data mapping - Specify precisely how you want fields (within each JSON object) to be mapped to your MySQL schema

These features combined make pushing data from DataSift into a MySQL database extremely easy.

The MySQL Destination

As with any other type of destination, the easiest way to get started is to go to the Destinations page. Choose to add a new MySQL destination to your account.

Note that the MySQL destination is only currently available to enterprise customers. Contact your sales representative or account manager if you do not see the destination listed in your account.


To set up the destination you need to enter a name, the host and port of your MySQL server, the destination database schema and authentication details.

You also need to provide a mappings file. This file tells the destination which fields within the JSON data you would like to be mapped to tables in your database schema. More details on this in a minute.

It's worth using the Test Connection button as this will check that your MySQL server is accessible to our platform, the database exists, the security credentials are valid and that the mapping file is valid.

Note that you can also create the destination via our API. This process is documented here.

Mapping Data To A Schema

The basic connection details above are self-explanatory, but the mapping file definitely needs a little more explanation. There are many things to consider when mapping unstructured data to a relational set of tables.

Let me take you through an example schema and mapping file to help clarify the process. These have been designed to work with Twitter data. The two files I'll be discussing are an example MySQL schema and its matching mapping file.

MySQL Schema

In the example schema the following tables are included, which give us a structure to store the tweets.

  • interaction - Top-level properties of each interaction / tweet. All tables below reference interactions in this table.
  • hashtags - Hashtags mentioned for each interaction
  • mentions - Twitter mentions for each interaction
  • links - Links for each interaction
  • tag_labels - VEDO tags for each interaction
  • tag_scores - VEDO scores for each interaction

The example schema is quite exhaustive, please don't be put off! You can more than likely use a subset of fields and tables to store the data you need for your solution. You might also choose to write views that transform data from these tables to fit your application.

Now's not the time to cover MySQL syntax; I'm sure if you're reading this post you'll be used to creating schemas. Instead I'll move on to the mapping file, which is where the magic lies.

Mapping File

The mapping file allows you to specify what tables, columns and data types the raw data should be mapped to in your schema. I can't cover every possibility in one post, so for full details see our technical documentation pages. To give you a good idea though, I'll pick out some significant lines from the example mapping file.

Let's pretend we have the following interaction in JSON (I removed many fields for brevity):
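The original sample isn't reproduced here, but based on the fields referenced in the rest of this walkthrough, a minimal interaction might look like this (the values are invented for illustration):

```json
{
  "interaction": {
    "created_at": "Mon, 06 Jan 2014 10:34:00 +0000",
    "content": "Loving the new TV campaign #social #marketing",
    "hashtags": ["social", "marketing"],
    "tag_tree": {
      "brand": ["Coca-Cola"]
    }
  }
}
```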


Tables, Datatypes & Transforms

The first line tells the processor you want to map the following columns of the 'interaction' table to fields in the JSON structure.
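That line isn't shown here, but by analogy with the table-mapping headers that appear later (e.g. the [hashtags ...] declaration), it is presumably just the table name in square brackets:

```ini
[interaction]
```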


The next line tells the processor to map the path to the interaction_id column of the table:

interaction_id =

Skipping a couple of lines, the following tells the processor to map interaction.created_at to the created_at column. You'll notice though that we have additional data_type and transform clauses.

created_at = interaction.created_at (data_type: datetime, transform: datetime)

If you don't explicitly specify a data_type then the processor will attempt to decide the best type for itself by inspecting the data value. In the majority of cases this is perfectly ok, but in this line we ensure that the type is a datetime.

The transform clause gives you access to some useful functions. Here we are using the datetime function to cast the string value in the data to a valid datetime value.

Later on for the same table you'll see this line which uses a different transform function:

is_retweet = (data_type: integer, transform: exists)

Here the function will return true if the JSON object has this path present, otherwise it will return false.



Now let's move down to the hashtags table mapping. You'll see this as the first line:

[hashtags :iter = list_iterator(interaction.hashtags)]

This table mapping uses an iterator to map the data from an array to rows in a table. The line specifies that any items within the interaction.hashtags array should each be mapped to one row of the hashtags table. For our example interaction, a row would be created for each of 'social' and 'marketing'.

Note that we can refer to the current item in the iterator by using the :iter variable we set in the table mapping declaration:

hashtag = :iter._value

Here _value is a reserved property representing the value of the item in the array. You can also access _path, which is the relative path within the object of the value. If we were using a different type of iterator, for example over an array of objects, we could reference properties of the current object, such as :iter.name (taking a hypothetical name property as an example).

There are a number of iterators you can use to handle different types of data structure:

  • list_iterator - Maps an array of values at the given path to rows of a database table.
  • objectlist_iterator - Like list_iterator, but instead is used to iterate over an array of objects, not simple values.
  • path_iterator - Flattens all properties inside an object, and its sub-objects, to give you a complete list of properties in the structure.
  • leaf_iterator - Like path_iterator, however instead of flattening object properties, instead flattens any values in arrays within the structure to one complete list.
  • parallel_iterator - Given a path in the JSON object, this iterator takes all the arrays which are children and maps the items at each index to a row in the table. This is particularly useful for working with links.

The iterators are powerful and allow you to take deep JSON structures and flatten them to table rows. Please check out the documentation for each iterator for a concrete example.

As a further example, the following line specifies mapping for VEDO tags that appear in the tag_tree property of the interaction:

[tag_labels :iter = leaf_iterator(interaction.tag_tree)]

Here we are mapping all leaves under interaction.tag_tree to a row in the tag_labels table.



The final feature I wanted to cover is conditions. These are really useful if you want to put data in different tables or columns depending on their data type.

Although this might sound unusual, returning to our example this is useful when dealing with tags and scores under the tag_tree path.

Under the mapping declaration for the tag_labels table, there is this line:

label = :iter._value (data_type: string, condition: is_string)

This states that a value should only be put in the table if the value is a string. You'll see a very similar line for the tag_scores table below, which does the same but insists on a float value. The result is that tags (which are text labels) will be stored in the tag_labels table, whereas scores (which are float values) will be stored in the tag_scores table.

That concludes our whirlwind tour of the features. Mapping files give you a comprehensive set of tools to map unstructured data to your relational database tables. With your mapping file created you can start pushing data to your database quickly and easily.

Summing Up...

This was quite a lengthy post, but hopefully it gave you an overview of the possibilities with the new MySQL destination. The key being that it makes it incredibly easy to push data reliably into your database. I've personally thrown away a lot of custom code I'd written to do the same job and now don't think twice about getting data into my databases.

To stay in touch with all the latest developer news please subscribe to our RSS feed.

