Blog posts in Engineering

Richard Caudle's picture

How To Update Filters On-The-Fly And Build Dynamic Social Solutions

It would be easy if the world around us was static, but in practice things are always changing. Nowhere is this truer than in the world of social networks; users are constantly following new friends and expressing new thoughts. The filter you wrote yesterday is probably already out-of-date! 
On the DataSift platform you can update your filters on the fly via the API and avoid downtime for your application. This not only allows you to adapt to real-world changing scenarios, but in fact allows you to build much more powerful, dynamic social solutions. In this post I'll show you how this can be done.

Embracing Change

If you've ever built a software solution you'll know that things aren't quite as simple as you'd hope. The real world is always changing. 
For example imagine you're tracking conversation around a news story. You build a simple filter which looks for the terms and personalities involved in the story. This works great, but a few hours later the story has evolved. As stories evolve it is inevitable that the terms people are using to discuss it change. You'll need to react to this without missing any conversations.
Or, maybe you've built an awesome social app that allows users to input their interests and you're creating a filter from input. The next day the user updates their interests. You'll need to update your filter to the new interests without interrupting your service to the user.
A well-designed solution takes change in it's stride. 

Broad Brush Overview

Ok, so we want to build our wonderfully dynamic, super-duper-flexible social solution. What does this mean in practice? On the DataSift side of things we want to be able to update our streams (filtering & tagging) definitions on-the-fly, delivering data to the same destination, without missing any data.
Before we get to the deeper details, the broad principles are:
  • Create V1 of our stream: Build V1 of our stream definition, for instance from user input
  • Start consuming V1: Compile and stream V1 of our stream as usual via the API
  • Create V2 of our stream: Something has changed! Build V2 of our stream to adapt.
  • Start consuming V2: In parallel with streaming V1, we'll start streaming V2 of our stream.
  • Stop consuming V1: When we're happy V2 is streaming nicely, we'll stop streaming V1.
Essentially to avoid downtime (or missing data) we have a brief period where we're streaming both versions in parallel. Note we will need to handle de-duplication during this brief period. 

Let's Do It

Ok, so that's the principles explained. Let's see this in practice.
I wrote a stream last week to track conversations around popular games. Let's use this as an example. 
(For the complete example code take a look at this GIST.)

Create Stream V1

Version 1 of our stream will look for mentions of five popular games; 2048, Farmville 2, Swamp Attack, Trials Frontier and Don't Step The White Tile.
Note this is a simple illustrative example. In practice you might want to look for mentions by inspecting links being shared for instance.

Start Consuming V1

Now that we have our stream defined, we can compile the definition and start consuming data. In this example we'll use the Pull destination to get resulting data.
For this example I'll use the Python helper library.

Create Stream V2

We're now happily consuming data. But wait! There's a new game that's entered the charts that we must track. The game is Clash of the Clans, and it must be added to our filter.
It's easy to imagine you could generate such a filter from an API which gives you the latest game charts.
The updated filter looks as follows (notice the use of the contains_near operator to tolerate missing words from the title):

Start Consuming V2

The next step is to start streaming V2 of the stream in parallel with V1. 

De-duplicating Data

We now have two streams running in parallel. Until we stop stream 1 there's a good chance that the same interaction might be received on both streams, so it's important we de-duplicate the data received. 
How you go about this completely depends on the solution being built. Whatever way you choose you can use the property of the interaction as a unique identifier. One way would be to have a unique key in a database (if this is where your data is being stored),  another simple way would to have a rolling in-memory list of IDs say for the last 5 minutes. Of course this decision depends on the volume of data you expect and the scale of your solution.

Stop Consuming V1

Now that  we have started streaming V2 of the stream we can stop consuming data from V1. 
When you start the second stream it will start immediately. However, if you want to be doubly sure that you do not miss any data we recommend that you wait for the first interaction from stream V2 to be received before stopping stream V1. Note that the platform will charge you for DPUs consumed and data received for each stream individually.

In Conclusion

And so ends my quick tour. I hope this post illustrates how you can switch to new stream definitions on the fly. This capability is likely to be key to real-world solutions you create, and hopefully inspires you to create some truly responsive applications.
For the complete example code take a look at this GIST.
To stay in touch with all the latest developer news please subscribe to our RSS feed at
And, or follow us on Twitter at @DataSiftDev.


Jason's picture

Facebook Pages Managed Source Enhancements

Taking into account some great customer feedback, on May 1st, 2014 we released a number of minor changes to our Facebook Pages Managed Source. 

Potential Breaking Changes

Facebook Page Like and Comment Counts have been Deprecated

The facebook_page.likes_count and facebook_page.comment_count fields have been deprecated from DataSift's output. We found this data became outdated quickly; a better practice for displaying counts of likes and comments in your application is to count like and comment interactions as you receive them. 

Format for facebook_page.message_tags has Changed

facebook_page.message_tags fields were previously in two different formats dependant on whether they came from comments, or posts. This change ensures that all message_tags are provided in a consistent format; as a list of objects. An example of the new consistent format can be seen below:
Please ensure that if your application utilizes these fields, it can handle them as a list of objects.

New Output Fields

We have introduced a number of new output fields in interactions from the Facebook Pages Managed Source. You will be able to filter on many of these fields.

New “Page Like” Interactions

By popular request, we have introduced a new interaction with the subtype “page_like” for anonymous page-level likes.
This should now allow you to track the number of likes for a given page over time.
This subtype has two fields, `current_likes` and `likes_delta`. The first represents the current number of likes for a Facebook Page at the time of retrieval. The second represents the difference with the previously retrieved value. We only generate interactions of this type if `likes_delta` is not zero.  Also note that `likes_delta` can be negative, when the number of unlikes is greater than the number of likes between two retrievals.
This interaction type should allow visualizing page likes as a time series. In addition, filters on `likes_delta` could be used to detect trending pages.

‘from' Fields now Include a Username Where Available

Where it is provided to us, .from fields in Facebook Pages interactions now contain a .username field.
Please note that in some cases, this field is not returned by Facebook.

New Comment ‘Parent' Field

Objects of type comment include an optional .parent object, which contains a reference to a parent comment. The object structure is self-similar.
This will allow you to tell whether comments are nested or not, and associate them with a parent comment if so.

New ‘From’ Field in Post Objects

Objects of type comment/like include an additional .from field in their .post context object, which contains information about the author of the post they are referring to.

New CSDL Targets

We have introduced 12 new Facebook Pages targets. This includes targets to allow you to filter on the likes count of a page, the parent post being commented on, a Facebook user's username, and more. These new targets can all be found in our Facebook Pages targets documentation.

Other Changes

New Notifications for Access Token Issues

If a case occurs where all tokens for a given source have permanent errors, the source will become “disabled", and you will receive a notification. You should then update the source with new tokens, and restart it. 
Note that every error will also be present in the /source/log for that Managed Source.

Summary of Changes

  • facebook_page.likes_count and facebook_page.comment_count fields will be deprecated from DataSift's output
  • The facebook_page.message_tags output field format is changing to become a list of objects
  • We are introducing a new interaction with the subtype “page_like” for anonymous page-level likes
  • .from fields in Facebook Pages interactions now contain a .username field where available
  • Comment interactions will now include a parent object, referencing the parent comment
  • We are introducing a .from field to Facebook Pages .post objects, containing information about the post author
  • We are introducing a number of new CSDL targets for Facebook Pages
  • You will receive better notifications about issues with your Facebook Access Tokens
Richard Caudle's picture

Platform Updates - Content Age Filtering, Larger Compressed Data Deliveries

This is a quick post to update you on some changes we've introduced recently to help you work with our platform and make your life a little easier.

Filtering On Content Age

We aim to deliver you data as soon as we possibly can, but for some sources there can be a delay between publication to the web and our delivery which is out of our control.
In most cases this does not have an impact, but in some situations (perhaps you only want to display extremely fresh content to a user) this is an issue.
For these sources we have introduced a new target, .age, which allows you to specify the maximum time since the content was posted. For instance if you want to filter on blog posts mentioning 'DataSift', making sure that you only receive posts published within the last hour:
blog.content contains "DataSift" AND blog.age < 3600
This new target applies to the Blog, Board, DailyMotion, IMDB, Reddit, Topix, Video and YouTube sources.

Push Destinations - New Payload Options

Many of our customers are telling us they can take much larger data volumes from our system. We aim to please, so have introduced options to help you get more data quicker.

Increased Payload Sizes

To enable you to receive more data quicker from our push connectors, we have upped the maximum delivery sizes for many of our destinations. See the table below for the new maximum delivery sizes.

Compression Support

As the data we deliver to you is text, compression can be used to greatly reduce the size of files we deliver, making transport far more efficient. Although compression rates do vary, we are typically seeing an 80% reduction in file size with this option enabled.
We have introduced GZip and ZLib compression to our most popular destinations. You can enable compression on a destination by selecting the option in your dashboard, or by specifying the output_param.compression parameter through the API.
When data is delivered you can tell it has been compressed in two ways:
HTTP destination: The HTTP header 'X-DataSift-Compression' will have the value none, zlib or gzip as appropriate
S3, SFTP destinations: Files delivered to your destination will have an addition '.gz' extension is they have been compressed, for example DataSift-xxxxxxxxxxxxxxxxxxx-yyyyyyy.json.gz
Here's a summary of our current push destinations support for these features.
Destination Maximum Payload Size Compression Support
HTTP 200 MB GZip, ZLib
S3 200 MB GZip
CouchDB 50 MB  
ElasticSearch 200 MB  
FTP 200 MB  
MongoDB 50 MB  
MySQL 50 MB  
PostgreSQL 50 MB  
Pull 50 MB  
Redis 50 MB  
Splunk 50 MB  

Stay Up-To-Date

To stay in touch with all the latest developer news please subscribe to our RSS feed at
And, or follow us on Twitter at @DataSiftDev
Hiroaki Watanabe's picture

Using Japanese Tokenization To Generate More Accurate Insight

At the heart of DataSift’s social data platform is a filtering engine that allows companies to target the text, content and conversations that they want to extract for analysis. We are proud to announce that we have expanded our platform to include Japanese, one of the fastest growing international markets for Twitter.

Principles Of Tokenization

This provides new challenges for how we can accurately filter to identify and extract relevant content and conversations. The main challenge to overcome is that Japanese, unlike Western languages, is written without the word boundaries (i.e. whitespace).
Imagine tackling this challenge in English to create a meaningful sentence from the sequence of characters from the first sentence of Lewis Carroll’s "Alice's Adventures in Wonderland".
asreading,butithadnopicturesorconversationsinit,'and whatistheuseof
You may find it easy to complete this task, but two important essences of Natural Language Processing (NLP) are involved in this exercise. From an algorithmic point of view:
  • Once we have options for where word-boundaries sit (Ali? Alice?, Alicew?), the number of possibilities could exponentially increase in the worst case and
  • Numerical score may help to rank the possible outcomes.
Let us see how these two points are relevant to Japanese Tweets. The following five characters form a popular sentence that can be tokenized into the two blocks of characters with meaning:
まじ面白い    == (tokenization) ==>     まじ  面白い
in which a white space is inserted between “じ” and “面”. In NLP, this process is called “tokenization” or "word chunking".
The meaning of this sentence is “seriously (まじ) interesting (面白い)". The first two characters, まじ, represent a popular slang often attached to sentimental words. Although “まじ” is a good indicator for sentiment, we can find them in other common words  (e.g., おまじない [good luck charm], すさまじい[terrible])) where the meaning of “まじ” (seriously) is no longer present. 
This simple Japanese case study highlights that:
  • You cannot apply a simple string searching algorithm for searching keywords (i.e. search for the sub-string (まじ) within the text as it can easily introduce errors )
  • The decision whether or not to tokenize can be affected by surrounding characters.


Approaches For Japanese Tokenization

In industry, there are two main approaches to solve this tokenization problem: (a) Morphological analysis and (b) N-gram. The N-gram approach generates blocks of characters systematically "without" considering the meanings from training examples and generates numerical scores by counting the frequency of each block. Because of this brute-force approach, the processing speed could be slow with large memory usage, however, it is strong for handling new “unknown words” since we do not need any dictionary.
In Datasift's platform, we implemented the Morphological approach for Japanese tokenization since it has advantages in terms of “speed” and “robustness for noise”. One drawback of the standard Morphological approach is its difficulty for handling unknown “new words”. Imagine the case where you see an unknown sequence of characters in the ‘Alice’ example.
Our software engineers have provided a great solution for this “new words” issue by twisting the standard Morphological approach. Thanks to our new algorithm, we successfully provide Japanese language service accurately for noisy Japanese Tweets without updating dictionary.

Putting It Into Practice: Tips For Japanese CSDL

If you are familiar with our filtering language (CSDL), you can apply our new Japanese tokenizer by simply adding a new modifier, [language(ja)], as follows:
interaction.content contains [language(ja)] "まじ" and
interaction.content contains [language(ja)] "欲しい"
Note that “欲しい” is “want” in English.
You can mix Japanese and other languages as well:
interaction.content contains [language(ja)] "ソニー" or
interaction.content contains "SONY"
Note that the keyword “ソニー” is analyzed using our Japanese tokenizer whereas our standard tokenizer is applied for the keyword “SONY” in this example.
Tagging (our rules-based classifier) also works for Japanese:
Note that the first two lines contains sentiments: “うれしい(happy)”, “楽しい(fun)”, “悲しい(sad)” and “楽しくない(sad)”.
Currently we support two main operators, “contains” and “contains_any”, for the [language(ja)] modifier. Our “substr” operator also works for Japanese although it may cause some noises as I explained above:
interaction.content substr "まじ"

Advanced Filtering - Stemming

An advanced tip to increase the number of filtering results is to consider the “inflection” of the Japanese language. Since Japanese is an agglutinative language, stems of words appear more often in Tweets. Our Morphological approach allows us to use “stem” as a keyword.
For example, the following CSDL could find tweets with “欲しい”, “欲しすぎて”, “欲しー”. :
interaction.content contains [language(ja)] "欲し"
It’s worth mentioning that there is no perfect solution for tokenization at the moment; N-gram approach has weakness for noise whereas the Morphological approach may not understand some of new words. If you find that a filter produces no output, you may try our “substr” operator which is our implementation of “string search algorithm”.
The above tagging example can be converted in a version that uses “substr” as follows:

Working Example For Japanese Geo-Extraction

Extracting users’ geological information is an interesting application. The following CSDL allows you to tag your filtered results using geo information, Tokyo (東京).
Note that “まじ” is used as a keyword for filtering in this example.

In Summary

  • Tokenization is an important technique to extract correct signals from East Asian languages.
  • N-gram and Morphological analysis are the two main techniques available.
  • Datasift has implemented a noise-tolerant Morphological approach for Japanese with some extensions to handle new words accurately.
  • By adding our new modifier [language(ja)] in CSDL, you can activate our Japanese tokenization engine in our distributed system.
  • We can mix Japanese and other languages within a CSDL filter to realize unified and centralized data analysis. 
Richard Caudle's picture

Introducing The MySQL Destination - Integrate Data Effortlessly Into Your Enterprise Solution

One key challenge for developer creating a solution is integrating, often many, data sources. DataSift destinations take away this headache, especially the recently released MySQL destination.

The MySQL destination allows you to map and flatten unstructured data to your database schema, avoid writing needless custom integration code and handles realtime delivery challenges such as dropped connections so you don't have to.

Relieving Integration Challenges

The DataSift platform offers many awesome places to push data, but often let's face it, we all like to see data in a good old fashioned database. Relational databases such as MySQL are still the backbone of enterprise solutions.

Receiving a stream of unstructured data, structuring it, then pushing the data into a relational database can cause a number of headaches. The new MySQL destination makes the job straight forward so that you can concentrate on getting maximum value out of your data. It provides the following features:

  • Guaranteed delivery - Data delivery is buffered and caters for dropped connections and delivery failure
  • Delivery control - Data delivery can be paused and resumed as you require under your control
  • Data mapping - Specify precisely how you want fields (within each JSON object) to be mapped to your MySQL schema

These features combined make pushing data from DataSift into a MySQL database extremely easy.

The MySQL Destination

As with any other type of destination, the easiest way to get started is to go to the Destinations page. Choose to add a new MySQL destination to your account.

Note that the MySQL destination is only currently available to enterprise customers. Contact your sales representative or account manager if you do not see the destination listed in your account.


To set up the destination you need to enter a name, the host and port of your MySQL server, the destination database schema and authentication details.

You also need to provide a mappings file. This file tells the destination which fields within the JSON data you would like to be mapped to tables in your database schema. More details on this in a minute.

It's worth using the Test Connection button as this will check that your MySQL server is accessible to our platform, the database exists, the security credentials are valid and that the mapping file is valid.

Note that you can also create the destination via our API. This process is documented here.

Mapping Data To A Schema

The basic connection details above are self-explanatory, but the mapping file definitely needs a little more explanation. There are many things to consider when mapping unstructured data to a relational set of tables.

Let me take you through an example schema and mapping file to help clarify the process. These have been designed to work with Twitter data. The two files I'll be discussing are:

MySQL Schema

In the example schema the following tables are included, which give us a structure to store the tweets.

  • interaction - Top-level properties of each interaction / tweet. All tables below reference interactions in this table.
  • hashtags - Hashtags mentioned for each interaction
  • mentions - Twitter mentions for each interaction
  • links - Links for each interaction
  • tag_labels - VEDO tags for each interaction
  • tag_scores - VEDO scores for each interaction

The example schema is quite exhaustive, please don't be put off! You can more than likely use a subset of fields and tables to store the data you need for your solution. You might also choose to write views that transform data from these tables to fit your application.

Now's not the time to cover MySQL syntax, I'm sure if you're reading this post you'll be used to creating schemas. Instead I'll move on to the mapping file, which is where the magic lies.

Mapping File

The mapping file allows you to specify what tables, columns and data types the raw data should be mapped to in your schema. I can't cover every possibility in one post, so for full details see our technical documentation pages. To give you a good idea though, I'll pick out some significant lines from the example mapping file.

Let's pretend we have the following interaction in JSON (I removed many fields for brevity):


Tables, Datatypes & Transforms

The first line tells the processor you want to map the following columns of the 'interaction' table to fields in the JSON structure.


The next line, tells the processor to map the path to the interaction_id column of the table:

interaction_id =

Skipping a couple of lines, the following tells the processor to map interaction.created_at to the created_at column. You'll notice though that we have additional data_type and transform clauses.

created_at = interaction.created_at (data_type: datetime, transform: datetime)

If you don't explicitly specify a data_type then the processor will attempt to decide the best type for itself by inspecting the data value. In the majority of cases this is perfectly ok, but in this line we ensure that the type is a datetime.

The transform clause gives you access to some useful functions. Here we are using the datetime function to cast the string value in the data to a valid datetime value.

Later on for the same table you'll see this line which uses a different transform function:

is_retweet = (data_type: integer, transform: exists)

Here the function will return true if the JSON object has this path present, otherwise it will return false.



Now let's move down to the hashtags table mapping. You'll see this as the first line:

[hashtags :iter = list_iterator(interaction.hashtags)]

This table mapping uses an iterator to map the data from an array to rows in a table. The line specifies that any items within the interaction.hashtags array should each be mapped to one row of the hashtags table. For our example interaction, a row would be created for each of 'social' and 'marketing'.

Note that we can refer to the current item in the iterator by using the :iter variable we set in the table mapping declaration:

hashtag = :iter._value

Here _value is a reserved property representing the value of the item in the array. You can also access _path which is the relative path within the object of the value. If we were using a different type of iterator, for example over an array of objects we could reference properties of the current object, such as

There are a number of iterators you can use to handle different types of data structure:

  • list_iterator - Maps an array of values at the given path to rows of a database table.
  • objectlist_iterator - Like list_iterator, but instead is used to iterate over an array of objects, not simple values.
  • path_iterator - Flattens all properties inside an object, and it's sub objects, to give you a complete list of properties in the structure.
  • leaf_iterator - Like path_iterator, however instead of flattening object properties, instead flattens any values in arrays within the structure to one complete list.
  • parallel_iterator - Given a path in the JSON object, this iterator takes all the arrays which are children and maps the items at each index to a row in the table. This is particularly useful for working with links.

The iterators are powerful and allow you to take deep JSON structures and flatten them to table rows. Please check out the documentation for each iterator for a concrete example.

As a further example, the following line specifies mapping for VEDO tags that appear in the tag_tree property of the interaction:

[tag_labels :iter = leaf_iterator(interaction.tag_tree)]

Here we are mapping all leaves under interaction.tag_tree to a row in the tag_labels table.



The final feature I wanted to cover is conditions. These are really useful if you want to put data in different tables or columns depending on their data type.

Although this might sound unusual, returning to our example this is useful when dealing with tags and scores under the tag_tree path.

Under the mapping declaration for the tag_labels table, there is this line:

label = :iter._value (data_type: string, condition: is_string)

This states that a value should only be put in the table if the value is a string. You'll see a very similar line for the tag_scores table below, which does the same but insists on a float value. The result is that tags (which are text labels) will be stored in the tag_labels table, whereas scores (which are float values) will be stored in the tag_scores table.

That concludes our whirlwind tour of the features. Mapping files give you a comprehensive set of tools to map unstructured data to your relational database tables. With your mapping file created you can start pushing data to your database quickly and easily.

Summing Up...

This was quite a lengthy post, but hopefully it gave you an overview of the possibilities with the new MySQL destination. The key being that it makes it incredibly easy to push data reliably into your database. I've personally thrown away a lot of custom code I'd written to do the same job and now don't think twice about getting data into my databases.

To stay in touch with all the latest developer news please subscribe to our RSS feed at