
By Richard Caudle

Building Better Machine Learned Classifiers Faster with Active Learning

You might have seen our recent announcement, which covered many things including the launch of VEDO Intent. You're probably aware that DataSift VEDO allows you to run machine-learned classifiers. Unfortunately, creating a high-quality classifier relies on a good quantity of good-quality, manually classified training data (which can be painstaking to produce) and on exploring machine learning algorithms to get the best possible result.

VEDO Intent is a tool that helps dramatically cut the time it takes to create a high-quality classifier. Firstly it reduces the time required to manually classify your training set, secondly it automates the process of exploring machine learning algorithms and hyperparameter optimisation, and finally it generates CSDL to use on our platform. Essentially the tool allows you to build high-quality classifiers with no data science experience.

VEDO Intent will be available to customers soon. In this post we'll take a first look at how the tool uses various data science techniques to help you build training data and generate classifiers.

Employing Active Learning to Help Build Training Sets

Naturally, the accuracy of any machine-learned classifier is highly dependent on the size and variety of the training set you provide. When we first released VEDO we also released an open-source library, based on scikit-learn, that allows you to train linear classifiers which you can run on our platform. To use this library you need to supply a training set of interactions that you've spent time manually assigning classes to - this can take a long time if you want to produce a high-quality classifier.

The goal of VEDO Intent is to greatly reduce the time you need to spend manually classifying interactions for your training set. The tool presents to you the interactions it is least certain of, so you only need to spend time classifying the most significant interactions which will improve your classifier.

When you use the tool you are asked to label interactions from your sample data set. The tool presents a few interactions at a time for you to inspect.

Initially the tool presents interactions randomly until it has enough to generate a first model - at least 20 interactions for each class you've specified.

Next, the tool uses your input so far to train a classification model. The model represents a hyperplane in a multidimensional space. The interactions closest to the hyperplane are the ones the tool is least confident of predicting classes for. By asking you to manually classify these interactions, the tool improves the model's accuracy where it matters most.
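
To make the idea concrete, here is a minimal sketch of this uncertainty-sampling step using scikit-learn. It illustrates the technique rather than VEDO Intent's actual implementation; the function and variable names are our own.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def most_uncertain(labelled_texts, labels, unlabelled_texts, n=10):
    """Return the indices of the n unlabelled texts the model is least sure of."""
    vectorizer = TfidfVectorizer()
    model = LinearSVC().fit(vectorizer.fit_transform(labelled_texts), labels)

    # Distance from the separating hyperplane; values near zero mark the
    # interactions the model is least confident about
    dist = np.abs(model.decision_function(vectorizer.transform(unlabelled_texts)))
    if dist.ndim > 1:  # multi-class: use the smallest per-class margin
        dist = dist.min(axis=1)
    return np.argsort(dist)[:n]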

The tool will continue to ask you to classify interactions to improve the model until you're happy with the predicted accuracy and it has a minimum number of interactions to train a complete classifier.

Improving Results with Grid Search & Cross Validation

When you've completed your manual labelling of interactions you can generate a full classifier which can be run on our platform.

Our open-source library requires you to have a good foundation in data science so that you can test and improve your classifier. VEDO Intent carries out this optimisation for you.

The tool carries out a grid search across feature extraction options, feature selection strategies and different machine learning algorithms (including support vector machines and Naive Bayes), along with their associated hyperparameters.

The tool also carries out cross-validation across the grid search, selecting the best combination of all the potential options to give you the highest-quality classifier.
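
For illustration, the equivalent search in scikit-learn looks roughly like the sketch below. The exact grid VEDO Intent explores isn't published; the parameter choices here are our own, and texts and labels are assumed to hold your labelled training set.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),   # feature extraction
    ('select', SelectKBest(chi2)),  # feature selection
    ('clf', LinearSVC()),           # classifier, swapped out by the grid below
])

param_grid = [
    {'tfidf__ngram_range': [(1, 1), (1, 2)],
     'select__k': [1000, 5000],
     'clf': [LinearSVC()], 'clf__C': [0.1, 1, 10]},
    {'tfidf__ngram_range': [(1, 1), (1, 2)],
     'select__k': [1000, 5000],
     'clf': [MultinomialNB()], 'clf__alpha': [0.1, 1.0]},
]

# Cross-validate every combination and keep the best estimator
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_macro')
search.fit(texts, labels)
print(search.best_params_)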

Using VEDO Intent

When using the tool you carry out the following steps.

1. Uploading sample data

The first step when using the tool is to upload sample data to work with. The data needs to be representative of the data you're looking to classify in future. 

Data is uploaded by dragging a file into the space at the top of the tool. Data needs to be in newline-delimited JSON format; if you've used our platform before, this is how we deliver raw data to you. For instance you could run a Historics task to extract data, then upload the delivered data to this tool.
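
For illustration, each line of the file is a single, complete JSON object (interactions abridged here):

{"interaction": {"id": "1e4a0a...", "type": "twitter", "content": "My flight to Madrid was delayed again"}}
{"interaction": {"id": "1e4a0b...", "type": "twitter", "content": "Great service from the cabin crew today"}}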

2. Deciding on classes

Next the tool will let you specify classes (or labels) you'd like to categorize your data into. 

Deciding on the right classes is very important. You can read more about this in Step 1 of our previous post on machine learning.

3. Labelling interactions

At this point you will be asked to start manually labelling your data. The tool will take you through the process described above which uses active learning techniques. 

For each presented interaction you need to either select a label or skip the interaction. You can also flag interactions to revisit them later if you're unsure what label to assign.

Aside from assigning classes to individual interactions you can also provide hints to the tool by specifying keywords for classes. The tool will use these hints when it next builds a model.

4. Building a classifier

Once you've labelled enough interactions you can train your first classifier. 

The tool will take some time to run this process as it carries out the grid search discussed above. When the grid search is complete the tool will return with the full classification results as well as a CSDL hash you can immediately use on the platform.

5. Reviewing results

The results view allows you to fully assess the quality of the classifier produced. 

The results include standard accuracy measurements (agreement, F1 score, precision, recall) for the classifier, comparing the labels you assigned in the training set to the predicted results. You can also review the classifier's confusion matrix.
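
As a rough guide to how such figures are computed, the same measurements can be reproduced with scikit-learn given true and predicted labels. This is a sketch with made-up labels, not the tool's internals.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# y_true: the labels you assigned; y_pred: the classifier's predictions
y_true = ['travel', 'complaint', 'travel', 'other']
y_pred = ['travel', 'complaint', 'other', 'other']

print(accuracy_score(y_true, y_pred))         # agreement
print(classification_report(y_true, y_pred))  # per-class precision, recall, F1
print(confusion_matrix(y_true, y_pred))       # which classes are confused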

In addition you can download the 'mismatches' file to see where the prediction model succeeded and failed.

Based on the results you could at this point choose to label more data or revisit the labels and features you input to try to improve your classifier.

6. Running the classifier on the platform

Training a classifier results in CSDL code which is compiled for you on the platform, giving you a hash that you can run on the platform by including it in any filter you choose. 

tags "68ca0c21208e02f996d4ca9e3e10d8f8"

return
{
    interaction.content contains_any "airways, flight, luggage, departure"
    AND language.tag == "en" AND language.confidence >= 70
}

You can then make use of these classes in your analysis.


By Richard Caudle

Exploring Keyword Relationships in Social Posts with word2vec

You might have seen our recent announcement covering many things including the launch of our new FORUM data science initiative. We're looking to share more of our experience and research to help innovation in social data analysis.

One of the first things we wanted to share is our work exploring the relationships between keywords and terms in social posts. Our data science team has been researching this area using word2vec - an open-source library that models the relationships and similarity between words.

Using word2vec we've started to create models that are trained from large numbers of historical social posts. We're calling these Keyword Relationship Models and have released our first today. You can explore the model using our Keyword Relationship Explorer tool which we announced in another post.

These models are extremely useful as they reflect how language and keywords are really used by people. You can use such models to improve your filters, classification rules and analysis by making sure you include the key terms, phrases and misspellings used around your topic, and keep track of them as they evolve over time.

In this post we thought we'd share more detail on how we trained the underlying model.

Challenges of working with keywords

Identifying and expanding on keywords and terms is a key challenge when filtering, classifying and analyzing text data. 

The two key challenges you face when working with keywords are:

  • Coverage - it’s not easy to get a comprehensive list of all the keywords representing a concept. You need to consider variations, synonyms, misspellings, new terms, slang and so on.
  • Noise - keywords often have multiple meanings, and this ambiguity can introduce irrelevant data into your classification, leading to false positives in your results.

Keyword Relationship Models let us tackle both challenges: starting from a seed term such as 'coffee' we can easily learn similar terms to include (such as cappuccino, caffeine, cafe), but also learn which words to exclude (such as bean, plant, roasting) - which terms belong in each group depends, of course, on your use case.

How we build Keyword Relationship Models

A Keyword Relationship Model is based upon word2vec, an open-source library that uses deep learning to represent words and n-grams as vectors in a multidimensional space. Essentially word2vec processes raw text and produces a model that represents words in a vector space; within this high-dimensional space, proximity gives words a concept of similarity.

We created the example model through the following steps.

1) Retrieved historic data

We pulled together 2 months of posts in English from our historic archives of social network traffic.

In total this set of interactions was almost 4 billion posts.

2) Pre-processed content

Next we pre-processed content in order to standardize tokens.

Pre-processing steps included:

  • Removed non-ASCII characters
  • Lower-cased all content
  • Separated consecutive hashtags and mentions with a white space
  • Replaced newlines with white spaces
  • Replaced "&amp;" and "&" with " and "
  • Replaced other HTML entities with a white space
  • Removed hyphens within words
  • Removed single quotes within words
  • Replaced single and consecutive punctuation characters with a single white space
  • Collapsed white spaces - consecutive white spaces merged into a single one
  • Kept alphanumeric strings
  • Kept hashtags
  • Replaced values for percentages, currencies, times, dates, mentions, numbers and links with unique placeholder tokens
  • Removed duplicate posts

We then applied tokenization to the cleansed content based on whitespace.

For example if the original piece of content read as follows:

Tom I Love this new #smartphone :)!, the one I found costs 300$ here

After pre-processing we arrived at:

["tom", "i", "love", "this", "new", "#smartphone", "the", "one", "i", "found", "costs", "PRICE", "here"]
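
A heavily simplified sketch of this kind of cleaning in Python, covering only a handful of the steps above (the regular expressions are our own illustration, not our production pipeline):

import re

def preprocess(text):
    text = text.encode('ascii', 'ignore').decode()  # remove non-ASCII characters
    text = text.lower()                             # lower-case all content
    text = re.sub(r'https?://\S+', ' LINK ', text)  # links -> placeholder token
    text = re.sub(r'\$\d+|\d+\$', ' PRICE ', text)  # prices -> placeholder token
    text = re.sub(r'[^\w# ]+', ' ', text)           # punctuation -> white space
    text = re.sub(r' +', ' ', text).strip()         # collapse white spaces
    return text.split(' ')                          # tokenize on white space

print(preprocess("Tom I Love this new #smartphone :)!, the one I found costs 300$ here"))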

3) Processed dataset using word2vec

After these steps the corpus was processed using word2vec (the skip-gram version, with 400 dimensions) to obtain the final model. Terms with fewer than 40 occurrences were excluded; all the others - a total of over 3 million - are part of the model, including stopwords and placeholder tokens.
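
Using the gensim library, training along these lines looks roughly like this (a sketch; the parameter names follow the gensim releases of the time, and sentences is assumed to be an iterable of token lists as produced by the pre-processing step):

from gensim.models import Word2Vec

model = Word2Vec(sentences,
                 sg=1,          # the skip-gram version
                 size=400,      # 400 dimensions (vector_size in recent gensim)
                 min_count=40)  # drop terms with fewer than 40 occurrences
model.save('keyword_relationship.model')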

4) Queried word2vec model

Once the model was trained we could then query the model for similar words based on a seed term. For example in Python:

from gensim.models import word2vec

model = word2vec.Word2Vec.load_word2vec_format('<path to model file>', binary=True)
# The ten terms most similar to the seed term '#nfl'
similar_words = model.most_similar(positive=['#nfl'], topn=10)
print(similar_words)

This is how our explorer tool was built.

What's next?

You can start exploring our first model straight away using our Keyword Relationship Explorer tool.

Watch this space as we'll be releasing more models in future and giving you more ways to use them in your own projects.

Why not register your interest in our FORUM initiative so you learn more about bringing data science to your projects?

By Richard Caudle

How To Update Filters On-The-Fly And Build Dynamic Social Solutions

It would be easy if the world around us was static, but in practice things are always changing. Nowhere is this truer than in the world of social networks; users are constantly following new friends and expressing new thoughts. The filter you wrote yesterday is probably already out-of-date! 
On the DataSift platform you can update your filters on the fly via the API and avoid downtime for your application. This not only allows you to adapt to real-world changing scenarios, but in fact allows you to build much more powerful, dynamic social solutions. In this post I'll show you how this can be done.

Embracing Change

If you've ever built a software solution you'll know that things aren't quite as simple as you'd hope. The real world is always changing. 
For example imagine you're tracking conversation around a news story. You build a simple filter which looks for the terms and personalities involved in the story. This works great, but a few hours later the story has evolved. As stories evolve it is inevitable that the terms people are using to discuss it change. You'll need to react to this without missing any conversations.
Or maybe you've built an awesome social app that allows users to input their interests, and you're creating a filter from that input. The next day the user updates their interests. You'll need to update your filter to match the new interests without interrupting your service to the user.
A well-designed solution takes change in its stride.

Broad Brush Overview

Ok, so we want to build our wonderfully dynamic, super-duper-flexible social solution. What does this mean in practice? On the DataSift side of things we want to be able to update our stream definitions (filtering and tagging) on the fly, delivering data to the same destination, without missing any data.
Before we get to the deeper details, the broad principles are:
  • Create V1 of our stream: Build V1 of our stream definition, for instance from user input
  • Start consuming V1: Compile and stream V1 of our stream as usual via the API
  • Create V2 of our stream: Something has changed! Build V2 of our stream to adapt.
  • Start consuming V2: In parallel with streaming V1, we'll start streaming V2 of our stream.
  • Stop consuming V1: When we're happy V2 is streaming nicely, we'll stop streaming V1.
Essentially to avoid downtime (or missing data) we have a brief period where we're streaming both versions in parallel. Note we will need to handle de-duplication during this brief period. 

Let's Do It

Ok, so that's the principles explained. Let's see this in practice.
I wrote a stream last week to track conversations around popular games. Let's use this as an example. 
(For the complete example code take a look at this GIST.)

Create Stream V1

Version 1 of our stream will look for mentions of five popular games: 2048, Farmville 2, Swamp Attack, Trials Frontier and Don't Step The White Tile.
Note this is a simple illustrative example. In practice you might want to look for mentions by inspecting links being shared for instance.

Start Consuming V1

Now that we have our stream defined, we can compile the definition and start consuming data. In this example we'll use the Pull destination to get resulting data.
For this example I'll use the Python helper library.

Create Stream V2

We're now happily consuming data. But wait! There's a new game that's entered the charts that we must track. The game is Clash of the Clans, and it must be added to our filter.
It's easy to imagine you could generate such a filter from an API which gives you the latest game charts.
The updated filter looks as follows (notice the use of the contains_near operator to tolerate missing words from the title):
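
As a sketch of that idea, generating the filter from a list of titles is simple string building (the helper below and its contains_near distances are our own illustration):

def build_game_filter(games):
    """Build CSDL matching mentions of any of the given game titles."""
    clauses = []
    for title in games:
        words = title.lower().replace("'", '').split()
        if len(words) > 1:
            # contains_near tolerates words appearing between the title's terms
            clauses.append('interaction.content contains_near "%s:%d"'
                           % (','.join(words), len(words)))
        else:
            clauses.append('interaction.content contains "%s"' % words[0])
    return '\nOR '.join(clauses)

games = ['2048', 'Farmville 2', 'Swamp Attack', 'Trials Frontier',
         "Don't Step The White Tile", 'Clash of the Clans']
print(build_game_filter(games))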

Start Consuming V2

The next step is to start streaming V2 of the stream in parallel with V1. 

De-duplicating Data

We now have two streams running in parallel. Until we stop stream 1 there's a good chance that the same interaction might be received on both streams, so it's important we de-duplicate the data received. 
How you go about this depends entirely on the solution being built. Whichever way you choose, you can use the interaction's id property as a unique identifier. One way would be to enforce a unique key in a database (if this is where your data is being stored); another simple way would be to keep a rolling in-memory list of IDs, say for the last 5 minutes. Of course this decision depends on the volume of data you expect and the scale of your solution.
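
As a simple illustration of the in-memory approach (a sketch; tune the window to your expected data volumes):

import time

class RecentIds(object):
    """A rolling set of interaction IDs seen within the last `window` seconds."""
    def __init__(self, window=300):
        self.window = window
        self.seen = {}  # interaction id -> time last seen

    def is_duplicate(self, interaction_id):
        now = time.time()
        # Evict anything older than the rolling window
        self.seen = {i: t for i, t in self.seen.items() if now - t < self.window}
        duplicate = interaction_id in self.seen
        self.seen[interaction_id] = now
        return duplicate

recent = RecentIds()
# For each interaction received from either stream:
#     if not recent.is_duplicate(interaction['interaction']['id']): process it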

Stop Consuming V1

Now that we have started streaming V2 of the stream we can stop consuming data from V1.
When you start the second stream it will start immediately. However, if you want to be doubly sure that you do not miss any data we recommend that you wait for the first interaction from stream V2 to be received before stopping stream V1. Note that the platform will charge you for DPUs consumed and data received for each stream individually.

In Conclusion

And so ends my quick tour. I hope this post illustrates how you can switch to new stream definitions on the fly. This capability is likely to be key to real-world solutions you create, and hopefully inspires you to create some truly responsive applications.
For the complete example code take a look at this GIST.
To stay in touch with all the latest developer news please subscribe to our RSS feed, or follow us on Twitter at @DataSiftDev.


By Jason

Facebook Pages Managed Source Enhancements

Taking into account some great customer feedback, on May 1st, 2014 we released a number of minor changes to our Facebook Pages Managed Source. 

Potential Breaking Changes

Facebook Page Like and Comment Counts have been Deprecated

The facebook_page.likes_count and facebook_page.comment_count fields have been deprecated from DataSift's output. We found this data became outdated quickly; a better practice for displaying counts of likes and comments in your application is to count like and comment interactions as you receive them. 

Format for facebook_page.message_tags has Changed

facebook_page.message_tags fields were previously in two different formats, depending on whether they came from comments or posts. This change ensures that all message_tags are provided in a consistent format: as a list of objects. An example of the new consistent format can be seen below:
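
For instance (the field names follow Facebook's message_tags structure; the values here are invented for illustration):

[
    {"id": "12345678", "name": "DataSift", "type": "page", "offset": 7, "length": 8}
]
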
Please ensure that if your application utilizes these fields, it can handle them as a list of objects.

New Output Fields

We have introduced a number of new output fields in interactions from the Facebook Pages Managed Source. You will be able to filter on many of these fields.

New “Page Like” Interactions

By popular request, we have introduced a new interaction with the subtype “page_like” for anonymous page-level likes.
This should now allow you to track the number of likes for a given page over time.
This subtype has two fields, `current_likes` and `likes_delta`. The first represents the current number of likes for a Facebook Page at the time of retrieval; the second represents the difference from the previously retrieved value. We only generate interactions of this type if `likes_delta` is not zero. Also note that `likes_delta` can be negative, when the number of unlikes is greater than the number of likes between two retrievals.
This interaction type should allow you to visualize page likes as a time series. In addition, filters on `likes_delta` could be used to detect trending pages.
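
For example, here is a sketch of building that time series from received interactions (the exact field paths are assumptions based on the fields described above):

def likes_time_series(interactions):
    """Collect (timestamp, current_likes) points from page_like interactions."""
    points = []
    for i in interactions:
        fb = i.get('facebook_page', {})
        if fb.get('subtype') == 'page_like':
            points.append((i['interaction']['created_at'], fb['current_likes']))
    return points  # plot as a time series, or store for later analysis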

'from' Fields now Include a Username Where Available

Where it is provided to us, .from fields in Facebook Pages interactions now contain a .username field.
Please note that in some cases, this field is not returned by Facebook.

New Comment 'Parent' Field

Objects of type comment include an optional .parent object, which contains a reference to a parent comment. The object structure is self-similar.
This will allow you to tell whether comments are nested or not, and associate them with a parent comment if so.

New 'From' Field in Post Objects

Objects of type comment/like include an additional .from field in their .post context object, which contains information about the author of the post they are referring to.

New CSDL Targets

We have introduced 12 new Facebook Pages targets. This includes targets to allow you to filter on the likes count of a page, the parent post being commented on, a Facebook user's username, and more. These new targets can all be found in our Facebook Pages targets documentation.

Other Changes

New Notifications for Access Token Issues

If all of the tokens for a given source have permanent errors, the source will become "disabled" and you will receive a notification. You should then update the source with new tokens and restart it.
Note that every error will also be present in the /source/log for that Managed Source.

Summary of Changes

  • facebook_page.likes_count and facebook_page.comment_count fields have been deprecated from DataSift's output
  • The facebook_page.message_tags output field format has changed to a list of objects
  • We have introduced a new interaction with the subtype “page_like” for anonymous page-level likes
  • .from fields in Facebook Pages interactions now contain a .username field where available
  • Comment interactions now include a parent object, referencing the parent comment
  • We have introduced a .from field to Facebook Pages .post objects, containing information about the post author
  • We have introduced a number of new CSDL targets for Facebook Pages
  • You will now receive better notifications about issues with your Facebook Access Tokens

By Richard Caudle

Platform Updates - Content Age Filtering, Larger Compressed Data Deliveries

This is a quick post to update you on some changes we've introduced recently to help you work with our platform and make your life a little easier.

Filtering On Content Age

We aim to deliver data to you as soon as we possibly can, but for some sources there can be a delay, outside of our control, between publication to the web and our delivery.
In most cases this does not have an impact, but in some situations (perhaps you only want to display extremely fresh content to a user) this is an issue.
For these sources we have introduced a new target, .age, which allows you to specify the maximum time since the content was posted. For instance, if you want to filter on blog posts mentioning 'DataSift' and make sure that you only receive posts published within the last hour:
blog.content contains "DataSift" AND blog.age < 3600
This new target applies to the Blog, Board, DailyMotion, IMDB, Reddit, Topix, Video and YouTube sources.

Push Destinations - New Payload Options

Many of our customers tell us they can take much larger data volumes from our system. We aim to please, so we have introduced options to help you get more data, more quickly.

Increased Payload Sizes

To enable you to receive data more quickly from our push connectors, we have increased the maximum delivery sizes for many of our destinations; see the table below for the new limits.

Compression Support

As the data we deliver to you is text, compression can be used to greatly reduce the size of files we deliver, making transport far more efficient. Although compression rates do vary, we are typically seeing an 80% reduction in file size with this option enabled.
We have introduced GZip and ZLib compression to our most popular destinations. You can enable compression on a destination by selecting the option in your dashboard, or by specifying the output_param.compression parameter through the API.
When data is delivered you can tell it has been compressed in two ways:
  • HTTP destination: the HTTP header 'X-DataSift-Compression' will have the value none, zlib or gzip as appropriate
  • S3, SFTP destinations: files delivered to your destination will have an additional '.gz' extension if they have been compressed, for example DataSift-xxxxxxxxxxxxxxxxxxx-yyyyyyy.json.gz
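
Handling a compressed HTTP delivery then looks something like this (a sketch; the header name and values are as described above):

import gzip
import zlib

def decompress_payload(headers, body):
    """Decompress a push delivery based on its X-DataSift-Compression header."""
    compression = headers.get('X-DataSift-Compression', 'none')
    if compression == 'gzip':
        return gzip.decompress(body)
    if compression == 'zlib':
        return zlib.decompress(body)
    return body  # 'none': the payload is plain JSON as usual
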
Here's a summary of our current push destinations' support for these features.
Destination      Maximum Payload Size    Compression Support
HTTP             200 MB                  GZip, ZLib
S3               200 MB                  GZip
CouchDB          50 MB                   None
ElasticSearch    200 MB                  None
FTP              200 MB                  None
MongoDB          50 MB                   None
MySQL            50 MB                   None
PostgreSQL       50 MB                   None
Pull             50 MB                   None
Redis            50 MB                   None
Splunk           50 MB                   None

Stay Up-To-Date

To stay in touch with all the latest developer news please subscribe to our RSS feed, or follow us on Twitter at @DataSiftDev.