Richard Caudle
Updated on Wednesday, 29 January, 2014 - 11:26

Recently I've covered the tagging and scoring features of DataSift VEDO. My post on scoring gave a top level overview and a simple example, but might have left you hungry for something a little more meaty. In this post I’ll take you through how we’ve started to build linear classifiers internally using machine learning techniques to tackle more complex classification projects. This post will explain what a linear classifier is, how it can help you and give you a method to get you started building your own.

What Is A Linear Classifier?

Until now you’re likely to have relied on boolean expressions to categorise your social data based on looking at data by eye. A linear classifier, made possible by our new scoring features, allows you to categorise data based on machine learned characteristics over much larger data sets.

A linear classifier is a machine learned method for categorising data. Machine learning is used to identify key characteristics in a training set of data and give each characteristic a weighting to reflect its influence. When the resulting classifier is run against new data, each piece of data is given a score for how likely it is to belong in each category. The category with the highest score is considered the most appropriate category for the new piece of data.
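To make the scoring idea concrete, here is a minimal Python sketch. The weights and category names are invented for illustration; in a real classifier they would come out of the learning step rather than being written by hand.

```python
# Hypothetical weights: each category maps features (words or phrases)
# to a weight that a learning algorithm would normally produce.
WEIGHTS = {
    "rant": {"refused": 1.5, "violated": 2.0, "thanks": -1.0},
    "rave": {"thanks": 2.0, "saved": 1.0, "refused": -1.5},
}


def score(text, weights):
    """Return a score per category: the sum of weights of features present."""
    content = text.lower()
    return {
        category: sum(w for feature, w in features.items() if feature in content)
        for category, features in weights.items()
    }


def classify(text, weights):
    """Pick the category with the highest score."""
    scores = score(text, weights)
    return max(scores, key=scores.get)


print(classify("Saved by the pilot, thanks", WEIGHTS))  # rave
```

Each feature either matches or it doesn't, and the category score is just a weighted sum, which is why this kind of classifier is cheap to run at high volume.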

Linear classifiers are not the most advanced or accurate method of classification, but they are a good match for high volume data sources due to their efficiency and so are perfect for social data. The accuracy of the classifier depends greatly on the definition of the categories, quality and size of the training set and effort to iteratively improve results through tuning.

For this post I will concentrate on how we built the customer service routing classifier in our library. This classifier is designed to help airlines triage incoming customer requests.

Development Environment

Before I start, we use Python for our data science development work. To make use of our scripts you’ll need the following set up:

The Process

To build a classifier you’ll need to carry out the following steps:

  1. Define the categories you want to classify data into and the data points you need to consider
  2. Gather raw data for the training set
  3. Manually classify the raw data to form the training set
  4. Use machine learning to identify characteristics and generate scoring rules
  5. Test the rule set and iterate to improve the classifier’s accuracy

Let’s look at each of these in detail.

1. Define Your Categories & Targets

The first thing you need to consider is what categories you are going to classify your data into. It is essential to invest time upfront considering the categories, and to write a strong definition for each, including a few examples. The more precise and considered you can be here, the more efficient the learning process can be and the more useful your classifier will become.

Make sure your categories are a good fit for their eventual use. You must make sure that no categories overlap and that you have categories so that all possible interactions are covered. So for example you might want to include an 'other' category as we did below.

For the airline classifier, we spent a good amount of time looking into the kind of conversations that surround airline customer services and were inspired by this Altimeter study. We wanted to demonstrate how conversations could be classified for handling by a customer services team.

The categories we finally decided on were:

  • Rant: An emotionally charged criticism that may damage brand image
    • “After tweeting negative comment about EasyJet, I have been refused boarding! My rights has been violated!!!”
  • Rave: Thanks or positive opinion about flight or services
    • “Landing during storm, saved by EasyJet pilot, thanks”
  • Urgency: Request for resolving a real-time issue, including compensation
    • “EasyJet Flight cancelled. I demand compensation now!”
  • Query: A polite or neutral question about how to contact the company, use the website, print boarding card etc.
    • “Where can I find EasyJet hand luggage dimensions?”
  • Feedback: Statement about the flight or service, relating to how it could be improved, including complaints for delays without big anger.
    • “Dear EasyJet, how about providing WiFi onboard”
  • Lead: Contact from a customer interested in purchasing a ticket or other product/service in the near future
    • “EasyJet, do you sell group tickets to Prague?”
  • Others: Anything that doesn’t fit into the categories above


As you might outsource the training process (explained later) to a third party or to colleagues, clear definitions are extremely important.

With your categories defined, you now need to consider what fields of your interactions should be considered. For our classifier we decided that the interaction.content target contained the relevant information.

2. Gather Data For The Training Set

To carry out machine learning you will need to feed the algorithm a set of training data which has been classified by hand. The algorithm will use this data to identify features (keywords and phrases) that influence how a piece of content is classified.

To form the training set you can extract data from the platform (by running a historic query or recording a stream) and then manually put each interaction into a category. If you choose to use our scripts, use one of our push destinations to collect data as a JSON file, choosing the JSON newline delimited format.

To gather raw data for our airline classifier we used the following filter:

We ran this filter as a historic query to collect a list of 2000 interactions as an initial training set. Of course the more data you are able to manually classify, the higher quality your final classifier is likely to be.

NOTE: Remember to remove any duplicates from the dataset you extract. DataSift guarantees to deliver each interaction at least once. If there is a push failure we will try to push data again, and you may receive duplicate interactions. If you are on a UNIX platform you can remove them at the command line:

sort -u raw.json > deduped.json
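If you are not on a UNIX platform, or you want to keep interactions in their original order, the same clean-up can be sketched in Python. This assumes a duplicate delivery is a byte-for-byte identical line in the newline delimited file; the function names are my own and not part of our scripts.

```python
def dedupe_lines(lines):
    """Yield each distinct non-empty line once, preserving first-seen order."""
    seen = set()
    for line in lines:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            yield key


def dedupe_file(src_path, dst_path):
    """Copy src_path to dst_path, dropping repeated lines."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in dedupe_lines(src):
            dst.write(line + "\n")
```

Calling dedupe_file("raw.json", "deduped.json") would then produce a file with one copy of each interaction.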

3. Manually Classify Data To Form The Training Set

Now that you have a raw set of data, the interactions need to be manually assigned categories to form the training set.

As you are likely to have thousands of data points to classify, you may want to outsource this work. This is why it is vital to have well written definitions of your categories. We chose Airtasker to outsource the work. The advantage we found of Airtasker was that we could have assigned workers that we could communicate with and give feedback to.

We reformatted the raw JSON data as a CSV file to pass to our workers. The file contained the following fields:

  • - Used to rejoin the categories back on to the original interactions
  • interaction.content - The field that the worker needs to examine
  • Category - to be completed by the worker

Again, as with the training set size, the more effort you can spend here the better the results will be. You might want to consider asking multiple people to manually classify the data and correlate the results. Even with well written definitions, two humans may disagree on the right category.

With the results back from Airtasker we now had a raw set of interactions (as a JSON file) and a list of classified interactions (as a CSV file). These two combined formed our training set.
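Combining the two files might look something like the sketch below. It is not the code from our repository: the join key (interaction.id here), the CSV column names and the function names are assumptions made for the example.

```python
import csv
import io
import json


def load_labels(csv_text, key_column="id", label_column="category"):
    """Map each join key in the classified CSV to its category."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key_column]: row[label_column] for row in reader}


def build_training_set(json_lines, labels):
    """Pair each raw interaction's content with its hand-assigned category."""
    training = []
    for line in json_lines:
        interaction = json.loads(line)["interaction"]
        category = labels.get(interaction["id"])  # assumed join key
        if category:  # skip rows the worker left blank or that are missing
            training.append((interaction["content"], category))
    return training


# Tiny made-up example: one raw interaction and one classified row.
raw = ['{"interaction": {"id": "a1", "content": "Flight cancelled!"}}']
labels = load_labels("id,category\na1,Urgency\n")
print(build_training_set(raw, labels))  # [('Flight cancelled!', 'Urgency')]
```

Interactions without a category are dropped rather than guessed at, so an incomplete CSV shrinks the training set instead of polluting it.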

4. Generating A Classifier

With a training set in place the next step is to apply machine learning principles to generate rules for the linear classifier, and generate CSDL scoring rules to implement the classifier.

We implemented the algorithm in Python using the scikit-learn libraries, and the source is available here on GitHub.

At a high level the algorithm carries out these steps:

  • For each interaction in the training set, consider the target fields (in this case interaction.content)
    • Split into two sets, the first for training, the second for testing the classifier later
    • For each training interaction
      • Chunk the content into words and phrases
      • Build a list of candidate features to be considered for rules
  • Add / remove features based on domain knowledge (see below)
  • From the list of features select those with the most influence
  • Generate the classifier based on the selected features, and the interactions that match these features
  • Test the classifier against the training interactions and output results as a confusion matrix
  • Test the classifier against testing interactions put aside earlier
    • For each, logging the expected and actual categories assigned
    • Outputting overall results as a confusion matrix
  • Generate CSDL scoring rules from the classifier
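The splitting, chunking and reporting steps above can be sketched with the Python standard library alone. This is a simplified illustration, not the script from the GitHub repository (which uses scikit-learn); the function names, the 25% test split and the minimum-count threshold are all assumptions.

```python
import random
from collections import Counter


def split(examples, test_fraction=0.25, seed=0):
    """Shuffle and split labelled examples into (training, testing) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def chunk(content, max_n=2):
    """Chunk content into lowercase words and phrases of up to max_n words."""
    words = content.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


def candidate_features(training, min_count=2):
    """Keep chunks that appear in at least min_count training interactions."""
    counts = Counter()
    for content, _category in training:
        counts.update(set(chunk(content)))
    return {feature for feature, count in counts.items() if count >= min_count}


def confusion_matrix(expected, actual):
    """Count (expected, actual) category pairs."""
    return Counter(zip(expected, actual))
```

In the confusion matrix, pairs where the two categories match are correct classifications; everything else shows you which categories are being mistaken for which.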

The script takes in a raw JSON file of interactions (unclassified) and a CSV of classified interactions, matching the method I’ve explained. You can also specify keywords and phrases to include or exclude as an override to the automatically selected features.

See the GitHub repository for instructions on how to use the script.

Domain Knowledge

The script allows you to specify keywords and phrases that must or must not be considered when generating the classifier. This allows you a level of input into the results based on human experience.
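Conceptually, this override is just set arithmetic on the candidate features. The sketch below shows the idea; the function name and the example keywords are hypothetical and are not the lists we used.

```python
def apply_domain_knowledge(candidates, include=(), exclude=()):
    """Force-include and force-exclude features in a candidate set."""
    return (set(candidates) | set(include)) - set(exclude)


# Illustrative values only: keep a word a human knows is a strong signal,
# and drop a brand name that appears in almost every interaction.
selected = apply_domain_knowledge(
    {"flight", "easyjet", "cancelled"},
    include=["compensation"],
    exclude=["easyjet"],
)
print(sorted(selected))  # ['cancelled', 'compensation', 'flight']
```

Exclusions are applied last, so a feature listed in both include and exclude ends up excluded.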

For example we specified the following words should be considered for the airline classifier as we knew they would give us strong signal:


5. Perfecting The Classifier

Your first classifier might not give you a great level of accuracy. Once you have a method working, you may need to spend considerable time iterating and improving your classifier.

You might want to extract a larger set of training data or you may wish to add or remove keywords as you learn more about the data.

The script also allows you to manipulate the parameters passed to the statistical algorithms. Refining these parameters can produce significantly different results.


I hope this post has given you some insight into building a machine learned classifier. It is impossible to give a foolproof turnkey method as use cases vary so wildly.

As I said in the introduction, linear classifiers are suited to social data because of their efficiency. You may need to invest significant time perfecting your classifier; this is the nature of machine learning.

Check out our library for more examples of classifiers. We’ll be adding more linear classifiers soon!

To stay in touch with all the latest developer news please subscribe to our RSS feed at

Richard Caudle
Updated on Wednesday, 22 January, 2014 - 12:55

In a previous post I introduced you to the new DataSift library, a repository of extremely useful classifiers that will help you build great social solutions quickly. In this post using a practical example I'll show just how useful the classifiers are and how they can help you to start tackling a real-world scenario in minutes.

Analysing Sony's PS4 Launch

On the 14th of November at 11pm eastern time, Sony held their launch event for the PlayStation 4. Naturally there was a large volume of conversation taking place on social platforms around this time.

I decided to dig into this conversation as I wanted to learn:

  • Which news providers were the most influential amongst the interested audience
  • What types of people engaged with the PS4 launch
  • How did each of the above change over time?

Weapons Of Choice

To carry out my analysis I would need to:

  1. Extract historical Twitter data for the period around the event
  2. Push this data to a store where I could perform my analysis
  3. Aggregate and visualise the data so I could look for insights

Although this sounds like a lot of work, with the right tools it was quick and easy:

  • DataSift Historics - provides straightforward access to historic data
  • MySQL Destination - the new MySQL destination makes it really easy to push data to a MySQL database
  • Tableau - an awesome tool that makes data visualisation a breeze. It is able to pull data directly from a MySQL database.

So these three work perfectly in harmony, but I am missing one piece. How could I make sense of the data?

Tableau is a great tool, but I needed to deliver data to it in a structure that made sense for my analysis. In this case I needed to add the following metadata to each social interaction so that I could aggregate data appropriately:

  • The news provider whose content is being shared
  • The profession of the user who posted the interaction

Classifiers To The Rescue

Step in the DataSift Library, the hero of our piece. By using classifiers from the library I could attach the required metadata through tagging rules, taking advantage of the new VEDO features.

By using three library items my life became incredibly easy:

  1. Twitter Source - This classifier tells you what application was used to generate content, and splits these applications into buckets including 'manual' (created by a human) and 'automatic' (created by a bot or automated service). As I wanted to focus on engagement by real people, I used this classifier to exclude automated tweets.
  2. News Providers & Topics - This classifier identifies mentions of well known news providers by inspecting domains of links being shared in content. I used this classifier to tag content appropriately with the news provider being mentioned.
  3. Professions & Roles - This classifier identifies the profession of Twitter users from their profile. I used this classifier to tag content with the profession of the author.

Each of the items in the library has a hash, found on its library page, which is used to apply it to a stream:

Building My Stream

I started by writing a filter to match tweets relating to the launch event. I decided to look for interactions that mentioned PlayStation and keywords (launch, announces, announce). I looked both in the content of the post and also in the title of any content being shared through links.

Note: Links are automatically expanded and decorated with metadata by the Links augmentation.



With my filter created I now wanted to apply metadata to the matching interactions. I used the tags keyword to import the library items I chose, each with the appropriate hash:



My complete stream definition therefore became:



Data With Meaning

The advantage of using the library definitions is that the data is given meaning through metadata, so that when it reaches the application (Tableau in this case) it is much easier to process.

When an interaction arrives it will have a tag_tree property, containing any matching tags. So for example due to the library classifiers I might receive an interaction with the following metadata under interaction.tag_tree:



Gathering The Data

To store the historic data, I set up a MySQL destination as the destination for the historic query.

I created a fresh new database and used this script to create the necessary tables. I used this mapping file as my mapping definition.

I will cover the MySQL destination very soon in a separate post. You can read the technical documentation here.

I then ran a historic query for Twitter between 13th November 2013 5am GMT and 16th November 2013 5am GMT. 5am GMT is midnight Eastern time, so this gave me data for the day before, the day of and the day after the launch event.
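The time window can be checked with a few lines of Python. This sketch treats Eastern time as a fixed UTC-5 offset, which is an assumption that holds in mid-November because daylight saving time has already ended.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: US Eastern Standard Time as a fixed UTC-5 offset.
EASTERN = timezone(timedelta(hours=-5), "EST")

start = datetime(2013, 11, 13, 5, 0, tzinfo=UTC)
end = datetime(2013, 11, 16, 5, 0, tzinfo=UTC)

print(start.astimezone(EASTERN))  # 2013-11-13 00:00:00-05:00
print(end.astimezone(EASTERN))    # 2013-11-16 00:00:00-05:00
print(end - start)                # 3 days, 0:00:00
```

Both ends of the query land on midnight Eastern, so the window covers exactly three local calendar days.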

Finally I wrote some simple SQL views to augment the data ready for Tableau.



I used the tagged_interactions view for my analysis in Tableau.

Create A Tableau Dashboard

With the metadata attached to each interaction and exposed through my view, analysis with Tableau was quick and painless. I used Tableau to follow how coverage and engagement changed from before until after the launch.

Firstly I created a time series chart, showing the volume of tweets and retweets during the three day period. I plotted the created_at time against the number of records.

I then created a chart for both professions and news providers. The columns I created for the tags were perfect for Tableau.

For professions, I first filtered by tweets that were created manually (thanks to the Twitter Source library item). I then plotted profession (thanks to the Professions & Roles library item) against the number of records.

Similarly for news providers, I first filtered by tweets that were created manually and then plotted provider (thanks to the News Providers & Topics library item) against the number of records.

Finally I used the time series chart as a filter for the other two charts, combining the three on a dashboard so that I could look at activity across different time periods.
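If you don't have Tableau to hand, the filtering and counting behind the professions and news providers charts can be sketched in Python. The rows and column names below are made up to resemble the output of a tagged view; they are not real results.

```python
from collections import Counter

# Hypothetical rows, shaped like the output of a tagged view.
rows = [
    {"source": "manual", "profession": "journalist", "provider": "CNET"},
    {"source": "automatic", "profession": "journalist", "provider": "CNET"},
    {"source": "manual", "profession": "technologist", "provider": "The Guardian"},
    {"source": "manual", "profession": "journalist", "provider": None},
]


def count_by(rows, column):
    """Count manually created rows per value of column, skipping untagged rows."""
    return Counter(
        row[column]
        for row in rows
        if row["source"] == "manual" and row[column] is not None
    )


print(count_by(rows, "profession"))
print(count_by(rows, "provider"))
```

Because the tags arrive already attached to each interaction, the analysis step reduces to a filter and a group-by.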

The Results

Day Before Launch

The day before the launch the major news providers didn't see great engagement from their audience. The Guardian and CNET managed to generate most engagement. Certainly both providers have a reputation for excellent technology news coverage.

The launch at this point engaged journalists, technologists, entrepreneurs and creative professions. This story is what I would have expected as these groups follow technology and gaming trends most closely in my experience.

Day Of Launch

The day of the launch, the audience engages more widely with a variety of providers. Digital Spy takes a huge lead though, with users sharing its content most widely.

As far as professions are concerned, the audience is split similarly. The news of the launch has not been engaged with widely so far.

Day After Launch

As the launch was at 11pm, it's not that surprising that only fans would have engaged.

However, looking at the engagement after the launch, at this point major providers such as USA Today and CNN are managing to engage their audiences. As you might expect, the number of professions engaging broadens as the news of the launch is covered more in the mainstream press.

At this point I could have dug deeper. For instance I could have included further news providers that may specialise in such an event. This example though was designed to show how quickly the library can get you started.

Until Next Time...

It's difficult to convey in a blog post, but by using the tagging features of VEDO, the ready-made library definitions and the new MySQL destination, what could have been a couple of days' work was complete in a couple of hours.

Not having to post-process the data by writing and running scripts, and having reusable rules I could just pull out of the library made the task so much easier. Also the MySQL destination makes the process of extracting data and visualising it in Tableau completely seamless.

I hope this post gives you some insight into how these tools can be combined effectively. Please take a look at the ever-growing library and see how it can make your life easier.


Richard Caudle
Updated on Monday, 20 January, 2014 - 14:29

The launch of DataSift VEDO introduced new features to allow you to add structure to social data. Alongside it we introduced the DataSift library to help you build solutions faster and learn quicker.

Today we continue this theme by adding further items to the library. These include examples of machine learned classifiers which are sure to whet your appetite and get your creative juices flowing.

Machine Learned Classifiers

Since we announced VEDO there's been a lot of buzz around the possibilities of machine learning. Look out for a blog post coming very soon for an in-depth look.

We've introduced the following classifiers to the library to give you a taste of just what's possible:

  • Customer Service Routing - Many organisations employ staff to read customer service tweets and route them to the correct team. This classifier is trained specifically for airline customer services and shows how you could automate this process and save staffing costs.
  • Product Purchase Stage - Knowing what stage a customer is at, from initially assessing a product through to ownership, is incredibly powerful. This classifier demonstrates the concept and has been trained for PS4 discussion.
  • People vs Organizations - In many use cases you will want to distinguish between content created by organisations and individuals. This generic classifier allows you to do just that at scale.

These classifiers have been created by our Data Science team. They take a large sample of interactions from the platform, manually classify the interactions and use machine learning to learn the key signals that dictate which category an interaction should belong to. The result is a set of scoring rules that form the classifier. The resulting classifier can be run against live or historic data on an ongoing basis.

You can try out any of the classifiers now by creating a stream from the example code at the bottom of the library item page. For more details see my previous post.

Geo-Based Classifiers

Knowing a user's location can be extremely valuable for many use cases, yet location as a field can be very tricky to normalise.

As an example of how VEDO can help you with this process, we've introduced the following classifiers, which normalise geo-location information:

  • Major Airports - Categorises tweets made in and around major airports
  • NBA Arenas - Categorises tweets made in and around NBA venues
  • NFL Stadiums - Categorises tweets made in and around NFL stadia.

Outside of game days you'll see little traffic around sporting venues, but try running these on a match day to see the power of these definitions!
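One way to picture "in and around" a venue is a distance check against the venue's coordinates. The sketch below uses the haversine formula; it is not how the library definitions are written, and the venue coordinates (approximate) and the radius are assumptions for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical venue list: name -> (latitude, longitude, radius in km).
VENUES = {"London Heathrow": (51.4700, -0.4543, 3.0)}


def venue_for(lat, lon, venues=VENUES):
    """Return the first venue whose radius contains the point, else None."""
    for name, (v_lat, v_lon, radius) in venues.items():
        if haversine_km(lat, lon, v_lat, v_lon) <= radius:
            return name
    return None
```

A tweet geotagged a few hundred metres from the terminal would be labelled with the airport, while one from central London would get no venue at all.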

Improved Classifiers

Alongside introducing new classifiers and increasing the library's breadth, we've also worked hard on further improving two existing classifiers. We think you'll find these two extremely useful in your solutions:

  • Professions & Roles - We've restructured the taxonomy to professional function based on the LinkedIn hierarchy.
  • Twitter Source - This classifier has also been restructured to bucket applications into useful categories, including whether content has been manually created (say by a user on their mobile phone) or by an automated service.

Even More To Follow

We're not stopping here. Expect to see more and more items being added to the library, covering a wider range of use cases and industries. Keep an eye out for new items and please watch this blog for further news.


At this point let me encourage you to sign up if you're not already a DataSift user, and then jump straight into the library and see how it can make your life easier!

Richard Caudle
Updated on Thursday, 9 January, 2014 - 12:03

The launch of DataSift VEDO introduced new features to allow you to add structure to social data. Alongside VEDO we also introduced the DataSift library - a great new resource to inspire you and make your life easier. Benefit from our experience and build better solutions faster.

What? Why? Where?

We've introduced the DataSift library to help you to benefit from our experience working with our customers. CSDL (our filtering and tagging language) is extremely powerful, but it might not be clear exactly what can be achieved. Using the library we'll share with you definitions we've written for real-world solutions so you can learn quicker and get the most from our platform.

Currently the library contains tagging and scoring definitions that demonstrate the power of VEDO. There are out-of-the-box components you can use straight away, and example rules you can take and run with.

You'll find the new Library as a new tab on the Streams page:



Supported Out-Of-The-Box Components

Items marked as 'supported' in the library are definitions you can count on us to maintain. You can confidently use these as part of your solution immediately.

You can also use these definitions as a base to start from. You can copy the definitions into your account and modify the rules to fit your use case. After all 'spam' for one use case can be gold for another!

Supported items include:

  • Competitions & Marketing: Scores content to say how likely it is to be noise from competitions and marketing campaigns.
  • Twitter Application Used: Identifies and categorises the source application used to create a tweet - great for picking out content from services, real users and bots.
  • Professions & Roles: Where possible, identifies a user's profession and seniority based on their profile description.


Real-World Example Solutions

Items marked as 'example' in the library are definitions which we've built that will help you learn from real-world samples. You can run these examples directly, but we envisage you using these definitions as starting points and modifying or extending them to fit your solution.

Example items include:


Using a Library Item

It's easy to make use of a library item. You can either import an item into one of your streams, or copy an item into your account and modify it to your heart's content.

Note that all of the library items are currently tagging and scoring rules. You'll need to use them with a return statement. For more details please see our technical documentation.

Importing a definition

At the bottom of the page for each library item you'll find a tab labelled Usage Examples. This tab shows you example code which you can copy and paste into a new stream and run a live preview.


The key here is the tags keyword and hash for the stream. You can copy and paste this line into any of your streams to import the tagging rules.

Copying a definition

On each library item page there is a snippet of code that shows you the entire, or part of the definition. You can click the Copy to new stream button to copy the entire definition to your account. You can then inspect and modify the code as you see fit.



More To Follow

We'll work hard on adding more and more items to the library so it becomes an extremely valuable resource. Keep an eye out for new items and please watch this blog for further news.

At this point let me encourage you to sign up if you're not already a DataSift user, and then jump straight into the library and see how it can make your life easier!

Richard Caudle
Updated on Thursday, 19 December, 2013 - 09:11

The launch of DataSift VEDO introduced new features to allow you to add structure to social data. In my last few posts I introduced you to tags, tag namespaces and scoring and explained how you can use these features to classify data before it reaches your application.

In this post I will show you how once you’ve spent time building these rules, you can reuse them across many projects, getting maximum value for your hard work.

Creating A Reusable Tag Definition

On our platform a ‘tag definition’ is a stream you define which contains only tag rules, and no return statement.

To be clear, until now you may have used tags and a filter (as a return statement) together in one stream for example:

You can break out your tags into a reusable tag definition by saving just the tags in a new stream without the return statement:

When you save the stream (or compile using the API) you will be given a hash which represents this definition.

Using The Definition In A Stream

Now that we have a hash for the tag definition, we can make use of it in another stream using the tags keyword.

This will import the tag rules just as if they were written in place in the same stream. The tags will be applied as appropriate to interactions that match the filter in the return statement.

Using this simple but powerful feature you can create a library of valuable tag & scoring rules for your business model and reuse the same definitions across any number of streams.

Applying A Namespace

This is already a great feature, but things get even better when we throw in a namespace. Imagine you have a great set of tag rules you like to reuse often in your streams. You might want to organise your namespaces differently depending on the exact stream.

When you import a set of tags you can wrap them within a namespace:

Now the tags will be applied as before, but they will sit within a top-level namespace of user.

In fact, in practice I’ve found that this helps keep tag rules much more concise. You can declare tag rules with a shallow namespace, but when you import them you can wrap them in namespaces to build a very rich taxonomy.
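To picture what wrapping does to the resulting tags, here is a small Python sketch that nests a tag tree under a namespace. It models the effect only, not the platform's implementation; the tree shape, the function name and the support for multi-level dotted namespaces are assumptions.

```python
def wrap(tag_tree, namespace):
    """Nest an existing tag tree under one or more dotted namespace levels."""
    for level in reversed(namespace.split(".")):
        tag_tree = {level: tag_tree}
    return tag_tree


shallow = {"profession": ["journalist"]}  # hypothetical imported tags
print(wrap(shallow, "user"))
# {'user': {'profession': ['journalist']}}
```

The imported rules stay shallow and reusable, and each stream decides where in its own taxonomy they should sit.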

Something To Remember

It’s important to note that the hash for your tag definition will change if you update the tag definition itself. If you’ve used the stream keyword before you’ll be familiar with this concept.

This makes sense when you consume streams via the API, allowing you to make changes to definitions on the fly and switching to new definitions when suitable for your application.

You just need to remember that if you make a change to a reusable tag definition, make sure you take the new hash and update the streams which import the definition.
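The sketch below illustrates why this happens and how you might spot streams that still import an old hash. It uses SHA-256 purely as a stand-in, since the platform's actual hashing scheme isn't covered in this post, and the function and stream names are hypothetical.

```python
import hashlib


def definition_hash(definition_text):
    """Stand-in for the platform hash: a digest of the definition's text."""
    return hashlib.sha256(definition_text.encode("utf-8")).hexdigest()


def stale_imports(streams, current_hash):
    """Return names of streams whose imported hash is not the current one."""
    return [name for name, imported in streams.items() if imported != current_hash]


old_hash = definition_hash("tag rules, version 1")
new_hash = definition_hash("tag rules, version 2")

# Hypothetical record of which hash each stream imports.
streams = {"launch_analysis": old_hash, "brand_monitor": new_hash}
print(stale_imports(streams, new_hash))  # ['launch_analysis']
```

Because the identifier is derived from the content, any edit produces a new identifier, and anything pinned to the old one keeps using the old rules until you update it.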

Let’s Not Stop There Though…

Reusable tag definitions are super powerful because they allow you to build up a library of rules which you can use across projects and throughout your organisation.

For example, you could build the following reusable definitions:

  • A spam classifier tailored perfectly to your use case
  • A rich taxonomy to exactly fit your business model or industry
  • An influencer model to use across many projects

To give you a head start we’ve also released our own library of reusable definitions. In minutes you can benefit from our hard work!

For full details of all the features see our technical documentation.

This post concludes my series on our new tagging and scoring features. Don’t go away though as there are many more features I’ll cover in the coming weeks, and I’ll also take you through some much richer real-world examples.

