How To Apply Machine Learning And Give Social Data Meaning

Richard Caudle | 29th January 2014

Recently I've covered the tagging and scoring features of DataSift VEDO. My post on scoring gave a top-level overview and a simple example, but might have left you hungry for something a little more meaty. In this post I’ll take you through how we’ve started to build linear classifiers internally using machine learning techniques to tackle more complex classification projects. This post will explain what a linear classifier is and how it can help you, and give you a method to get started building your own.

What Is A Linear Classifier?

Until now you’re likely to have relied on boolean expressions to categorise your social data based on looking at data by eye. A linear classifier, made possible by our new scoring features, allows you to categorise data based on machine learned characteristics over much larger data sets.

A linear classifier is a machine learned method for categorising data. Machine learning is used to identify key characteristics in a training set of data and give each characteristic a weighting to reflect its influence. When the resulting classifier is run against new data, each piece of data is given a score for how likely it is to belong in each category. The category with the highest score is considered the most appropriate category for the new piece of data.
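
To make the scoring idea concrete, here is a minimal Python sketch. The weights, feature words and function names are invented for illustration and are not taken from our airline classifier; a real classifier would learn its weights from the training set.

```python
# Hypothetical weights: one weight per (category, feature) pair.
WEIGHTS = {
    "rant": {"refused": 1.5, "violated": 2.0, "thanks": -1.0},
    "rave": {"thanks": 2.0, "saved": 1.0, "refused": -1.5},
}


def score(text, weights):
    """Sum the weights of the features present in the text, per category."""
    words = set(text.lower().split())
    return {category: sum(w for feature, w in features.items() if feature in words)
            for category, features in weights.items()}


def classify(text, weights):
    """Return the category with the highest score."""
    scores = score(text, weights)
    return max(scores, key=scores.get)


print(classify("Saved by the pilot thanks", WEIGHTS))
```

Each category's score is just a weighted sum, which is why this kind of classifier stays cheap to run at high volume.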

Linear classifiers are not the most advanced or accurate method of classification, but they are a good match for high volume data sources due to their efficiency and so are perfect for social data. The accuracy of the classifier depends greatly on the definition of the categories, quality and size of the training set and effort to iteratively improve results through tuning.

For this post I will concentrate on how we built the customer service routing classifier in our library. This classifier is designed to help airlines triage incoming customer requests.

Development Environment

Before I start, we use Python for our data science development work. To make use of our scripts you’ll need the following set up:

The Process

To build a classifier you’ll need to carry out the following steps:

  1. Define the categories you want to classify data into and the data points you need to consider
  2. Gather raw data for the training set
  3. Manually classify the raw data to form the training set
  4. Use machine learning to identify characteristics and generate scoring rules
  5. Test the rule set and iterate to improve the classifier’s accuracy

Let’s look at each of these in detail.

1. Define Your Categories & Targets

The first thing you need to consider is what categories you are going to classify your data into. It is essential to invest time upfront considering the categories, and to write a strong definition for each, including a few examples. The more precise and considered you can be here, the more efficient the learning process can be and the more useful your classifier will become.

Make sure your categories are a good fit for their eventual use. No two categories should overlap, and together they must cover every possible interaction. For example, you might want to include an 'other' category, as we did below.

For the airline classifier, we spent a good amount of time looking into the kind of conversations that surround airline customer services and were inspired by this Altimeter study. We wanted to demonstrate how conversations could be classified for handling by a customer services team.

The categories we finally decided on were:

  • Rant: An emotionally charged criticism that may damage brand image
    • “After tweeting negative comment about EasyJet, I have been refused boarding! My rights has been violated!!!”
  • Rave: Thanks or positive opinion about flight or services
    • “Landing during storm, saved by EasyJet pilot, thanks”
  • Urgency: Request for resolving a real-time issue, including compensation
    • “EasyJet Flight cancelled. I demand compensation now!”
  • Query: A polite or neutral question about how to contact the company, use the website, print boarding card etc.
    • “Where can I find EasyJet hand luggage dimensions?”
  • Feedback: Statement about the flight or service, relating to how it could be improved, including complaints for delays without big anger.
    • “Dear EasyJet, how about providing WiFi onboard”
  • Lead: Contact from a customer interested in purchasing a ticket or other product/service in the near future
    • “EasyJet, do you sell group tickets to Prague?”
  • Others: Anything that doesn’t fit into the categories above

As you might outsource the training process (explained later) to a third party or to colleagues, clear definitions are extremely important.

With your categories defined, you now need to consider what fields of your interactions should be considered. For our classifier we decided that the interaction.content target contained the relevant information.

2. Gather Data For The Training Set

To carry out machine learning you will need to feed the algorithm a set of training data which has been classified by hand. The algorithm will use this data to identify features (keywords and phrases) that influence how a piece of content is classified.

To form the training set you can extract data from the platform (by running a historic query or recording a stream) and then manually put each interaction into a category. If you choose to use our scripts, use one of our push destinations to collect data as a JSON file, choosing the JSON newline delimited format.

To gather raw data for our airline classifier we used the following filter:

// Only the first message in a conversation (no replies)
not twitter.in_reply_to_status_id exists

// Tweets that mention an airline customer services account
and twitter.mentions in "aegeanairlines,AirCanada,Air_Dolomiti,airfrance,AIRNZUSA,AlaskaAir,Alitalia,AmericanAir,asercaairlines,AsianaAirlines,_austrian,AviorAirlines,Belavia_by,British_Airways,BulgariaAir,camairco,cathaypacific,CebuPacificAir,CopaAirlines,CSAIR_GLOBAL,CzechAirlines,DeltaAssist,easyJet,EtihadHelp,EtihadAirways,FinnairUK,FlyAirNZ,flyethiopian,FlyFirefly,FlyFrontier,FlyingBrussels,Fly_Norwegian,flyPAL,flysaa,flySAA_US,flyspicejet,flysrilankan,FlySWISS,fly_uia,FrontierCare,FlyFrontier,HawaiianAir,HawkairAirline,Iberia,Iberia_en,Icelandair,IndiGo6E,IndonesiaGaruda,jetairfly,jetairways,JetBlue,JetstarAirways,Jetstar_NZ,KenyaAirways,KLM,KLM_US,KoreanAir_KE,kulula,KuwaitAirways,LANAirlinesUSA,lufthansa,Lufthansa_USA,MAS,MERPATI_Info,MexicanaAir,Monarch,porterairlines,QantasAirways,QantasUSA,qatarairways,SAA_UK,SAS,Saudi_Airlines,SingaporeAir,SouthwestAir,SunCountryAir,TAMAirlines,taportugal,ThaiAirways,TK_HelpDesk,TurkishAirlines,united,UNITED_AlRLINES,USAirways,VirginAmerica,VirginAtlantic,VirginAustralia,volotea,WestJet"

We ran this filter as a historic query to collect a list of 2000 interactions as an initial training set. Of course the more data you are able to manually classify, the higher quality your final classifier is likely to be.

NOTE: Remember to remove any duplicates from the dataset you extract. DataSift guarantees to deliver each interaction at least once. If there is a push failure we will try to push data again, and you may receive duplicate interactions. If you are on a UNIX platform you can remove them at the command line:

sort raw.json | uniq > deduped.json
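
If you would rather stay in Python, the same clean-up can be sketched as below. It keeps the first copy of every line and preserves the original order. It assumes that a re-pushed interaction arrives as an identical line of text; the function name and the file names in the comment are only illustrative.

```python
def dedupe_lines(lines):
    """Keep the first occurrence of each non-empty line, preserving order."""
    seen = set()
    unique = []
    for line in lines:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Typical use with newline delimited JSON files (paths are examples):
#   with open("raw.json") as src, open("deduped.json", "w") as dst:
#       dst.write("\n".join(dedupe_lines(src)) + "\n")
print(dedupe_lines(['{"id": 1}\n', '{"id": 2}\n', '{"id": 1}\n']))
```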

3. Manually Classify Data To Form The Training Set

Now that you have a raw set of data, the interactions need to be manually assigned categories to form the training set.

As you are likely to have thousands of data points to classify, you may want to outsource this work. This is why it is vital to have well written definitions of your categories. We chose Airtasker to outsource the work. The advantage we found of Airtasker was that we could have assigned workers that we could communicate with and give feedback to.

We reformatted the raw JSON data as a CSV file to pass to our workers. The file contained the following fields:

  • - Used to rejoin the categories back on to the original interactions
  • interaction.content - The field that the worker needs to examine
  • Category - to be completed by the worker
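
As a rough sketch, the reformatting step could look like this in Python. It assumes each line of the raw file is a JSON object with an "interaction" object holding "id" and "content" values, and it uses a hypothetical "id" column as the join key; adjust both to match your own data.

```python
import csv
import io
import json


def interactions_to_csv(ndjson_lines, out):
    """Write one CSV row per interaction, leaving Category blank for the worker."""
    writer = csv.writer(out)
    writer.writerow(["id", "interaction.content", "Category"])
    for line in ndjson_lines:
        if not line.strip():
            continue  # ignore blank lines
        interaction = json.loads(line).get("interaction", {})
        writer.writerow([interaction.get("id", ""),
                         interaction.get("content", ""),
                         ""])


# Small in-memory demonstration with an invented interaction.
sample = ['{"interaction": {"id": "a1", "content": "Hello, EasyJet"}}']
buffer = io.StringIO()
interactions_to_csv(sample, buffer)
print(buffer.getvalue())
```

The csv module takes care of quoting, so content containing commas or quotes still ends up in a single cell.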

Again, as with the training set size, the more effort you can spend here the better the results will be. You might want to consider asking multiple people to manually classify the data and correlate the results. Even with well written definitions two humans may disagree on the right category.
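
One simple way to combine several people's answers is a majority vote, sketched below. This is an illustration rather than the method we used, and the function name is invented. Interactions with no clear winner come back as None so they can be reviewed by hand.

```python
from collections import Counter


def majority_label(labels):
    """Return the most common label, or None if there is no clear winner."""
    top = Counter(labels).most_common(2)
    if not top:
        return None
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # tie between the two most common labels
    return top[0][0]


print(majority_label(["Rant", "Urgency", "Rant"]))
print(majority_label(["Rant", "Urgency"]))
```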

With the results back from Airtasker we now had a raw set of interactions (as a JSON file) and a list of classified interactions (as a CSV file). These two combined formed our training set.
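
Combining the two files might be sketched as follows. As before, the "id" column and the layout of the JSON records are assumptions made for the example, not a description of our script.

```python
import csv
import io
import json


def build_training_set(ndjson_lines, csv_file):
    """Pair each interaction's content with the category a worker assigned."""
    labels = {}
    for row in csv.DictReader(csv_file):
        category = (row.get("Category") or "").strip()
        if category:
            labels[row["id"]] = category
    training = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        interaction = json.loads(line).get("interaction", {})
        category = labels.get(interaction.get("id"))
        if category:  # skip interactions nobody classified
            training.append((interaction.get("content", ""), category))
    return training


# Invented sample data: two interactions, only one of them classified.
raw = ['{"interaction": {"id": "a1", "content": "Flight cancelled!"}}',
       '{"interaction": {"id": "a2", "content": "Thanks"}}']
sheet = io.StringIO("id,interaction.content,Category\r\n"
                    "a1,Flight cancelled!,Urgency\r\n"
                    "a2,Thanks,\r\n")
training_set = build_training_set(raw, sheet)
print(training_set)
```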

4. Generating A Classifier

With a training set in place the next step is to apply machine learning principles to generate rules for the linear classifier, and generate CSDL scoring rules to implement the classifier.

We implemented the algorithm in Python using the scikit-learn libraries, and the source is available here on GitHub.

At a high level the algorithm carries out these steps:

  • For each interaction in the training set, consider the target fields (in this case interaction.content)
    • Split into two sets, the first for training, the second for testing the classifier later
    • For each training interaction
      • Chunk the content into words and phrases
      • Build a list of candidate features to be considered for rules
  • Add / remove features based on domain knowledge (see below)
  • From the list of features select those with the most influence
  • Generate the classifier based on the selected features, and the interactions that match these features
  • Test the classifier against the training interactions and output results as a confusion matrix
  • Test the classifier against testing interactions put aside earlier
    • For each, logging the expected and actual categories assigned
    • Outputting overall results as a confusion matrix
  • Generate CSDL scoring rules from the classifier
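
To give a feel for the shape of these steps, here is a dependency-free Python sketch. It does not reproduce our script: instead of scikit-learn it swaps in a plain multiclass perceptron, one simple way to learn a linear classifier, and it leaves out feature selection and CSDL generation. All names and the toy data are invented for the example.

```python
import random
from collections import defaultdict


def chunk(text, max_n=2):
    """Split text into lowercase word n-grams: single words, then two-word phrases."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


def split_train_test(items, test_fraction=0.25, seed=0):
    """Shuffle reproducibly, then hold back a fraction for testing."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * (1 - test_fraction))
    return items[:cut], items[cut:]


def predict(weights, text, categories):
    """Score the text for every category and return the highest-scoring one."""
    features = chunk(text)
    scores = {c: sum(weights[c].get(f, 0.0) for f in features) for c in categories}
    return max(categories, key=lambda c: scores[c])


def train(examples, categories, epochs=10):
    """Multiclass perceptron: on each mistake, shift weight towards the
    expected category and away from the wrongly predicted one."""
    weights = {c: defaultdict(float) for c in categories}
    for _ in range(epochs):
        for text, expected in examples:
            actual = predict(weights, text, categories)
            if actual != expected:
                for f in chunk(text):
                    weights[expected][f] += 1.0
                    weights[actual][f] -= 1.0
    return weights


def confusion_matrix(weights, examples, categories):
    """Count (expected, actual) pairs; rows are the expected categories."""
    matrix = {e: {a: 0 for a in categories} for e in categories}
    for text, expected in examples:
        matrix[expected][predict(weights, text, categories)] += 1
    return matrix


CATEGORIES = ["rave", "urgency"]
EXAMPLES = [  # invented toy data, not from the real training set
    ("thanks for a great flight", "rave"),
    ("great crew thanks", "rave"),
    ("flight cancelled refund now", "urgency"),
    ("cancelled again need refund", "urgency"),
]

weights = train(EXAMPLES, CATEGORIES)
print(confusion_matrix(weights, EXAMPLES, CATEGORIES))
```

With a real training set you would call split_train_test first, train on the first part and build the confusion matrix from the part held back, so the results reflect data the classifier has not seen.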

The script takes in a raw JSON file of interactions (unclassified) and a CSV of classified interactions, matching the method I’ve explained. You can also specify keywords and phrases to include or exclude as an override to the automatically selected features.

See the GitHub repository for instructions on how to use the script.

Domain Knowledge

The script allows you to specify keywords and phrases that must or must not be considered when generating the classifier. This allows you a level of input into the results based on human experience.
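
In spirit, the override works like the small sketch below. The function and argument names are hypothetical and the script's real options may differ; here an excluded term always wins over an included one.

```python
def apply_domain_knowledge(selected, include=(), exclude=()):
    """Force some features in and others out of an automatically selected list."""
    features = (set(selected) | set(include)) - set(exclude)
    return sorted(features)


# Invented example: keep "compensation" even if it was not selected, drop "the".
print(apply_domain_knowledge(["delayed", "the", "refund"],
                             include=["compensation"],
                             exclude=["the"]))
```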

For example, we specified that the following words should be considered for the airline classifier, as we knew they would give us a strong signal:


5. Perfecting The Classifier

Your first classifier might not give you a great level of accuracy. Once you have a method working, you may need to spend considerable time iterating and improving your classifier.
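
When comparing iterations it helps to boil each confusion matrix down to a few numbers. The sketch below assumes the matrix is held as a dictionary keyed by expected category and then by assigned category; that layout and the counts are made up for the example.

```python
def accuracy(matrix):
    """Share of interactions whose assigned category matched the expected one."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[c].get(c, 0) for c in matrix)
    return correct / total if total else 0.0


def per_category(matrix):
    """Precision and recall for each category (rows are expected categories)."""
    report = {}
    for cat in matrix:
        hits = matrix[cat].get(cat, 0)
        expected_total = sum(matrix[cat].values())
        assigned_total = sum(row.get(cat, 0) for row in matrix.values())
        report[cat] = {
            "precision": hits / assigned_total if assigned_total else 0.0,
            "recall": hits / expected_total if expected_total else 0.0,
        }
    return report


# Made-up counts for two categories.
example = {"rant": {"rant": 8, "rave": 2}, "rave": {"rant": 1, "rave": 9}}
print(accuracy(example))
print(per_category(example))
```

A category with low recall is being missed, while low precision means other categories are leaking into it, which tells you where to add training data or keywords.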

You might want to extract a larger set of training data or you may wish to add or remove keywords as you learn more about the data.

The script also allows you to manipulate the parameters passed to the statistical algorithms. Refining these parameters can produce significantly different results.


I hope this post has given you some insight into building a machine learned classifier. It is impossible to give a foolproof turnkey method as use cases vary so wildly.

As I said in the introduction, linear classifiers are suited to social data because of their efficiency. You may need to invest significant time perfecting your classifier; this is the nature of machine learning.

Check out our library for more examples of classifiers. We’ll be adding more linear classifiers soon!

To stay in touch with all the latest developer news please subscribe to our RSS feed at
