Building Better Machine Learned Classifiers Faster with Active Learning

Richard Caudle | 14th July 2015

You might have seen our recent announcement of VEDO Intent. You're probably aware that DataSift VEDO allows you to run machine-learned classifiers on our platform. Unfortunately, creating a high-quality classifier relies on a good quantity of high-quality, manually classified training data (which can be painstaking to produce) and on exploring machine learning algorithms to get the best possible result.

VEDO Intent is a tool that dramatically cuts the time it takes to create a high-quality classifier. Firstly, it reduces the time required to manually classify your training set; secondly, it automates the exploration of machine learning algorithms and hyperparameter optimisation; and finally, it generates CSDL to run on our platform. Essentially, the tool allows you to build high-quality classifiers with no data science experience.

VEDO Intent will be available to customers soon. In this post we'll take a first look at how the tool uses various data science techniques to help you build training data and generate classifiers.

Employing Active Learning to Help Build Training Sets

Naturally, the accuracy of any machine-learned classifier is highly dependent on the size and variation of the training set you provide. When we first released VEDO we also released an open source library, based on scikit-learn, that allows you to train linear classifiers to run on our platform. To use this library you need to supply a training set of interactions that you've spent time manually assigning classes to, which can take a long time if you want to produce a high-quality classifier.

The goal of VEDO Intent is to greatly reduce the time you need to spend manually classifying interactions for your training set. The tool presents the interactions it is least certain about, so you only spend time classifying the interactions that will most improve your classifier.

When you use the tool you are asked to label interactions from your sample data set. The tool presents a few interactions at a time for you to inspect.

Initially the tool presents interactions randomly until it has enough to generate a first model - at least 20 interactions for each class you've specified.

Next, the tool uses your input so far to train a classification model. The model defines a hyperplane in a multidimensional feature space; the interactions closest to the hyperplane are those the tool is least confident about assigning classes to. By asking you to classify these interactions manually, the tool improves the model's accuracy where it matters most.
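This selection step is often called uncertainty sampling, and its core can be sketched in a few lines of Python. The function name and the scores below are illustrative, not part of the tool:

```python
# A minimal sketch of uncertainty sampling: given each unlabelled item's
# signed distance from the model's decision hyperplane, pick the items the
# model is least sure about (smallest absolute distance) for manual labelling.

def select_most_uncertain(distances, k):
    """Return the indices of the k items closest to the hyperplane."""
    ranked = sorted(range(len(distances)), key=lambda i: abs(distances[i]))
    return ranked[:k]

# Example: items 1 and 3 sit almost on the boundary, so they are queried first.
scores = [2.7, 0.1, -1.9, -0.2, 3.4]
print(select_most_uncertain(scores, 2))  # -> [1, 3]
```

Items far from the hyperplane are already classified with high confidence, so asking a human to label them adds little; labels near the boundary move the boundary the most.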


The tool will continue to ask you to classify interactions to improve the model until you're happy with the predicted accuracy and it has a minimum number of interactions to train a complete classifier.

Improving Results with Grid Search & Cross Validation

When you've completed your manual labelling of interactions you can generate a full classifier which can be run on our platform.

Our open source library requires you to have a good foundation of data science so that you can test and improve your classifier. VEDO Intent carries out this optimisation for you.

The tool carries out a grid search across feature extraction options, feature selection strategies, and different machine learning algorithms (including support vector machines and Naive Bayes), along with their associated hyperparameters.

The tool also performs cross-validation at each point in the grid search, selecting the combination of all the potential options that gives you the highest quality classifier.
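As an illustration of this combination, here's a minimal sketch (not DataSift's actual pipeline) using scikit-learn, the library our open source tooling is built on. The example texts, labels, and parameter grid are all invented:

```python
# Grid search with cross-validation over a small text-classification pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = ["lost my luggage", "great flight", "delayed departure",
         "love this airline", "terrible service", "smooth landing",
         "missed connection", "friendly crew"]
labels = ["negative", "positive", "negative", "positive",
          "negative", "positive", "negative", "positive"]

pipeline = Pipeline([("features", TfidfVectorizer()), ("clf", LinearSVC())])

# Each entry is one axis of the grid: a feature extraction option
# and a model hyperparameter.
param_grid = {
    "features__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# Every combination is scored by cross-validation (2 folds for this toy set),
# and the best-scoring combination is kept.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

Cross-validating each grid point, rather than scoring on the training data itself, avoids picking a combination that merely memorises the training set.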

Using VEDO Intent

When using the tool you carry out the following steps:

1. Uploading sample data

The first step when using the tool is to upload sample data to work with. The data needs to be representative of the data you're looking to classify in future.


Data is uploaded by dragging a file into the space at the top of the tool. The data needs to be in newline-delimited JSON format; if you've used our platform before, this is the format in which we deliver raw data to you. For instance, you could run a historic task to extract data, then upload the delivered data to the tool.
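To illustrate the format, here's a hypothetical snippet: one JSON interaction object per line. The content values are invented, though `interaction.content` is a real field in our delivered data:

```python
# Parsing newline-delimited JSON: each non-empty line is one interaction.
import json

raw = "\n".join([
    '{"interaction": {"content": "My flight was delayed again"}}',
    '{"interaction": {"content": "Smooth check-in and friendly staff"}}',
])

interactions = [json.loads(line) for line in raw.splitlines() if line.strip()]
print(len(interactions))  # -> 2
```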

2. Deciding on classes

Next the tool will let you specify classes (or labels) you'd like to categorize your data into.


Deciding on the right classes is very important. You can read more about choosing classes in Step 1 of our previous post on machine learning.

3. Labelling interactions

At this point you will be asked to start manually labelling your data. The tool will take you through the process described above which uses active learning techniques.


For each presented interaction you need to either select a label or skip the interaction. You can also flag interactions to revisit them later if you're unsure what label to assign.

Aside from assigning classes to individual interactions, you can also give the tool hints by specifying keywords for each class. The tool will use these hints the next time it builds a model.

4. Building a classifier

Once you've labelled enough interactions you can train your first classifier.

The tool will take some time to run this process as it carries out the grid search discussed above. When the grid search is complete the tool will return with the full classification results as well as a CSDL hash you can immediately use on the platform.

5. Reviewing results

The results view allows you to fully assess the quality of the classifier produced.

The results include standard accuracy measurements (agreement, F1 score, precision, recall) for the classifier, comparing the labels you assigned in the training set against the model's predictions. You can also review the confusion matrix for the classifier.
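As a sketch of how these measurements relate, here's how precision, recall, F1, and agreement for one class fall out of a confusion matrix. The labels and predictions below are invented:

```python
# Computing per-class metrics from a confusion matrix of (actual, predicted) pairs.
from collections import Counter

actual    = ["pos", "pos", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "pos", "pos", "neg"]

confusion = Counter(zip(actual, predicted))

tp = confusion[("pos", "pos")]  # correctly predicted "pos"
fp = confusion[("neg", "pos")]  # predicted "pos" but actually "neg"
fn = confusion[("pos", "neg")]  # predicted "neg" but actually "pos"

precision = tp / (tp + fp)                        # of predicted "pos", how many were right
recall = tp / (tp + fn)                           # of actual "pos", how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
agreement = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

print(precision, recall, f1, agreement)
```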


In addition, you can download the 'mismatches' file to see where the model's predictions disagreed with your labels.

Based on the results you could at this point choose to label more data or revisit the labels and features you input to try to improve your classifier.

6. Running the classifier on the platform

Training a classifier produces CSDL code, which is compiled for you on the platform. The result is a hash that you can run by including it in any filter you choose. For example:

tags "68ca0c21208e02f996d4ca9e3e10d8f8"
return {
    interaction.content contains_any "airways, flight, luggage, departure"
    AND language.tag == "en" AND language.confidence >= 70
}

You can then make use of these classes in your analysis.
