Blog posts in Engineering

By Richard Caudle

Validating Interaction Filters with Facebook Super Public Text Samples

DataSift PYLON for Facebook Topic Data allows you to analyze audiences on Facebook whilst protecting users' privacy. To help you build more accurate analyses we're introducing 'Super Public' text samples for Facebook. You can use Super Public text samples to validate your interaction filters and check that you are recording the correct data into your index for analysis. You can also use these text samples to train machine-learned classifiers. In this post we'll take a look at this new type of data and how to use it to validate interaction filters.

What are Super Public text samples?

When you work with PYLON you start by creating a filter in CSDL that specifies which stories (posts) and engagements you'd like to record into your index for analysis. You cannot view the raw text of the stories for privacy reasons as these include non-public posts.

Super Public text samples are different: they are stories which the author has chosen to share publicly, so for these stories we can give you access to the content. Stories are classed as Super Public if they are:

  • Posted by someone who has “Who can see your future posts?” set to “Public” under their privacy settings
  • Posted by someone who has chosen to make a specific post publicly viewable
  • Posted by someone who has the Follow setting enabled, allowing non-friends to see their stories
  • Not posted to someone else’s timeline

As you have access to the raw content of these stories they are very useful for validating the results of your analysis of the non-public stories and engagements.

Accessing Super Public text samples

If Super Public text samples are enabled on your account, then when you record data into an index using an interaction filter, any Super Public stories that match the running filter are recorded in a separate cache alongside the index.

You can retrieve the Super Public stories using the pylon/sample endpoint. When you fetch stories they are deleted from the cache, so you'll need to store them yourself for future use. This endpoint is limited to 100 stories per hour.
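As a minimal sketch of a polling workflow in Python (the exact URL, query parameter and response shape shown here are assumptions - check the pylon/sample API documentation for your account):

import json
import time
import requests

API_URL = 'https://api.datasift.com/pylon/sample'  # assumed path for the pylon/sample endpoint
HEADERS = {
    'Content-type': 'application/json',
    'Authorization': '[your username]:[identity api key]',
}

def fetch_and_store(recording_hash, out_path='super_public_samples.json'):
    # Stories are deleted from the cache once fetched, so write everything to disk
    response = requests.get(API_URL, params={'hash': recording_hash}, headers=HEADERS)
    response.raise_for_status()
    body = response.json()  # assumed to contain the fetched stories
    with open(out_path, 'a') as f:
        for story in body.get('interactions', []):
            f.write(json.dumps(story) + '\n')

# The endpoint is limited to 100 stories per hour, so poll at most once an hour
while True:
    fetch_and_store('[hash for recording]')
    time.sleep(3600)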

Validating interaction filters

Let's take a look at an example to make things clearer.

Perhaps you'd like to analyze audiences discussing the US presidential election. You could create a filter such as this:

fb.parent.topics.name == "Democratic Party" OR fb.topics.name == "Democratic Party"
OR fb.parent.topics.name == "Republican National Committee" OR fb.topics.name == "Republican National Committee"
OR fb.content contains_any "Clinton, Trump, Sanders, Carson, Bush"

Running this interaction filter as a recording will store any matching stories into your index. In addition, any Super Public stories that match the conditions will also be stored but into a separate cache.

Fetching text samples

Before you start analyzing the recorded data in your index, you will no doubt want to validate the data you are recording. You can do this by fetching data from the pylon/sample endpoint:

curl -X GET 'https://api.datasift.com/pylon/sample?hash=[hash for recording]' \
-H 'Content-type: application/json' \
-H 'Authorization: [your username]:[identity api key]'

Based on the example filter you might get the following Super Public story back:

In this case a story about George H. W. Bush has matched the filter, but we are only interested in Jeb Bush!

You can now amend the last clause of your interaction filter to include the phrase "Jeb Bush" and remove this noise:

fb.content contains_any "Clinton, Trump, Sanders, Carson, Jeb Bush"

Next time...

You can see that Super Public text samples are going to be an important part of your workflow, allowing you to validate the data you are recording. This is critical because you cannot access the raw text of the posts in your index.

In a future post we'll take a look at how you can use this data for training machine-learned classifiers so that you can identify sentiment, intention and emotion in Facebook posts.

By Richard Caudle

Building Better Machine Learned Classifiers Faster with Active Learning

You might have seen our recent announcement, which covered many things including VEDO Intent. You're probably aware that DataSift VEDO allows you to run machine-learned classifiers. Unfortunately, creating a high-quality classifier relies on a good quantity and quality of manually classified training data (which can be painstaking to produce) and on exploring machine learning algorithms to get the best possible result.

VEDO Intent is a tool that helps dramatically cut the time it takes to create a high-quality classifier. Firstly it reduces the time required to manually classify your training set, secondly it automates the process of exploring machine learning algorithms and hyperparameter optimisation, and finally it generates CSDL to use on our platform. Essentially the tool allows you to build high-quality classifiers with no data science experience.

VEDO Intent will be available to customers soon. In this post we'll take a first look at how the tool uses various data science techniques to help you build training data and generate classifiers.

Employing Active Learning to Help Build Training Sets

Naturally the accuracy of any machine-learned classifier is highly dependent on the size and variation of the training set you provide. When we first released VEDO we also released an open source library based on scikit-learn that allowed you to train linear classifiers, which you can run on our platform. To use this library you need to supply a training set of interactions that you've spent time manually assigning classes to - this can take a long time if you want to produce a high-quality classifier.

The goal of VEDO Intent is to greatly reduce the time you need to spend manually classifying interactions for your training set. The tool presents to you the interactions it is least certain of, so you only need to spend time classifying the most significant interactions which will improve your classifier.

When you use the tool you are asked to label interactions from your sample data set. The tool presents a few interactions at a time for you to inspect.

Initially the tool presents interactions randomly until it has enough to generate a first model - at least 20 interactions for each class you've specified.

Next, the tool uses your input so far to train a classification model. The model represents a hyperplane in multidimensional space. Interactions that are closest to the hyperplane are the interactions that the tool is least confident of predicting classes for. By asking you to manually classify these interactions the tool knows it will greatly improve the model's accuracy.

The tool will continue to ask you to classify interactions to improve the model until you're happy with the predicted accuracy and it has a minimum number of interactions to train a complete classifier.
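VEDO Intent's internals aren't public, but the underlying idea, uncertainty sampling against a linear model, is easy to sketch with scikit-learn. The feature extraction and model choices below are illustrative assumptions rather than the tool's actual configuration:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def least_confident(unlabelled_texts, labelled_texts, labels, batch_size=5):
    # Train a linear model on the interactions labelled so far
    vectorizer = TfidfVectorizer()
    X_labelled = vectorizer.fit_transform(labelled_texts)
    X_unlabelled = vectorizer.transform(unlabelled_texts)
    model = LinearSVC().fit(X_labelled, labels)

    # Distance to the hyperplane: a small absolute margin means low confidence
    margins = np.abs(model.decision_function(X_unlabelled))
    if margins.ndim > 1:
        margins = margins.max(axis=1)  # multi-class: use the strongest class score as a rough proxy

    # These are the interactions a human should label next
    next_batch = np.argsort(margins)[:batch_size]
    return [unlabelled_texts[i] for i in next_batch]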

Improving Results with Grid Search & Cross Validation

When you've completed your manual labelling of interactions you can generate a full classifier which can be run on our platform.

Our open source library requires you to have a good foundation of data science so that you can test and improve your classifier. VEDO Intent carries out this optimisation for you.

The tool carries out a grid search across feature extraction options, feature selection strategies, and different machine learning algorithms (including support vector machines and Naive Bayes) and their associated hyperparameters.

The tool also carries out cross-validation across the grid search selecting the best combination of all the potential options to give you the highest quality classifier.
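As an illustration of this idea (the feature options and algorithms shown are examples, not the tool's actual search space), here's a sketch using scikit-learn's Pipeline and GridSearchCV:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def build_classifier(texts, labels):
    # A pipeline lets the grid search vary feature extraction and the algorithm together
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', LinearSVC()),
    ])

    # Each dict is one region of the grid: feature options plus an algorithm and its hyperparameters
    param_grid = [
        {'tfidf__ngram_range': [(1, 1), (1, 2)], 'clf': [LinearSVC()], 'clf__C': [0.1, 1, 10]},
        {'tfidf__ngram_range': [(1, 1), (1, 2)], 'clf': [MultinomialNB()], 'clf__alpha': [0.1, 1.0]},
    ]

    # Cross-validation scores every combination and keeps the best one
    search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_macro')
    search.fit(texts, labels)
    return search.best_estimator_, search.best_params_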

Using VEDO Intent

When using the tool you carry out the following steps.

1. Uploading sample data

The first step when using the tool is to upload sample data to work with. The data needs to be representative of the data you're looking to classify in future. 

Data is uploaded by dragging a file into the space at the top of the tool. Data needs to be in newline-delimited JSON format. If you've used our platform before, this is how we deliver raw data to you. For instance you could run a Historics task to extract data, then upload the delivered data to this tool.
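If you haven't worked with the format before, each line of the file is one complete JSON object. As a quick sketch of reading such a file in Python (the file name here is just an example):

import json

# Each line of the uploaded file is one complete JSON interaction
with open('sample_interactions.json') as f:
    interactions = [json.loads(line) for line in f if line.strip()]

print(len(interactions), 'interactions loaded')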

2. Deciding on classes

Next the tool will let you specify classes (or labels) you'd like to categorize your data into. 

Deciding on the right classes is very important. You can read more about this in Step 1 of our previous post about machine learning.

3. Labelling interactions

At this point you will be asked to start manually labelling your data. The tool will take you through the process described above which uses active learning techniques. 

For each presented interaction you need to either select a label or skip the interaction. You can also flag interactions to revisit them later if you're unsure what label to assign.

Aside from assigning classes to individual interactions you can also provide hints to the tool by specifying keywords for classes. The tool will use these hints when it next builds a model.

4. Building a classifier

Once you've labelled enough interactions you can train your first classifier. 

The tool will take some time to run this process as it carries out the grid search discussed above. When the grid search is complete the tool will return with the full classification results as well as a CSDL hash you can immediately use on the platform.

5. Reviewing results

The results view allows you to fully assess the quality of the classifier produced. 

The results include standard accuracy measurements (agreement, F1 score, precision, recall) for the classifier, comparing the training set you labelled to the predicted results. You can also review the confusion matrix for the classifier.

In addition you can download the 'mismatches' file to see where the prediction model succeeded and failed.

Based on the results you could at this point choose to label more data or revisit the labels and features you input to try to improve your classifier.

6. Running the classifier on the platform

Training a classifier results in CSDL code which is compiled for you on the platform, giving you a hash that you can include in any filter you choose. For example:

tags "68ca0c21208e02f996d4ca9e3e10d8f8"

return
{
    interaction.content contains_any "airways, flight, luggage, departure"
    AND language.tag == "en" AND language.confidence >= 70
}

You can then make use of these classes in your analysis.


By Richard Caudle

Exploring Keyword Relationships in Social Posts with word2vec

You might have seen our recent announcement covering many things including the launch of our new FORUM data science initiative. We're looking to share more of our experience and research to help innovation in social data analysis.

One of the first things we wanted to share is our work exploring the relationships between keywords and terms in social posts. Our data science team has been researching this area using word2vec, a library that models the relationships and similarity between words.

Using word2vec we've started to create models that are trained from large numbers of historical social posts. We're calling these Keyword Relationship Models and have released our first today. You can explore the model using our Keyword Relationship Explorer tool which we announced in another post.

These models are extremely useful as they reflect how language and keywords are really used by people. You can use such models to improve your filters, classification rules and analysis by making sure you include the key terms, phrases and misspellings used around your topic, and keep track of them as they evolve over time.

In this post we thought we'd share more detail on how we trained the underlying model.

Challenges of working with keywords

Identifying and expanding on keywords and terms is a key challenge when filtering, classifying and analyzing text data. 

The two key challenges you face when working with keywords are:

  • Coverage - it's not easy to get a comprehensive list of all the keywords representing a concept. You need to consider variations, synonyms, misspellings, new terms, slang and so on.
  • Noise - keywords often have multiple meanings, and this ambiguity can introduce irrelevant data into your classification. This leads to false positives in your results.

Keyword Relationship Models help us tackle both challenges. Starting from a seed term such as 'coffee', we can easily learn similar terms to include (such as cappuccino, caffeine, cafe) but also learn which words to exclude (such as bean, plant, roasting) - which terms belong in each group of course depends on your use case.

How we build Keyword Relationship Models

A Keyword Relationship Model is based upon word2vec, an open-source library that uses deep learning to represent words and n-grams as vectors in a multidimensional space. Essentially word2vec processes raw text and produces a model which is a representation of words in a vector space. This high-dimensional space gives words a measure of similarity.

We created the example model through the following steps.

1) Retrieved historic data

We pulled together 2 months of posts in English from our historic archives of social network traffic.

In total this set of interactions was almost 4 billion posts.

2) Pre-processed content

Next we pre-processed content in order to standardize tokens.

Pre-processing steps included:

  • Removed non-ASCII characters
  • Lower-cased all content
  • Separated consecutive hashtags and mentions with a white space
  • Replaced newlines with whitespaces
  • Replaced "&" and "&amp;" with " and "
  • Replaced other HTML entities with a white space
  • Removed hyphens within words
  • Removed single quotes within words
  • Replaced single and consecutive punctuation characters with a single white space
  • Collapsed whitespace - consecutive whitespaces merged into a single one
  • Kept alphanumeric strings
  • Kept hashtags
  • Replaced values for percentage, currency, time, date, mentions, numbers and links with unique placeholder tokens
  • Removed duplicate posts

We then applied tokenization to the cleansed content based on whitespace.

For example if the original piece of content read as follows:

Tom I Love this new #smartphone :)!, the one I found costs 300$ here

After pre-processing we arrived at:

["tom", "i", "love", "the", "new", "#smartphone", "the", "one", "i", "found", "costs", "PRICE", "here", "LINK"]
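As a rough illustration of steps like these (not the exact rules or regular expressions we used), the pre-processing and whitespace tokenization might look something like this in Python:

import re

def preprocess(text):
    # A subset of the clean-up steps above, followed by whitespace tokenization
    text = text.encode('ascii', 'ignore').decode()        # remove non-ascii characters
    text = text.lower()                                    # lower-case all content
    text = re.sub(r'https?://\S+', ' LINK ', text)         # links -> placeholder token
    text = re.sub(r'\$?\d+(\.\d+)?\$?', ' PRICE ', text)   # simple currency/number -> placeholder
    text = re.sub(r"[^\w#@\s]", ' ', text)                 # punctuation -> white space
    text = re.sub(r'\s+', ' ', text).strip()               # collapse consecutive whitespaces
    return text.split()

print(preprocess("Tom I Love this new #smartphone :)!, the one I found costs 300$ here"))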

3) Processed dataset using word2vec

After these steps the corpus was processed using word2vec (skip-gram, 400 dimensions) to obtain the final model. Terms with fewer than 40 occurrences were not included, while all others, a total of over 3 million including stopwords and placeholder tokens, are part of the model.
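For reference, training a model with the same settings using gensim's word2vec implementation (the older gensim interface, matching the query example in the next step) looks roughly like this, where the corpus simply streams the pre-processed token lists from disk:

from gensim.models import word2vec

class Corpus(object):
    # Stream pre-processed posts from disk, one token list at a time
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield line.split()  # content is already cleansed, so whitespace tokenization is enough

sentences = Corpus('cleansed_posts.txt')

# Skip-gram model, 400 dimensions, ignoring terms with fewer than 40 occurrences
model = word2vec.Word2Vec(sentences, sg=1, size=400, min_count=40, workers=8)
model.save_word2vec_format('keyword_relationship_model.bin', binary=True)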

4) Queried word2vec model

Once the model was trained we could then query the model for similar words based on a seed term. For example in Python:

from gensim.models import word2vec

# Load the released Keyword Relationship Model (binary word2vec format)
model = word2vec.Word2Vec.load_word2vec_format('<path to model file>', binary=True)

# Find the 10 terms most similar to the seed term '#nfl'
similar_words = model.most_similar(positive=['#nfl'], topn=10)
print(similar_words)

This is how our explorer tool was built.

What's next?

You can start exploring our first model straight away using our Keyword Relationship Explorer tool.

Watch this space as we'll be releasing more models in future and giving you more ways to use them in your own projects.

Why not register your interest in our FORUM initiative so you can learn more about bringing data science to your projects?

By Richard Caudle

How To Update Filters On-The-Fly And Build Dynamic Social Solutions

It would be easy if the world around us was static, but in practice things are always changing. Nowhere is this truer than in the world of social networks; users are constantly following new friends and expressing new thoughts. The filter you wrote yesterday is probably already out-of-date! 
On the DataSift platform you can update your filters on the fly via the API and avoid downtime for your application. This not only allows you to adapt to real-world changing scenarios, but in fact allows you to build much more powerful, dynamic social solutions. In this post I'll show you how this can be done.

Embracing Change

If you've ever built a software solution you'll know that things aren't quite as simple as you'd hope. The real world is always changing. 
For example imagine you're tracking conversation around a news story. You build a simple filter which looks for the terms and personalities involved in the story. This works great, but a few hours later the story has evolved. As stories evolve it is inevitable that the terms people are using to discuss it change. You'll need to react to this without missing any conversations.
Or, maybe you've built an awesome social app that allows users to input their interests, and you're creating a filter from that input. The next day the user updates their interests. You'll need to update your filter to the new interests without interrupting your service to the user.
A well-designed solution takes change in its stride.

Broad Brush Overview

Ok, so we want to build our wonderfully dynamic, super-duper-flexible social solution. What does this mean in practice? On the DataSift side of things we want to be able to update our stream definitions (filtering & tagging) on the fly, delivering data to the same destination, without missing any data.
Before we get to the deeper details, the broad principles are:
  • Create V1 of our stream: Build V1 of our stream definition, for instance from user input
  • Start consuming V1: Compile and stream V1 of our stream as usual via the API
  • Create V2 of our stream: Something has changed! Build V2 of our stream to adapt.
  • Start consuming V2: In parallel with streaming V1, we'll start streaming V2 of our stream.
  • Stop consuming V1: When we're happy V2 is streaming nicely, we'll stop streaming V1.
Essentially to avoid downtime (or missing data) we have a brief period where we're streaming both versions in parallel. Note we will need to handle de-duplication during this brief period. 

Let's Do It

Ok, so that's the principles explained. Let's see this in practice.
I wrote a stream last week to track conversations around popular games. Let's use this as an example. 
(For the complete example code take a look at this GIST.)

Create Stream V1

Version 1 of our stream will look for mentions of five popular games: 2048, Farmville 2, Swamp Attack, Trials Frontier and Don't Step The White Tile.
Note this is a simple illustrative example. In practice you might want to look for mentions by inspecting links being shared for instance.

Start Consuming V1

Now that we have our stream defined, we can compile the definition and start consuming data. In this example we'll use the Pull destination to get resulting data.
For this example I'll use the Python helper library.

Create Stream V2

We're now happily consuming data. But wait! There's a new game that's entered the charts that we must track. The game is Clash of Clans, and it must be added to our filter.
It's easy to imagine you could generate such a filter from an API which gives you the latest game charts.
The updated filter looks as follows (notice the use of the contains_near operator to tolerate missing words from the title):
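The full stream definition is in the GIST linked above; purely as an illustrative sketch (the CSDL string below is an assumption about its shape, not the actual filter, and the client calls assume the datasift-python helper library), compiling the updated definition might look like this:

from datasift import Client

client = Client('[your username]', '[your api key]')

# V2 of the filter adds the new game; contains_near tolerates a missing word in the title
csdl_v2 = '''
interaction.content contains_any "2048, Farmville 2, Swamp Attack, Trials Frontier"
OR interaction.content contains_near "Step,White,Tile:2"
OR interaction.content contains_near "Clash,Clans:2"
'''

compiled = client.compile(csdl_v2)
stream_hash_v2 = compiled['hash']
print('V2 hash:', stream_hash_v2)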

Start Consuming V2

The next step is to start streaming V2 of the stream in parallel with V1. 

De-duplicating Data

We now have two streams running in parallel. Until we stop stream 1 there's a good chance that the same interaction might be received on both streams, so it's important we de-duplicate the data received. 
How you go about this completely depends on the solution being built. Whichever way you choose, you can use the interaction.id property of the interaction as a unique identifier. One way would be to have a unique key in a database (if this is where your data is being stored); another simple way would be to have a rolling in-memory list of IDs, say for the last 5 minutes. Of course this decision depends on the volume of data you expect and the scale of your solution.
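As a minimal sketch of the rolling in-memory approach (assuming interactions arrive as parsed dictionaries):

import time

class RecentIds(object):
    # Remember interaction IDs seen recently so duplicates can be dropped
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.seen = {}  # interaction id -> time first seen

    def is_duplicate(self, interaction):
        now = time.time()
        # Forget anything older than the rolling window
        self.seen = {i: t for i, t in self.seen.items() if now - t < self.ttl}
        interaction_id = interaction['interaction']['id']
        if interaction_id in self.seen:
            return True
        self.seen[interaction_id] = now
        return False

recent = RecentIds()

def handle(interaction):
    # Call this for every interaction received from either stream
    if recent.is_duplicate(interaction):
        return
    # ...store or forward the interaction as normal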

Stop Consuming V1

Now that we have started streaming V2 of the stream we can stop consuming data from V1.
When you start the second stream it will start immediately. However, if you want to be doubly sure that you do not miss any data we recommend that you wait for the first interaction from stream V2 to be received before stopping stream V1. Note that the platform will charge you for DPUs consumed and data received for each stream individually.

In Conclusion

And so ends my quick tour. I hope this post illustrates how you can switch to new stream definitions on the fly. This capability is likely to be key to real-world solutions you create, and hopefully inspires you to create some truly responsive applications.
For the complete example code take a look at this GIST.
To stay in touch with all the latest developer news please subscribe to our RSS feed, or follow us on Twitter at @DataSiftDev.


By Jason

Facebook Pages Managed Source Enhancements

Taking into account some great customer feedback, on May 1st, 2014 we released a number of minor changes to our Facebook Pages Managed Source. 

Potential Breaking Changes

Facebook Page Like and Comment Counts have been Deprecated

The facebook_page.likes_count and facebook_page.comment_count fields have been deprecated from DataSift's output. We found this data became outdated quickly; a better practice for displaying counts of likes and comments in your application is to count like and comment interactions as you receive them. 

Format for facebook_page.message_tags has Changed

facebook_page.message_tags fields were previously in two different formats, dependent on whether they came from comments or posts. This change ensures that all message_tags are provided in a consistent format: as a list of objects. An example of the new consistent format can be seen below:
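Purely as an illustration of the list-of-objects shape (the field names here are based on Facebook's message_tags objects and may differ from the exact DataSift output):

"message_tags": [
    { "id": "1234567890", "name": "Example Page", "type": "page", "offset": 10, "length": 12 }
]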
Please ensure that if your application utilizes these fields, it can handle them as a list of objects.

New Output Fields

We have introduced a number of new output fields in interactions from the Facebook Pages Managed Source. You will be able to filter on many of these fields.

New “Page Like” Interactions

By popular request, we have introduced a new interaction with the subtype “page_like” for anonymous page-level likes.
This should now allow you to track the number of likes for a given page over time.
This subtype has two fields, `current_likes` and `likes_delta`. The first represents the current number of likes for a Facebook Page at the time of retrieval. The second represents the difference from the previously retrieved value. We only generate interactions of this type if `likes_delta` is not zero. Also note that `likes_delta` can be negative, when the number of unlikes is greater than the number of likes between two retrievals.
This interaction type should allow visualizing page likes as a time series. In addition, filters on `likes_delta` could be used to detect trending pages.
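For example, a short sketch of turning these interactions into a time series (the field paths are assumptions based on the description above):

def page_likes_series(interactions):
    # Build (timestamp, current_likes) points from page_like interactions
    points = []
    for interaction in interactions:
        if interaction.get('interaction', {}).get('subtype') != 'page_like':
            continue
        points.append((
            interaction['interaction']['created_at'],
            interaction['facebook_page']['current_likes'],
        ))
    return sorted(points)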

‘from' Fields now Include a Username Where Available

Where it is provided to us, .from fields in Facebook Pages interactions now contain a .username field.
Please note that in some cases, this field is not returned by Facebook.

New Comment ‘Parent' Field

Objects of type comment include an optional .parent object, which contains a reference to a parent comment. The object structure is self-similar.
This will allow you to tell whether comments are nested or not, and associate them with a parent comment if so.

New ‘From’ Field in Post Objects

Objects of type comment/like include an additional .from field in their .post context object, which contains information about the author of the post they are referring to.

New CSDL Targets

We have introduced 12 new Facebook Pages targets. This includes targets to allow you to filter on the likes count of a page, the parent post being commented on, a Facebook user's username, and more. These new targets can all be found in our Facebook Pages targets documentation.

Other Changes

New Notifications for Access Token Issues

If a case occurs where all tokens for a given source have permanent errors, the source will become “disabled", and you will receive a notification. You should then update the source with new tokens, and restart it. 
Note that every error will also be present in the /source/log for that Managed Source.

Summary of Changes

  • facebook_page.likes_count and facebook_page.comment_count fields will be deprecated from DataSift's output
  • The facebook_page.message_tags output field format is changing to become a list of objects
  • We are introducing a new interaction with the subtype “page_like” for anonymous page-level likes
  • .from fields in Facebook Pages interactions now contain a .username field where available
  • Comment interactions will now include a parent object, referencing the parent comment
  • We are introducing a .from field to Facebook Pages .post objects, containing information about the post author
  • We are introducing a number of new CSDL targets for Facebook Pages
  • You will receive better notifications about issues with your Facebook Access Tokens