Blog

By Richard Caudle

DataSift PYLON - The value of unified data processing

Applying CSDL and VEDO to Facebook topic data

You might have seen our recent announcement that developers can now gain insights from Facebook topic data. No doubt you're eager to learn more! In this post we'll look at how easy it is to take CSDL you've fine-tuned for filtering data from networks such as Twitter or Tumblr, and apply it in PYLON not only for filtering but also for classifying Facebook topic data. This demonstrates the simplicity of using a unified data processing platform.
 
Before we jump in too deep, you're probably keen to know exactly what PYLON is, so let's give you a quick intro…

What is PYLON?

PYLON is a new API that enables a privacy-first approach to analyzing Facebook posts and engagement data. Using the PYLON API, for the first time you can build insights from Facebook's topic data.
 
With PYLON you can analyze the billions of posts and engagements that take place on Facebook every day, while respecting the privacy of individual people using Facebook.
 
The aggregate, anonymized information that can be extracted from Facebook topic data falls broadly into two categories, posts and engagement data:
 
  • Posts on pages: Content, Topics, Links and Hashtags
  • Engagement data: Comments, Shares, Likes

Privacy-First

PYLON gives you access to Facebook topic data, but it also protects the identity of individuals on Facebook. Personally identifiable information (PII) is never exposed.

  • You receive statistical summarized results, never individual posts.
  • A minimum audience threshold of 100 is applied to any analysis you perform, to protect against individual-level analysis.
  • Data is processed within Facebook’s own data center. Raw data never leaves Facebook’s servers.
  • Interaction data is only available for analysis for up to 30 days, after which time it is deleted from the service.

So with PYLON you can generate insights that were not possible before, in a way that ensures privacy for Facebook users!

PYLON Workflow

If you've worked with DataSift before you'll be used to the Stream workflow:
 
  • Create filters in CSDL against your enabled data sources
  • Add classification rules to your filters using VEDO
  • Deliver raw, classified data to your chosen destination
 
PYLON works differently because, to maintain privacy, the output is delivered as summary results:
 
  • Create a filter that matches interactions on Facebook
  • Include classification rules to add value to the data
  • Record output from this filter into a private index
  • Submit analysis queries to the index to receive analysis results
 
 
The data you filter is recorded to a PYLON index only you have access to. You access your index using analysis queries which return summary statistical results from Facebook topic data.
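To make this concrete, here's a minimal sketch of the workflow in Python. It assumes the DataSift Python helper library exposes PYLON through pylon.compile, pylon.start and pylon.analyze methods; treat the exact names and signatures as assumptions and check them against the client library documentation.

# A minimal sketch of the PYLON workflow.
# Method names and signatures are assumptions based on the Python
# helper library's PYLON support - verify them against the docs.
from datasift import Client

client = Client("your_username", "your_api_key")

# 1. Create a filter (optionally with tag rules) and compile it
csdl = 'interaction.content contains_any "Candy Crush,Game of War"'
compiled = client.pylon.compile(csdl)

# 2. Start recording matching interactions into your private index
client.pylon.start(compiled["hash"], "mobile-games-index")

# 3. Later, submit an analysis query against the index
result = client.pylon.analyze(
    compiled["hash"],
    {"analysis_type": "freqDist", "parameters": {"target": "fb.author.age"}},
)
print(result)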
 

Apply your CSDL expertise to Facebook topic data

Until now, as a DataSift customer you'll have been using our Stream product, which provides access to filter a variety of social data sources such as Twitter, Tumblr, and blogs.
 
In this scenario, you'll have created filters, for example to capture tweets about popular mobile games:
 
(
    interaction.content contains_any "Game of War,Boom Beach,Pet Rescue,Candy Crush,Don't Tap,2048,FarmVille 2"
    OR interaction.content contains_near "Clash,Clans:5"
    OR interaction.content contains_near "Puzzle,Dragons:5"
    OR interaction.content contains_all "step,white,tile"
)
AND NOT interaction.content wildcard "cheat*"
AND NOT interaction.content wildcard "hack*"
 
Even if you're not familiar with CSDL syntax, you can see how easy it is to work across different data fields (which we call targets), make use of operators such as contains_any, and combine many conditions with boolean operators.
 

Applying CSDL to Facebook topic data

With PYLON, just like Stream, your first step is to create a filter for the interactions you'd like to record to your index.
 
The best thing about PYLON is that, under the hood, exactly the same filtering engine is used to filter the Facebook topic data stream as is used to filter the Twitter firehose.
 
So, let's take our example CSDL from above and index the Facebook topic data that matches exactly the same criteria we used with Twitter:
 
(
    interaction.content contains_any "Game of War,Boom Beach,Pet Rescue,Candy Crush,Don't Tap,2048,FarmVille 2"
    OR interaction.content contains_near "Clash,Clans:5"
    OR interaction.content contains_near "Puzzle,Dragons:5"
    OR interaction.content contains_all "step,white,tile"
)
AND NOT interaction.content wildcard "cheat*"
AND NOT interaction.content wildcard "hack*"
 
Yes, that's right - the CSDL is exactly the same! 
 
This works because common data fields in both sources are mapped to the interaction namespace. In this case we're making heavy use of interaction.content, which is the content of the post or tweet. And regardless of the data field you're filtering on, operators such as contains_any work in exactly the same way.
 
Of course, you can also utilize the unique data available from each network. For example, Facebook topic data contains audience demographics and topic data which are not present in Twitter. With this, you could expand your CSDL to focus your analysis on males who live in California talking about cars:
 
fb.author.gender == "male" and fb.author.region == "California"
and fb.topics.category == "Cars"
 
With Facebook topic data you can filter on demographics, but what's even better is that PYLON ensures privacy for Facebook users, as returned results are anonymous, aggregated summaries.
 

Classifying with VEDO

The power of CSDL doesn't stop at filtering to identify data for analysis. You can use the same language, operators and data fields to classify your data using VEDO.
 
Classifying data this way is particularly important when using PYLON, as privacy safeguards prevent you from accessing raw data. By using VEDO with PYLON you can add extra value to the data before it is saved in your index, and then make use of this extra data when submitting your analysis queries.
 
Let's build on our example above by adding tags to data that has been filtered. We can do so by adding tag rules to our filter:
 
tag.game "Candy Crush" { interaction.content contains "Candy Crush"} \ntag.game "Game of War" { interaction.content contains "Game of War"} \ntag.game "Boom Beach" { interaction.content contains "Boom Beach"} \ntag.game "Pet Rescue" { interaction.content contains "Pet Rescue"} \ntag.game "Don't Tap" { interaction.content contains "Don't Tap"} \ntag.game "2048" { interaction.content contains "2048"} \ntag.game "FarmVille 2" { interaction.content contains "FarmVille 2"} \ntag.game "Clash of the Clans" { interaction.content contains_near "Clash,Clans:5"} \ntag.game "Puzzles & Dragons" { interaction.content contains_near "Puzzles,Dragons:5"} \ntag.game "Don't Step On The White Tile" { interaction.content contains_all "step,white,tile" }
 
Again, this code can be run against Twitter data using the Stream product or against Facebook data in PYLON.
 
We can take things further by applying machine-learned classifiers. For instance, we could create a custom sentiment classifier, then apply it to both Facebook topic data and Twitter.
 

Analyzing Facebook topic data

Finally, to complete our workflow, let's take a quick look at submitting analysis queries in PYLON. As you cannot access the raw data, this is how you get insights from the data you've recorded to your index.
 
When you submit a query to PYLON you specify the data field you'd like to analyze, what analysis to perform and which segment of the data in your index to analyze.
 
Here you get to use your CSDL skills once more, as CSDL is used to filter the data in your index into precise segments. You can drill down deeper and deeper to get very detailed insights.
 
As a quick example we could start by analyzing the age groups in our entire dataset as a frequency distribution. In PYLON you specify the following arguments:
 
{
    "analysis_type": "freqDist",
    "parameters":
    {
        "target": "fb.author.age"
    }
}
 
Using CSDL we can take this further by selecting a precise subset, based upon the demographics and tags we introduced in our example. To do so we simply specify a filter when making the analysis request:
 
interaction.tag_tree.game == "Candy Crush" AND fb.author.gender == "female"
AND fb.author.region == "UK"
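Putting the pieces together, an analysis request might look like the sketch below, continuing the Python example from earlier. The pylon.analyze call and its filter argument are assumptions; the parameters and filter values mirror the JSON and CSDL shown above.

# Sketch: segment the index with a CSDL query filter, then analyze.
# The analyze signature and result structure are assumptions.
parameters = {
    "analysis_type": "freqDist",
    "parameters": {"target": "fb.author.age"},
}
query_filter = (
    'interaction.tag_tree.game == "Candy Crush" '
    'AND fb.author.gender == "female" '
    'AND fb.author.region == "UK"'
)
result = client.pylon.analyze(compiled["hash"], parameters, filter=query_filter)
print(result)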
 
Immediately this gives us some powerful insights, breaking down players by age group.
 
 
This is just a taste of how powerful PYLON and VEDO are when analyzing Facebook topic data! In future posts we'll look at filtering, classification and analysis queries in much more depth.

 

By Richard Caudle

New Community Site Launched!

Things change very fast at DataSift, and it can be hard to keep up.
 
This week we've released our new community site at community.datasift.com.
 
 
 
The community site hosts our forums going forward. It's the best place to ask questions and have them answered by our staff and other developers, and you can also leave feedback and make suggestions to our team.
 
Another role of the community site is to keep you better informed of platform changes and to build a community around our platform. The Announcements category will be used to keep you up-to-date with platform changes, and soon we will be hosting events for developers, which will also be announced to the community.
 
Of course if you have a support package our Support team is always on hand too.

Subscribing to Announcements

To subscribe to new posts in any category (although Announcements will probably be the most interesting to get you started), you need to do the following:

  • Visit community.datasift.com
  • Click Log In in the top right corner
  • The new site has SSO integration with dev.datasift.com. If you already have an account, sign in; if not, register for a developer account.
  • Go to the Announcements category 
  • Click the top right dropdown and choose Watching

 
 
By Richard Caudle

How To Update Filters On-The-Fly And Build Dynamic Social Solutions

It would be easy if the world around us was static, but in practice things are always changing. Nowhere is this truer than in the world of social networks; users are constantly following new friends and expressing new thoughts. The filter you wrote yesterday is probably already out-of-date! 
 
On the DataSift platform you can update your filters on the fly via the API and avoid downtime for your application. This not only allows you to adapt to real-world changing scenarios, but in fact allows you to build much more powerful, dynamic social solutions. In this post I'll show you how this can be done.
 

Embracing Change

If you've ever built a software solution you'll know that things aren't quite as simple as you'd hope. The real world is always changing. 
 
For example, imagine you're tracking conversation around a news story. You build a simple filter which looks for the terms and personalities involved in the story. This works great, but a few hours later the story has evolved; as stories evolve, the terms people use to discuss them inevitably change. You'll need to react to this without missing any conversations.
 
Or maybe you've built an awesome social app that lets users input their interests, and you're creating a filter from that input. The next day a user updates their interests. You'll need to update your filter to the new interests without interrupting your service to the user.
 
A well-designed solution takes change in its stride.
 

Broad Brush Overview

Ok, so we want to build our wonderfully dynamic, super-duper-flexible social solution. What does this mean in practice? On the DataSift side of things, we want to be able to update our stream definitions (filtering and tagging) on-the-fly, delivering data to the same destination without missing any data.
 
Before we get to the deeper details, the broad principles are:
 
  • Create V1 of our stream: Build V1 of our stream definition, for instance from user input
  • Start consuming V1: Compile and stream V1 of our stream as usual via the API
  • Create V2 of our stream: Something has changed! Build V2 of our stream to adapt.
  • Start consuming V2: In parallel with streaming V1, we'll start streaming V2 of our stream.
  • Stop consuming V1: When we're happy V2 is streaming nicely, we'll stop streaming V1.

Essentially to avoid downtime (or missing data) we have a brief period where we're streaming both versions in parallel. Note we will need to handle de-duplication during this brief period.
 

Let's Do It

Ok, so that's the principles explained. Let's see this in practice.
 
I wrote a stream last week to track conversations around popular games. Let's use this as an example. 
 
(For the complete example code take a look at this GIST.)
 

Create Stream V1

Version 1 of our stream will look for mentions of five popular games: 2048, FarmVille 2, Swamp Attack, Trials Frontier and Don't Step The White Tile.
 
Note this is a simple illustrative example. In practice you might want to look for mentions by inspecting links being shared for instance.
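As a sketch (the definitive version is in the GIST), V1 of the filter might look like this in CSDL:

interaction.content contains_any "2048,FarmVille 2,Swamp Attack,Trials Frontier"
OR interaction.content contains_all "step,white,tile"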
 
 

Start Consuming V1

Now that we have our stream defined, we can compile the definition and start consuming data. In this example we'll use the Pull destination to retrieve the resulting data.
 
For this example I'll use the Python helper library.
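The GIST uses the helper library; since helper-library signatures vary between versions, here's an equivalent sketch against DataSift's REST endpoints. The endpoint paths and parameter names follow the push API documentation, but treat them as assumptions.

# Sketch: compile V1 of the stream and consume it via the Pull
# destination. Endpoint and parameter names are assumptions - see
# the GIST for the complete, definitive example.
import requests

API = "https://api.datasift.com/v1"
HEADERS = {"Authorization": "your_username:your_api_key"}

csdl_v1 = ('interaction.content contains_any '
           '"2048,FarmVille 2,Swamp Attack,Trials Frontier" '
           'OR interaction.content contains_all "step,white,tile"')

# Compile the V1 CSDL into a stream hash
hash_v1 = requests.post(API + "/compile", headers=HEADERS,
                        data={"csdl": csdl_v1}).json()["hash"]

# Create a push subscription using the Pull destination
sub_v1 = requests.post(API + "/push/create", headers=HEADERS, data={
    "name": "games-v1",
    "hash": hash_v1,
    "output_type": "pull",
}).json()["id"]

# Periodically pull down the buffered interactions
for interaction in requests.get(API + "/pull", headers=HEADERS,
                                params={"id": sub_v1}).json():
    print(interaction["interaction"]["content"])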
 
 

Create Stream V2

We're now happily consuming data. But wait! There's a new game that's entered the charts that we must track. The game is Clash of the Clans, and it must be added to our filter.
 
It's easy to imagine you could generate such a filter from an API which gives you the latest game charts.
 
The updated filter looks as follows (notice the use of the contains_near operator to tolerate missing words from the title):
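(A sketch; the definitive version is in the GIST.)

interaction.content contains_any "2048,FarmVille 2,Swamp Attack,Trials Frontier"
OR interaction.content contains_all "step,white,tile"
OR interaction.content contains_near "Clash,Clans:5"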
 
 

Start Consuming V2

The next step is to start streaming V2 of the stream in parallel with V1. 
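Continuing the REST sketch from above (the same assumptions apply):

# Sketch: compile V2 and start a second Pull subscription alongside V1.
csdl_v2 = csdl_v1 + ' OR interaction.content contains_near "Clash,Clans:5"'

hash_v2 = requests.post(API + "/compile", headers=HEADERS,
                        data={"csdl": csdl_v2}).json()["hash"]

sub_v2 = requests.post(API + "/push/create", headers=HEADERS, data={
    "name": "games-v2",
    "hash": hash_v2,
    "output_type": "pull",
}).json()["id"]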
 
 

De-duplicating Data

We now have two streams running in parallel. Until we stop stream V1 there's a good chance that the same interaction might be received on both streams, so it's important we de-duplicate the data received.
 
How you go about this depends entirely on the solution being built. Whichever way you choose, you can use the interaction.id property of the interaction as a unique identifier. One way would be to enforce a unique key in a database (if this is where your data is being stored); another simple way would be to keep a rolling in-memory list of IDs, say for the last 5 minutes. Of course this decision depends on the volume of data you expect and the scale of your solution.
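As an illustration of the rolling in-memory approach, here's a small sketch that drops any interaction whose interaction.id has been seen within the last 5 minutes:

# Sketch: de-duplicate interactions across the two parallel streams
# using a rolling in-memory map of recently seen interaction IDs.
import time

SEEN_TTL = 300  # remember IDs for 5 minutes
seen = {}       # interaction.id -> timestamp when first seen

def is_duplicate(interaction):
    """Return True if this interaction was already received recently."""
    now = time.time()
    # Evict IDs older than the TTL so the map stays small
    for iid, ts in list(seen.items()):
        if now - ts > SEEN_TTL:
            del seen[iid]
    iid = interaction["interaction"]["id"]
    if iid in seen:
        return True
    seen[iid] = now
    return False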
 

Stop Consuming V1

Now that we have started streaming V2 of the stream we can stop consuming data from V1.
 
The second stream will start delivering data as soon as you create it. However, if you want to be doubly sure that you do not miss any data, we recommend waiting for the first interaction from stream V2 to be received before stopping stream V1. Note that the platform will charge you for DPUs consumed and data received for each stream individually.
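Continuing the sketch, we poll V2 until it delivers its first interactions, then stop V1 (the /push/stop parameter name is an assumption):

# Sketch: wait until stream V2 delivers data, then stop stream V1.
import time

while not requests.get(API + "/pull", headers=HEADERS,
                       params={"id": sub_v2}).json():
    time.sleep(10)  # no V2 data yet - check again shortly

requests.post(API + "/push/stop", headers=HEADERS, data={"id": sub_v1})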
 
 

In Conclusion

And so ends my quick tour. I hope this post illustrates how you can switch to new stream definitions on the fly. This capability is likely to be key to the real-world solutions you create, and hopefully it inspires you to build some truly responsive applications.
 
For the complete example code take a look at this GIST.
 
To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed
 
Or follow us on Twitter at @DataSiftDev.

 

By Jason

Facebook Pages Managed Source Enhancements

Taking into account some great customer feedback, on May 1st, 2014 we released a number of minor changes to our Facebook Pages Managed Source. 
 

Potential Breaking Changes

Facebook Page Like and Comment Counts have been Deprecated

The facebook_page.likes_count and facebook_page.comment_count fields have been deprecated from DataSift's output. We found this data became outdated quickly; a better practice for displaying counts of likes and comments in your application is to count like and comment interactions as you receive them. 
 

Format for facebook_page.message_tags has Changed

facebook_page.message_tags fields were previously in two different formats, dependent on whether they came from comments or posts. This change ensures that all message_tags are provided in a consistent format: as a list of objects. An example of the new consistent format can be seen below:
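(An illustrative sample: the values are made up, but the shape, a list of tag objects with Facebook's usual id, name, type, offset and length fields, matches the change described.)

"message_tags": [
    {
        "id": "1234567890",
        "name": "Example Page",
        "type": "page",
        "offset": 12,
        "length": 12
    }
]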
 
 
Please ensure that if your application utilizes these fields, it can handle them as a list of objects.
 
 

New Output Fields

We have introduced a number of new output fields in interactions from the Facebook Pages Managed Source. You will be able to filter on many of these fields.
 

New “Page Like” Interactions

By popular request, we have introduced a new interaction with the subtype “page_like” for anonymous page-level likes.
This should now allow you to track the number of likes for a given page over time.
 
 
This subtype has two fields, `current_likes` and `likes_delta`. The first represents the current number of likes for a Facebook Page at the time of retrieval. The second represents the difference from the previously retrieved value. We only generate interactions of this type if `likes_delta` is non-zero. Also note that `likes_delta` can be negative, when the number of unlikes is greater than the number of likes between two retrievals.
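Below is an illustrative sample; the two fields are as described above, while the exact nesting and surrounding fields are assumptions:

{
    "interaction": {
        "subtype": "page_like"
    },
    "facebook_page": {
        "current_likes": 1052300,
        "likes_delta": 1450
    }
}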
 
This interaction type allows you to visualize page likes as a time series. In addition, filters on `likes_delta` could be used to detect trending pages.
 

‘from' Fields now Include a Username Where Available

Where it is provided to us, .from fields in Facebook Pages interactions now contain a .username field.
 
 
Please note that in some cases, this field is not returned by Facebook.
 

New Comment ‘Parent' Field

Objects of type comment include an optional .parent object, which contains a reference to the parent comment. The object structure is self-similar.
 
This will allow you to tell whether comments are nested or not, and associate them with a parent comment if so.
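A hedged sketch of the shape (only the nested .parent structure comes from the change described; the surrounding field names are illustrative):

"comment": {
    "id": "98765_222",
    "message": "A reply to another comment",
    "parent": {
        "id": "98765_111",
        "message": "The original comment"
    }
}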
 
 

New ‘From’ Field in Post Objects

Objects of type comment and like include an additional .from field in their .post context object, which contains information about the author of the post they refer to.
 
 

New CSDL Targets

We have introduced 12 new Facebook Pages targets. This includes targets to allow you to filter on the likes count of a page, the parent post being commented on, a Facebook user's username, and more. These new targets can all be found in our Facebook Pages targets documentation.
 

Other Changes

New Notifications for Access Token Issues

If all the tokens for a given source have permanent errors, the source will become “disabled” and you will receive a notification. You should then update the source with new tokens and restart it.
 
Note that every error will also be present in the /source/log for that Managed Source.
 

Summary of Changes

  • The facebook_page.likes_count and facebook_page.comment_count fields have been deprecated from DataSift's output
  • The facebook_page.message_tags output field format has changed to become a list of objects
  • We have introduced a new interaction with the subtype “page_like” for anonymous page-level likes
  • .from fields in Facebook Pages interactions now contain a .username field where available
  • Comment interactions now include a parent object, referencing the parent comment
  • We have introduced a .from field to Facebook Pages .post objects, containing information about the post author
  • We have introduced a number of new CSDL targets for Facebook Pages
  • You will now receive better notifications about issues with your Facebook Access Tokens
 
By Richard Caudle

Platform Updates - Content Age Filtering, Larger Compressed Data Deliveries

This is a quick post to update you on some changes we've introduced recently to help you work with our platform and make your life a little easier.
 

Filtering On Content Age

We aim to deliver data to you as soon as we possibly can, but for some sources there can be a delay, outside of our control, between publication to the web and our delivery.
 
In most cases this has no impact, but in some situations (perhaps you only want to display extremely fresh content to a user) it is an issue.
 
For these sources we have introduced a new target, .age, which allows you to specify the maximum time since the content was posted. For instance, if you want to filter for blog posts mentioning 'DataSift' while making sure that you only receive posts published within the last hour:
 
blog.content contains "DataSift" AND blog.age < 3600
 
This new target applies to the Blog, Board, DailyMotion, IMDB, Reddit, Topix, Video and YouTube sources.
 

Push Destinations - New Payload Options

Many of our customers tell us they can take much larger data volumes from our system. We aim to please, so we have introduced options to help you get more data, faster.
 

Increased Payload Sizes

To enable you to receive more data faster from our push connectors, we have increased the maximum delivery sizes for many of our destinations. See the table below for the new maximum delivery sizes.
 

Compression Support

As the data we deliver to you is text, compression can greatly reduce the size of the files we deliver, making transport far more efficient. Although compression rates vary, we typically see an 80% reduction in file size with this option enabled.
 
We have introduced GZip and ZLib compression to our most popular destinations. You can enable compression on a destination by selecting the option in your dashboard, or by specifying the output_param.compression parameter through the API.
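For example, enabling compression when creating a push subscription through the API might look like the sketch below. The endpoint and parameter names are assumptions based on the push API conventions and the parameter named in this post:

# Sketch: create an S3 push destination with GZip compression enabled.
# Endpoint and parameter names are assumptions - check the push API docs.
import requests

requests.post("https://api.datasift.com/v1/push/create",
              headers={"Authorization": "your_username:your_api_key"},
              data={
                  "name": "compressed-delivery",
                  "hash": "your_stream_hash",
                  "output_type": "s3",
                  "output_param.compression": "gzip",
                  # ...plus the usual S3 bucket and credential parameters
              })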
 
When data is delivered you can tell it has been compressed in two ways:
 
  • HTTP destination: The HTTP header 'X-DataSift-Compression' will have the value none, zlib or gzip as appropriate
  • S3 and SFTP destinations: Files delivered to your destination will have an additional '.gz' extension if they have been compressed, for example DataSift-xxxxxxxxxxxxxxxxxxx-yyyyyyy.json.gz
 
Here's a summary of our current push destinations support for these features.
 
Destination     Maximum Payload Size    Compression Support
HTTP            200 MB                  GZip, ZLib
S3              200 MB                  GZip
SFTP            50 MB                   GZip
CouchDB         50 MB                   -
ElasticSearch   200 MB                  -
FTP             200 MB                  -
MongoDB         50 MB                   -
MySQL           50 MB                   -
PostgreSQL      50 MB                   -
Pull            50 MB                   -
Redis           50 MB                   -
Splunk          50 MB                   -

Stay Up-To-Date

To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed
 
Or follow us on Twitter at @DataSiftDev.
