DataSift PYLON - The value of unified data processing

Richard Caudle | 13th March 2015

Applying CSDL and VEDO to Facebook topic data

You might have seen our recent announcement that developers can now gain insights from Facebook topic data. No doubt you're eager to learn more! In this post we'll look at how easy it is to take CSDL you've fine-tuned for filtering data from networks such as Twitter or Tumblr and apply it in PYLON, not only for filtering but also for classifying Facebook topic data. This demonstrates the simplicity of using a unified data processing platform.

Before we jump in too deep, at this point you're probably keen to know exactly what PYLON is. Let's give you a quick intro…

What is PYLON?

PYLON is a new API that enables a privacy-first approach to analyzing Facebook posts and engagement data. Using the PYLON API, for the first time you can build insights from Facebook's topic data.

With PYLON you can analyze the billions of posts and media engagements that take place on Facebook every day, while respecting the privacy of the individual people using Facebook.

The sources of aggregate and anonymized information that can be extracted from Facebook topic data can be broadly categorized as posts and engagement data:

  • Posts on pages: Content, Topics, Links and Hashtags
  • Engagement data: Comments, Shares, Likes

Privacy-First

PYLON gives you access to Facebook topic data, but it also protects the identity of individuals on Facebook. Personally identifiable information (PII) is never exposed.

  • You receive summarized statistical results, never individual posts.
  • A minimum audience-threshold of 100 is applied to any analysis you perform to protect from individual-level analysis.
  • Data is processed within Facebook’s own data centre. Raw data never leaves Facebook’s servers.
  • Interaction data is only available for analysis for up to 30 days, after which time it is deleted from the service.
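
To make the audience threshold concrete, here is a minimal Python sketch of the idea. It is not PYLON's implementation: the function name `summarize_with_threshold`, the `author_id` key and the in-memory records are all hypothetical, and the sketch assumes the threshold is measured in unique authors.

```python
from collections import Counter

MIN_AUDIENCE = 100  # the threshold quoted in the list above


def summarize_with_threshold(records, field, min_audience=MIN_AUDIENCE):
    """Return counts per value of `field`, or None if the audience is too small.

    `records` is a list of dicts with a hypothetical "author_id" key. Treating
    the threshold as a count of unique authors is an assumption of this sketch.
    """
    unique_authors = {r["author_id"] for r in records}
    if len(unique_authors) < min_audience:
        return None  # too few people: withhold the result entirely
    return dict(Counter(r[field] for r in records))


small = [{"author_id": i, "gender": "female"} for i in range(10)]
large = [
    {"author_id": i, "gender": "male" if i % 2 else "female"}
    for i in range(200)
]

print(summarize_with_threshold(small, "gender"))  # withheld
print(summarize_with_threshold(large, "gender"))  # aggregated counts only
```

The point of the design is that the caller only ever sees aggregated counts, and sees nothing at all when the group is small enough to risk identifying individuals.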

So with PYLON you can generate insights that were not possible before, in a way that ensures privacy for Facebook users!

PYLON Workflow

If you've worked with DataSift before you'll be used to the Stream workflow:

  • Create filters in CSDL against your enabled data sources
  • Add classification rules to your filters using VEDO
  • Deliver raw, classified data to your chosen destination

PYLON is different. To maintain privacy, the output is delivered as summary results:

  • Create a filter that matches interactions on Facebook
  • Include classification rules to add value to the data
  • Record output from this filter into a private index
  • Submit analysis queries to the index to receive analysis results

The data you filter is recorded to a PYLON index only you have access to. You access your index using analysis queries which return summary statistical results from Facebook topic data.

Apply your CSDL expertise to Facebook topic data

Until now, if you're a DataSift customer you'll have been using our Stream product, which lets you filter data from a variety of social data sources such as Twitter, Tumblr, and blogs.

In this scenario, you'll have created filters, for example to capture tweets about popular mobile games:

( 
    interaction.content contains_any "Game of War,Boom Beach,Pet Rescue,Candy Crush,Don't Tap, 2048,FarmVille 2" 
    OR interaction.content contains_near "Clash,Clans:5" 
    OR interaction.content contains_near "Puzzle,Dragons:5" 
    OR interaction.content contains_all "step,white,tile" 
) 
AND NOT interaction.content wildcard "cheat*" 
AND NOT interaction.content wildcard "hack*"

Even if you're not familiar with CSDL syntax, you can see how easy it is to work across different data fields (which we call targets), make use of operators such as contains_any, and combine many conditions with boolean operators.
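
If you'd like a feel for what these operators do, the following Python sketch approximates three of them on plain strings. It is a simplified illustration and not DataSift's engine: the tokenizer, the helper names and the exact matching rules are assumptions, and real CSDL tokenization is more sophisticated.

```python
import re


def tokens(text):
    """Lower-case word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def _find_phrase(haystack, phrase):
    """Start indexes where the token list `phrase` occurs in `haystack`."""
    n = len(phrase)
    return [i for i in range(len(haystack) - n + 1) if haystack[i:i + n] == phrase]


def contains_any(text, csv):
    """True if at least one comma-separated word or phrase appears in `text`."""
    toks = tokens(text)
    return any(_find_phrase(toks, tokens(item)) for item in csv.split(",") if item.strip())


def contains_all(text, csv):
    """True if every comma-separated word or phrase appears in `text`."""
    toks = tokens(text)
    return all(_find_phrase(toks, tokens(item)) for item in csv.split(",") if item.strip())


def contains_near(text, spec):
    """`spec` looks like "Clash,Clans:5": both words within 5 tokens of each other."""
    words, distance = spec.rsplit(":", 1)
    first, second = (w.strip().lower() for w in words.split(","))
    toks = tokens(text)
    pos_a = [i for i, t in enumerate(toks) if t == first]
    pos_b = [i for i, t in enumerate(toks) if t == second]
    return any(abs(a - b) <= int(distance) for a in pos_a for b in pos_b)


print(contains_any("Playing Candy Crush all day", "Game of War,Candy Crush"))
print(contains_near("Clash of Clans is fun", "Clash,Clans:5"))
print(contains_all("Do not step on the white tile", "step,white,tile"))
```

The wildcard operator and the boolean combination are left out to keep the sketch short.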

Applying CSDL to Facebook topic data

With PYLON, just like Stream, your first step is to create a filter for the interactions you'd like to record to your index.

The best thing about PYLON is that under the hood it uses exactly the same filtering engine for the Facebook topic data stream as is used to filter the Twitter firehose.

So, let's take our example CSDL from above. To index Facebook topic data that meets exactly the same criteria as we used with Twitter, we would write:

( 
    interaction.content contains_any "Game of War,Boom Beach,Pet Rescue,Candy Crush,Don't Tap, 2048,FarmVille 2" 
    OR interaction.content contains_near "Clash,Clans:5" 
    OR interaction.content contains_near "Puzzle,Dragons:5" 
    OR interaction.content contains_all "step,white,tile" 
) 
AND NOT interaction.content wildcard "cheat*" 
AND NOT interaction.content wildcard "hack*"

Yes, that's right - the CSDL is exactly the same!

This works because common data fields in both sources are mapped to the interaction namespace. In this case we're making heavy use of interaction.content, which is the content of the post or tweet. Also, regardless of the data field you're filtering on, operators such as contains_any work in exactly the same way.
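
The mapping idea can be pictured with a short Python sketch. The raw field names ("text" and "message") and the function names are hypothetical, chosen only to show how one rule written against a shared field can serve two sources. This is not how the platform is implemented.

```python
def to_interaction(source, raw):
    """Map a raw item onto a shared `interaction.content` field.

    The raw keys ("text", "message") are hypothetical, chosen for this sketch.
    """
    content_key = {"twitter": "text", "facebook": "message"}[source]
    return {"interaction": {"type": source, "content": raw[content_key]}}


def mentions_candy_crush(item):
    # One rule, written once against the common field.
    return "candy crush" in item["interaction"]["content"].lower()


tweet = to_interaction("twitter", {"text": "Stuck on Candy Crush level 65"})
post = to_interaction("facebook", {"message": "Candy Crush again tonight"})

print(mentions_candy_crush(tweet), mentions_candy_crush(post))
```

Because the rule only ever touches the common field, adding a new source means adding one entry to the mapping rather than rewriting the rule.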

Of course, you can also utilise the unique data available from each network. For example, Facebook topic data contains audience demographics and topic data, which are not present in Twitter. With these, you could expand your CSDL to focus your analysis on males who live in California and are talking about cars:

fb.author.gender == "male" AND fb.author.region == "California" 
AND fb.topics.category == "Cars"

With Facebook topic data you can filter on demographics, but what's even better is that PYLON ensures privacy for Facebook users as returned results are anonymous, aggregated summaries.

Classifying with VEDO

The power of CSDL doesn't stop at filtering to identify data for analysis. You can use the same language, operators and data fields to classify your data using VEDO.

Classifying data this way is particularly important when using PYLON as privacy guards prevent you from accessing raw data. By using VEDO with PYLON you can add extra value to the data before it is saved in your index, and then make use of this extra data when submitting your analysis queries.

Let's build on our example above by adding tags to data that has been filtered. We can do so by adding tag rules to our filter:

tag.game "Candy Crush" { interaction.content contains "Candy Crush"} 
tag.game "Game of War" { interaction.content contains "Game of War"} 
tag.game "Boom Beach" { interaction.content contains "Boom Beach"} 
tag.game "Pet Rescue" { interaction.content contains "Pet Rescue"} 
tag.game "Don't Tap" { interaction.content contains "Don't Tap"} 
tag.game "2048" { interaction.content contains "2048"} 
tag.game "FarmVille 2" { interaction.content contains "FarmVille 2"} 
tag.game "Clash of Clans" { interaction.content contains_near "Clash,Clans:5"} 
tag.game "Puzzle & Dragons" { interaction.content contains_near "Puzzle,Dragons:5"} 
tag.game "Don't Step On The White Tile" { interaction.content contains_all "step,white,tile" }

Again, this code can be run on Twitter data using the Stream product or against Facebook data in PYLON.
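
Conceptually, each tag rule is a label plus a condition, and every rule that matches adds its label under the tag namespace (here, game). The Python sketch below mimics that behaviour on a plain string. The rule structure and the helper names are assumptions made for illustration, not a description of how VEDO works internally.

```python
def apply_tag_rules(content, rules):
    """Return {"tag_tree": {namespace: [labels]}} for every rule that matches.

    `rules` is a list of (namespace, label, predicate) tuples; this structure
    is an assumption of the sketch, not how VEDO stores rules internally.
    """
    tag_tree = {}
    for namespace, label, predicate in rules:
        if predicate(content):
            tag_tree.setdefault(namespace, []).append(label)
    return {"tag_tree": tag_tree}


def phrase(p):
    # Case-insensitive substring test, a rough stand-in for `contains`.
    return lambda content: p.lower() in content.lower()


RULES = [
    ("game", "Candy Crush", phrase("Candy Crush")),
    ("game", "Boom Beach", phrase("Boom Beach")),
    ("game", "2048", phrase("2048")),
]

print(apply_tag_rules("Candy Crush beats 2048 any day", RULES))
```

Note that one piece of content can carry several tags at once, which is why the labels are collected in a list.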

We can take things further by applying machine-learned classifiers. For instance, we could create a custom sentiment classifier, then apply it to both Facebook topic data and Twitter.

Analyzing Facebook topic data

Finally, to complete our workflow, let's take a quick look at submitting analysis queries in PYLON. As you cannot access the raw data, this is how you get insights from the data you recorded to your index.

When you submit a query to PYLON you specify the data field you'd like to analyze, what analysis to perform and which segment of the data in your index to analyze.

Here you get to use your CSDL skills once more, as CSDL is used to filter the data in your index into precise segments. You can drill down deeper and deeper to get very detailed insights.

As a quick example we could start by analyzing the age groups in our entire dataset as a frequency distribution. In PYLON you specify the following arguments:

{
    "analysis_type":"freqDist",
    "parameters":
    {
        "target":"fb.author.age"
    }
}
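
For readers unfamiliar with frequency distributions, the Python sketch below shows roughly what this kind of analysis computes: a count of interactions per distinct value of the target. The records, the age-group labels and the output keys ("key" and "interactions") are assumptions made for illustration. In PYLON the computation runs inside the service and you only receive the summary.

```python
from collections import Counter


def freq_dist(records, target, limit=None):
    """Count how often each value of `target` occurs, most frequent first."""
    counts = Counter(r[target] for r in records if target in r)
    return [
        {"key": key, "interactions": n}
        for key, n in counts.most_common(limit)
    ]


# Hypothetical indexed records; the age-group labels are invented.
records = (
    [{"fb.author.age": "18-24"}] * 3
    + [{"fb.author.age": "25-34"}] * 5
    + [{"fb.author.age": "35-44"}] * 1
)

for row in freq_dist(records, "fb.author.age"):
    print(row["key"], row["interactions"])
```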

Using CSDL we can take this further by selecting a precise subset, based upon demographics and the tags we introduced in our example. To do so we simply specify a filter when making the analysis request:

interaction.tag_tree.game == "Candy Crush" AND fb.author.gender == "female" 
AND fb.author.region == "UK"

Immediately this gives us some powerful insights, breaking down players by age group:

[Figure: breakdown of players by age group]
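
Putting the two pieces together, the analysis arguments and the segment filter could be assembled into a single request body along the following lines. This is a sketch only: the inner arguments mirror the JSON shown earlier, but the top-level "parameters" and "filter" keys and the function name `build_analysis_request` are assumptions, so check the API documentation for the actual request format before relying on it.

```python
import json


def build_analysis_request(analysis_type, target, csdl_filter=None):
    """Assemble an analysis request as a plain dict.

    The inner arguments mirror those shown earlier in the post; the top-level
    "parameters" and "filter" keys are assumptions of this sketch.
    """
    request = {
        "parameters": {
            "analysis_type": analysis_type,
            "parameters": {"target": target},
        }
    }
    if csdl_filter:
        request["filter"] = csdl_filter
    return request


segment = (
    'interaction.tag_tree.game == "Candy Crush" '
    'AND fb.author.gender == "female" '
    'AND fb.author.region == "UK"'
)

request = build_analysis_request("freqDist", "fb.author.age", segment)
print(json.dumps(request, indent=4))
```

Leaving the filter out would analyze the whole index, which corresponds to the first, unfiltered age-group query.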

This is just a taste of how powerful PYLON and VEDO are when analyzing Facebook topic data! In future posts we'll look at filtering, classification and analysis queries in much more depth.

