Richard Caudle | 20th October 2015
DataSift PYLON for Facebook Topic Data allows you to analyze audiences on Facebook whilst protecting users' privacy. To help you build more accurate analyses, we're introducing 'Super Public' text samples for Facebook. You can use Super Public text samples to validate your interaction filters, checking that you are recording the correct data into your index for analysis. You can also use these text samples to train machine-learned classifiers. In this post we'll take a look at this new type of data and how to use it to validate interaction filters.
What are Super Public text samples?
When you work with PYLON you start by creating a filter in CSDL that specifies which stories (posts) and engagements you'd like to record into your index for analysis. For privacy reasons you cannot view the raw text of these stories, as they include non-public posts.
Super Public text samples are different: these are stories that the author has chosen to share publicly, so we can give you access to their content. Stories are classed as Super Public if they are:
- Posted by someone who has “Who can see your future posts?” set to “Public” under their privacy settings
- Posted by someone who has chosen to make a specific post publicly viewable
- Posted by someone who has the Follow setting enabled, allowing non-friends to see their stories
- Not posted to someone else’s timeline
Because you have access to the raw content of these stories, they are very useful for validating the results of your analysis of the non-public stories and engagements.
Accessing Super Public text samples
If Super Public text samples are enabled on your account, then whenever you record data into an index using an interaction filter, any Super Public stories that match the running filter are also recorded in a separate cache alongside the index.
You can retrieve the Super Public stories using the pylon/sample endpoint. Fetched stories are deleted from the cache, so you'll need to store them yourself for future use. This endpoint is limited to 100 stories per hour.
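Because fetched stories leave the cache, it helps to write each batch to disk as soon as it arrives. The following is a minimal sketch, assuming you already have the fetched stories as a list of Python dictionaries; the helper names and the JSON Lines file layout are illustrative choices, not part of PYLON.

```python
import json
from pathlib import Path


def append_samples(path, interactions):
    """Append each fetched story to a JSON Lines file, one story per line."""
    with Path(path).open("a", encoding="utf-8") as fh:
        for story in interactions:
            fh.write(json.dumps(story, ensure_ascii=False) + "\n")
    return len(interactions)


def load_samples(path):
    """Read back every story stored so far (empty list if the file is missing)."""
    p = Path(path)
    if not p.exists():
        return []
    with p.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Appending rather than overwriting means repeated fetches within the hourly limit accumulate into a single local archive that you can reload whenever you revisit a filter.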
Validating interaction filters
Let's take a look at an example to make things clearer.
Perhaps you'd like to analyze audiences discussing the US presidential election. You could create a filter such as this:
fb.topics.name == "Democratic Party" OR fb.topics.name == "Republican National Committee" OR fb.content contains_any "Clinton, Trump, Sanders, Carson, Bush"
Running this interaction filter as a recording will store any matching stories into your index. In addition, any Super Public stories that match the conditions are also stored, but in a separate cache.
Fetching text samples
Before you start analyzing the recorded data in your index, you will no doubt want to validate the data you are recording. You can do this by fetching data from the pylon/sample endpoint:
curl -X GET 'https://api.datasift.com/v1.2/pylon/sample?hash=[hash for recording]' -H 'Content-type: application/json' -H 'Authorization: [your username]:[identity api key]'
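If you would rather make the same call from a script, here is a rough Python equivalent using only the standard library. It mirrors the URL and headers of the curl command; the function names are hypothetical, and the response is returned as decoded JSON without assuming anything about its structure.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.datasift.com/v1.2"  # base URL taken from the curl example


def build_sample_request(recording_hash, username, api_key):
    """Build the GET request for pylon/sample without sending it."""
    query = urllib.parse.urlencode({"hash": recording_hash})
    return urllib.request.Request(
        f"{API_BASE}/pylon/sample?{query}",
        headers={
            "Content-type": "application/json",
            "Authorization": f"{username}:{api_key}",
        },
        method="GET",
    )


def fetch_samples(recording_hash, username, api_key, timeout=30):
    """Send the request and return the decoded JSON body unchanged."""
    request = build_sample_request(recording_hash, username, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keep in mind that every call counts towards the hourly limit and removes the returned stories from the cache, so save the result of fetch_samples before doing anything else with it.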
Based on the example filter, you might get back a Super Public story about George H. W. Bush. The story has matched the filter, but we are only interested in Jeb Bush!
You can now amend the last clause of your interaction filter to use the phrase "Jeb Bush" in place of "Bush" and remove this noise:
fb.content contains_any "Clinton, Trump, Sanders, Carson, Jeb Bush"
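Before updating the recording, you can get a rough feel for the change by testing both keyword lists against text samples you have already saved. The sketch below is only a local approximation (case-insensitive, whole-word matching) and does not reproduce how CSDL's contains_any tokenizes content; the sample texts are made up for illustration.

```python
import re


def matches_any(text, phrases):
    """Return True if any phrase occurs in text as whole words, ignoring case."""
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False


OLD = ["Clinton", "Trump", "Sanders", "Carson", "Bush"]
NEW = ["Clinton", "Trump", "Sanders", "Carson", "Jeb Bush"]

# Made-up sample texts for illustration; not real Super Public stories.
samples = [
    "George H. W. Bush gave a speech today",
    "Jeb Bush spoke at a rally in Iowa",
]

for text in samples:
    # Columns: matched by old list, matched by new list, sample text.
    print(matches_any(text, OLD), matches_any(text, NEW), text)
```

With these two samples, the first is matched by the old list only, while the second is matched by both, which is the behaviour the amended clause is aiming for.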
You can see that Super Public text samples are going to be an important part of your workflow, allowing you to validate the data you are recording. This is critical because you cannot access the raw text of the posts in your index.
In a future post we'll take a look at how you can use this data for training machine-learned classifiers so that you can identify sentiment, intention and emotion in Facebook posts.