Validating Interaction Filters Using Super Public Posts

Learning from super public post content

PYLON's privacy model prevents you from accessing raw story content. This is a key feature of PYLON that protects audiences, but it presents challenges when you try to improve your filters and tags.

How can you verify that the stories you are recording are relevant to your analysis? How can you check that your tags are working as expected? Analysis queries go a long way towards helping you understand what data is in your index, but nothing beats seeing the verbatim text of stories.

Super public posts provide you with raw sample content by giving you access to stories authors have chosen to make public. Any super public posts that match the interaction filter of a running recording will be stored in a cache separate from the recording's index. You can access the cache of super public posts using the pylon/sample endpoint.

Each post includes the raw story content, detected topics and extracted links. Your tagging rules are also run against the sample posts. For example, here's a super public post from an automotive recording:

{
    "fb": {
        "content": "I love how the rear seats fold flat in the BMW X5",
        "language": "en",
        "hashtags": [ "bmw" ],
        "topics": { "name": "BMW" },
        "topic_ids": [ 565634324 ]
    },
    "interaction": {
        "media_type": "photo",
        "subtype": "story",
        "content": "I love how the rear seats fold flat in the BMW X5",
        "created_at": "Thu, 21 Jan 2016 16:36:04 +0000",
        "id": "079701744092c80b6ee07044959243c3"
    },
    "tag_tree": {
        "automotive": {
            "bmw": {
                "X_Series": [ "X5" ]
            }
        }
    },
    "links": {
        "code": [ 200 ],
        "domain": [ "carbuyer.co.uk" ],
        "normalized_url": [ "http://www.carbuyer.co.uk/reviews/bmw/x5/suv/review" ],
        "url": [ "http://www.carbuyer.co.uk/reviews/bmw/x5/suv/review" ]
    }
}

Using super public posts you can improve your interaction filters, removing noise from your recordings (and therefore improving the accuracy of your analysis) and expanding your filter to capture additional relevant interactions for your use case. You can also use super public posts to improve the accuracy of your tagging rules and to identify new custom topics to classify for use in your analysis.

As an example, let's imagine we're looking to record stories relating to sports shoes so that we can analyze share of voice for each brand in this market. We want to create an interaction filter that captures real users engaging with the brands, as opposed to noisy posts attempting to sell shoes.

We could start with the following CSDL for our interaction filter:

fb.topics.name in "adidas,Reebok,Nike" 
AND fb.content contains_any "shoes,sneakers,trainers,pair"

We run a recording using this filter and hit the pylon/sample endpoint. We find noisy posts such as…

Adidas stansmith shoes 
Premuim quality deadmactch  100%
# not legit but same as original 100%
READY TO ORDER? Please fill up this order form 

With example posts such as this we can remove the noise from our recording by adding new filter clauses. We can repeat this process to tune our interaction filter and improve our analysis results.

In this guide we'll take a look at how you can collect super public posts for your recordings and use them first to remove noise from your recordings, and then to broaden your filter to capture further signal.

When is this pattern applicable?

You should consider using super public samples to test and improve your interaction filters every time you create a new recording.

In particular, super public posts can be used to:

  • Remove noise from your recordings
  • Validate tagging rules you have included in your interaction filter
  • Broaden your interaction filters to include further terms and topics
  • Build new tags to extract custom topics for your analysis

In the following walkthrough we'll focus on improving an interaction filter by removing noise and broadening the terms used to capture further relevant interactions.

See the additional examples section below for further use cases of super public posts.

Solution

With the sports shoes example in mind, let's take a look at how in practice super public posts can be used to improve our filter.

Here we'll focus on improving an interaction filter based on the content of super public posts. You could also improve your filter by looking at the results of analysis queries you submit to your index; we won't cover that approach in this guide.

Step 1: Start an initial recording

We want to use super public posts to iteratively improve an interaction filter. Super public posts are automatically collected for running recordings, so to get started we need to create a first version of our interaction filter and start a recording.

Let's start with our example CSDL from above:

fb.topics.name in "adidas,Reebok,Nike" 
AND fb.content contains_any "shoes,sneakers,trainers,pair"

Here we are using topics to record stories that relate to any of three well-known sports brands, and keywords that relate to shoes. We start a recording using this filter.
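
In practice this means compiling the CSDL and starting a recording against the compiled hash. Here's a minimal sketch using the DataSift Python client library (the same library we use in Step 2 below); the pylon.compile and pylon.start calls and the recording name are assumptions to check against your version of the client library:

from datasift import Client

# Authenticate with the identity you want to record under
client = Client("username", "identity api key")

csdl = '''
fb.topics.name in "adidas,Reebok,Nike"
AND fb.content contains_any "shoes,sneakers,trainers,pair"
'''

# Compile the interaction filter to obtain its hash,
# then start a recording against that hash
compiled = client.pylon.compile(csdl)
client.pylon.start(compiled['hash'], name='Sports shoes - v1')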

When recording stories for such well-known brands it is easy to hit recording limits. To protect your account's overall recording limit we recommend recording the data using an identity that has its own recording limit applied.

It's worth remembering that limits are applied separately to super public posts. If your identity is limited to recording 100,000 interactions a day, it is also limited to caching 100,000 super public posts a day. If your recording hits the 100,000 recording limit super public posts will continue to be cached until the super public post limit is reached.

Warning: Super public posts do not include authors' demographic details. If your interaction filter contains a clause that filters by age group or location, no super public posts will match your filter and none will be cached. You may need to create a separate recording whose filter has no demographic clauses in order to collect super public posts.

Step 2: Retrieve super public posts

Now that we have data being recorded and super public posts being cached we next need to retrieve the super public posts from the API.

We are limited to retrieving 100 super public posts for the recording per hour, so it may take a number of hours to collect enough posts to study.

To collect samples over a period of time we create a script that we can run as a long-running task. Here's an example script written in Python that uses the DataSift client library:

import json
import time
from datetime import datetime

from datasift import Client
from datasift.exceptions import DataSiftApiException

# Create a DataSift client
client = Client("username", "identity api key")

# The hash of the running recording
recording_id = "your recording hash"

# Serialize any datetime values returned by the client library
def json_serial(obj):
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError("Type not serializable")

# Retrieve super public posts and append them to a file
def retrieve_samples(filepath, recording_id, start=None, end=None, filter=None):
    try:
        response = client.pylon.sample(recording_id, count=100, start=start, end=end, filter=filter)
        with open(filepath, "a") as outfile:
            for i in response['interactions']:
                json.dump(i, outfile, default=json_serial)
                outfile.write("\n")
    except DataSiftApiException as e:
        if e.response.status_code == 429:
            print("Hit sample retrieval limit.")
        else:
            print("Exception:", e)

# Request super public posts once an hour, 24 times
for h in range(24):
    retrieve_samples('output.json', recording_id, filter='fb.language=="en"')
    time.sleep(3601)  # wait for just over 1 hour

Notice here we have used the filter parameter to retrieve only posts in English. The filter parameter is very powerful, allowing you to apply the same query filters you would when analyzing an index.
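
For example, to narrow the sample to English posts that matched via the Nike topic you could pass a more specific query filter (a hypothetical illustration; any valid query filter CSDL works here, using the recording_id from the script above):

# Retrieve only English posts that matched the Nike topic
response = client.pylon.sample(recording_id, count=100,
                               filter='fb.language == "en" AND fb.topics.name in "Nike"')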

Typically you'll want to run a process overnight to collect enough samples.

Warning: Whenever you retrieve sample posts from the API, the posts you receive are removed from the cache. We therefore recommend that whenever you hit the pylon/sample endpoint you store the retrieved posts for later use.

Step 3: Removing general noise from your recording

When you have collected a good number of posts you'll want to inspect these posts to understand which interactions your interaction filter is matching.

Before you look at the terms and phrases that are particular to your use case, you might spot some general characteristics that distinguish noisy posts.

Firstly, the length of a post is often a good indicator of noise. Users expressing opinions typically do so in short stories; long stories are usually posted by people trying to sell items.

A sample post from our recording reads as follows (in fact this post has been truncated; the original is hundreds of characters long):

New arrival - Adidas d rose 6 
Premuim quality deadmactch  100%
# not legit but same as original 100%
"Ghost pair" high grade perfect replica
Oem(original equipment manufactured)
Mens 7,8,8.5,9.5,10,11,12
Womens 5 to 9
Kids 36 to 40
Actual photo posted...

We can exclude stories with long content from our recording by adding a clause to our filter:

fb.topics.name in "adidas,Reebok,Nike" 
AND fb.content contains_any "shoes,sneakers,trainers,pair" 
AND fb.content regex_exact ".{,200}"

This clause will filter out any content over 200 characters long.

Secondly, heavy use of uppercase letters is a typical indication of noise. Another post from our example reads:

NEW ADIDAS ZX FLUX FLORAL NPS PRINT MENS ORIGINAL SUPERSTAR B34467 SHOES

Again we can add a new clause to our filter to remove such stories:

fb.topics.name in "adidas,Reebok,Nike" 
AND fb.content contains_any "shoes,sneakers,trainers,pair" 
AND fb.content regex_exact ".{,200}" 
AND NOT fb.content regex_partial "[A-Z\\s]{25}"

This clause will remove any posts containing a run of 25 or more characters made up of uppercase letters and whitespace.
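
Before updating the live filter, you can get a rough feel for how many of the stored samples these two clauses would exclude. Here's a minimal sketch, assuming the samples were written one JSON object per line to output.json by the script in Step 2:

import json
import re

# Matches a run of 25 or more uppercase letters and whitespace,
# mirroring the "[A-Z\s]{25}" clause added to the filter
uppercase_run = re.compile(r"[A-Z\s]{25}")

total = 0
long_posts = 0
shouty_posts = 0

with open("output.json") as infile:
    for line in infile:
        content = json.loads(line).get("fb", {}).get("content", "")
        total += 1
        if len(content) > 200:
            long_posts += 1
        if uppercase_run.search(content):
            shouty_posts += 1

print(total, "samples:", long_posts, "over 200 characters,", shouty_posts, "mostly uppercase")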

Step 4: Removing use-case specific noise from your recording

Next let's look at noise that is specific to our use case. What is particular about the posts we are matching that signifies noise?

Another post from our example filter reads:

Delivering tomorrow 8am till 10pm..
• Emporio Armani EA7 tracksuits 
• Adidas trainers
• Women's Nike Roshe trainers 
• Women's Timberland boots

In fact, just by reading a few samples it is clear that a significant proportion of posts mention delivery, order forms and pricing.

Again we can use this knowledge to improve our interaction filter:

fb.topics.name in "adidas,Reebok,Nike" 
AND fb.content contains_any "shoes,sneakers,trainers,pair" 
AND fb.content regex_exact ".{,200}" 
AND NOT fb.content regex_partial "[A-Z\\s]{25}" 
AND NOT fb.content contains_any "order, orders, delivery, delivering, prices, brand new, replicas, replica, original equipment, high quality"

Another interesting aspect of the posts in this example is the prevalence of shoe sizes. A person might mention their own shoe size in a post, but any post that lists a range of shoe sizes is very likely to be spam.

An example post reads:

Adidas shoes for sale...
Mens 7,8,8.5,9.5,10,11,12
Womens 5 to 9
Kids 36 to 40

We can add another clause to filter out these stories:

fb.topics.name in "adidas,Reebok,Nike" 
AND fb.content contains_any "shoes,sneakers,trainers,pair" 
AND fb.content regex_exact ".{,200}" 
AND NOT fb.content regex_partial "[A-Z\\s]{25}" 
AND NOT fb.content contains_any "order, orders, delivery, delivering, prices, brand new, replicas, replica, original equipment, high quality" 
AND NOT fb.content regex_partial "(\\d+(\\.5)?,?\\s?)+" 
AND NOT fb.content regex_partial "\\d+(\\.5)?\\s?(to|-)\\s?\\d+(\\.5)?"

Here we use regular expressions to match phrases in the form of:

  • 1,2,3,4,5.5 - Lists of shoe sizes
  • 1 to 10 - Ranges of shoe sizes
  • 1 - 10 - Ranges of shoe sizes
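
These patterns can be tried out locally against example content before they are added to the filter. A quick sketch using Python's re module (the patterns below use raw strings, so the backslashes are written singly):

import re

# List of shoe sizes, e.g. "7,8,8.5,9.5,10"
size_list = re.compile(r"(\d+(\.5)?,?\s?)+")
# Range of shoe sizes, e.g. "5 to 9" or "36 - 40"
size_range = re.compile(r"\d+(\.5)?\s?(to|-)\s?\d+(\.5)?")

examples = [
    "Adidas shoes for sale... Mens 7,8,8.5,9.5,10,11,12",
    "Womens 5 to 9",
    "Kids 36 to 40",
    "I love my new Reebok trainers!",
]

for text in examples:
    is_noise = bool(size_list.search(text) or size_range.search(text))
    print(is_noise, "-", text)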

Step 5: Broadening your interaction filter

When looking at super public posts you'll often find terms and phrases you hadn't thought of including upfront but may help you capture a broader audience.

An example post from our first filter reads:

Just got the sickest sneaks through. Adidas Tubular Doom X CNY. These shoes are so nice!

Language evolves and it can be difficult to keep up with the latest words used by your audience. This story was matched by our filter because of the keyword 'shoes', but there could be many stories we missed that just use the keyword 'sneaks'.

It's also worth bearing in mind that misspellings are common, especially as people often post stories from their smartphones. From reading the sample posts for our recording we noticed these misspellings:

  • trainers: trainners
  • shoes: shoess, shooes
  • sneakers: sneekers, sneeks
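
Rather than spotting new terms and misspellings purely by eye, one way to surface candidates is to count word frequencies across the stored samples. A minimal sketch, again assuming the output.json file produced in Step 2:

import json
import re
from collections import Counter

counts = Counter()

with open("output.json") as infile:
    for line in infile:
        content = json.loads(line).get("fb", {}).get("content", "")
        # Split on non-letter characters and count lowercased words
        counts.update(w.lower() for w in re.split(r"[^a-zA-Z]+", content) if len(w) > 3)

# Inspect the most common words for terms missing from the keyword list
for word, count in counts.most_common(50):
    print(count, word)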

We'll now broaden our filter to include some terms that appear in the super public samples that weren't on our original keyword list:

fb.topics.name in "adidas,Reebok,Nike" 
AND fb.content contains_any "shoes,sneakers,trainers,pair,sneaks,kicks,jordans,sneaker,trainners,shoess,shooes,sneekers,sneeks" 
AND fb.content regex_exact ".{,200}" 
AND NOT fb.content regex_partial "[A-Z\\s]{25}" 
AND NOT fb.content contains_any "order, orders, delivery, delivering, prices, brand new, replicas, replica, original equipment, high quality" 
AND NOT fb.content regex_partial "(\\d+(\\.5)?,?\\s?)+" 
AND NOT fb.content regex_partial "\\d+(\\.5)?\\s?(to|-)\\s?\\d+(\\.5)?"

Step 6: Iteratively improving your interaction filter

Now that we've spent some time improving our example interaction filter, we start a new recording and check that our changes have improved the results.

Our changes may have removed enough noise to reveal new, less obvious indicators of noise which we can also exclude from our recording in our next iteration. Equally, when we broadened our filter to include new terms we may have inadvertently introduced new sources of noise which we now need to remove.

You should iteratively improve your filter until you are satisfied with the results you are receiving.

Note: In this process we've taken the approach of iteratively improving our interaction filter by updating the filter conditions. Another approach, as long as you have adequate capacity, would be to add tags for interactions you consider noise. You could then use these tags in query filters for your analysis to exclude the interactions as you see fit.
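
As an illustration of that alternative approach, you could tag suspected sales posts in the interaction filter and then exclude them at analysis time with a query filter. The sketch below is assumption-heavy: the tag.noise.type rule, the interaction.tag_tree.noise.type query-filter target and the exact pylon.analyze signature should all be checked against the PYLON documentation for your account:

# Tag rule added to the interaction filter (CSDL), shown here as a comment:
#   tag.noise.type "sales" { fb.content contains_any "order,delivery,prices" }

# Frequency distribution of topics, excluding stories tagged as sales noise.
# The tag_tree query-filter target and the analyze signature are assumptions to verify.
parameters = {
    'analysis_type': 'freqDist',
    'parameters': {'threshold': 10, 'target': 'fb.topics.name'}
}

result = client.pylon.analyze(
    recording_id,
    parameters,
    filter='NOT interaction.tag_tree.noise.type == "sales"'
)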

Considerations

Things to consider when making use of super public posts:

  • Although the process can take some time, perfecting your filter is well worth the investment as it will greatly improve your analysis results.
  • Even when you are happy with your filter you should consider periodically checking long-running recordings to ensure you continue to record relevant data.
  • Remember that super public posts do not include demographic details. If your interaction filter contains a clause that filters by demographic properties you will not receive super public posts.
  • Make use of the pylon/sample endpoint filter parameter to retrieve posts that are most relevant to you.
  • Combine your study of super public samples with analysis results from your index when improving your filter.

Additional examples

The solution above walked you through improving an interaction filter. Super public data can be used for a variety of other purposes.

Improve your tags

You can use super public posts to check that your tag rules are running as expected. Suppose we had added the following tags to our interaction filter:

tag.sports.brand "Adidas" { fb.content contains_any "Adidas" } 
tag.sports.brand "Nike" { fb.content contains_any "Nike" } 
tag.sports.brand "Reebok" { fb.content contains_any "Reebok" }

If a super public post had arrived with the following content:

I love my new Addidas trainers - they make me run so much faster than my old Reebok trainers!!

The misspelling of Adidas means the post won't have been tagged with the Adidas tag. Inspecting the JSON for this post we can see this is the case, as only Reebok is tagged:

{
    "fb": {
        "content": "I love my new Addidas trainers - they make me run so much faster than my old Reebok trainers!!",
        "language": "en"
    },
    "tag_tree": {
        "sports": {
            "brand": [ "Reebok" ]
        }
    }
}

We could improve our tags by including misspellings of the brand names and could consider adding topics as part of our tag conditions.

Investigating a particular timespan

Often when you analyze an index you'll see peaks of activity that you'll want to investigate. One way to do so is to use the start and end parameters for the pylon/sample endpoint which allow you to retrieve posts from a specific time range.

For example you could retrieve samples for a particular hour:

start = 1454083200 # Friday, 29th January 2016, 16:00
end = 1454086800   # Friday, 29th January 2016, 17:00

samples = client.pylon.sample(id, count=100, start=start, end=end)

This approach is particularly useful when tracking live events.
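
If you'd rather not work out Unix timestamps by hand, a small helper using Python's standard library can produce them (a convenience sketch, not part of the client library):

from datetime import datetime, timezone

def to_timestamp(year, month, day, hour=0, minute=0):
    # Convert a UTC date and time to a Unix timestamp
    return int(datetime(year, month, day, hour, minute, tzinfo=timezone.utc).timestamp())

start = to_timestamp(2016, 1, 29, 16)  # Friday, 29th January 2016, 16:00 UTC
end = to_timestamp(2016, 1, 29, 17)    # Friday, 29th January 2016, 17:00 UTC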

Adding tags for custom topics

Often you'll want to go deeper than just studying brands, for example investigating products. Unless products are very well known they are unlikely to appear in the topic graph and so you will need to add tagging rules to your filter to identify mentions of them.

Super public posts are useful for seeing how products are discussed. Products are often misspelled in posts, and mentions often do not fit the official branded product name.

For example, Yeezy Boost is a range of Adidas products endorsed by Kanye West. Looking at our super public samples we can pick out the terms used by authors and create a tag rule:

tag.sports.adidas "Yeezy Boost" { fb.content contains_any "yeezy,yeezyboost,yeezys,yezzys,kanye" }

Repeating this process for ranges of products would give us a richer recording to analyze.

Resources

We will be publishing further patterns that explain other uses of super public posts soon.