Sampling - Best Practices

Because Facebook topic data is such a high-volume data source, you can quickly hit your limits when recording a large audience. This guide looks at how you can use sampling to record a representative sample of an audience for analysis.

What is sampling?

By default, when you run a recording all interactions that match the filter conditions will be recorded into your index.

There may be situations where you do not want to record every interaction. If you'd like to analyze a very large audience, recording every interaction may exceed an account, identity or platform limit. For instance, without sampling you may hit the platform indexing limit (a maximum of 1 million interactions can be recorded by any recording in one day). Alternatively, you may have set an identity recording limit for your end customer that you need to stay within.

Recording a proportion of the interactions will still give you representative results in your analysis, but will keep you within your limits. This technique is called sampling.

You can perform sampling using the interaction.sample and fb.sample targets. The first allows you to perform 'story-level' sampling, selecting a proportion of stories and all related engagements. The second allows you to perform 'engagement-level' sampling where you can select a proportion of engagements distinct from parent stories.

interaction.sample vs fb.sample

It is important to understand how values for the two sampling targets are assigned by the platform so that you can use the targets appropriately in your interaction filters.

[Figure: sampling targets]

All stories that are received by the platform are assigned a random number between 0 and 100 for their interaction.sample target. This value is also assigned to the fb.sample target for the story.

Engagements are given the same value for the interaction.sample target as their parent story (as long as the parent story is present in the context cache). Engagements are assigned a distinct random number between 0 and 100 for their fb.sample target independent of the parent story.

Therefore stories and related engagements have the same value for the interaction.sample target, but independent values for the fb.sample target.
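The assignment rules above can be modelled in a few lines of Python. This is only an illustrative sketch, not platform code: the function names are hypothetical, the values are assumed to be uniformly distributed, and the parent story is assumed to be present in the context cache.

```python
import random

def make_story(rng):
    """Model a story: one random value is shared by both sampling targets."""
    value = rng.uniform(0, 100)  # assumption: uniform between 0 and 100
    return {"fb.type": "story", "interaction.sample": value, "fb.sample": value}

def make_engagement(rng, parent, fb_type="like"):
    """Model an engagement whose parent story is in the context cache."""
    return {
        "fb.type": fb_type,
        # Inherited from the parent story.
        "interaction.sample": parent["interaction.sample"],
        # Drawn independently of the parent story.
        "fb.sample": rng.uniform(0, 100),
    }

rng = random.Random(42)  # fixed seed so the run is repeatable
story = make_story(rng)
engagements = [make_engagement(rng, story) for _ in range(3)]
for item in [story] + engagements:
    print(item)
```

Every printed engagement carries the story's interaction.sample value, while each fb.sample value differs.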

Story-level sampling

Because the value of interaction.sample is the same for a story and its related engagements, this target is suitable for story-level sampling where you want to record a random sample of stories and all the engagements which relate to these stories.

[Figure: story-level sampling]

Story-level sampling tends to be suitable when you want to record a representation of a large audience, such as an entire industry or country. For example, imagine you are looking to study everyone discussing or engaging with movies to see which topics are trending. Your interaction filter will be broad, for example:

fb.topics.category in "Movie,Film" 
OR fb.parent.topics.category in "Movie,Film"

If you start a recording using this filter you will record a very large audience and be very likely to exceed your recording limit. As you are looking to identify trends amongst the audience you can instead record a representative sample of the audience, analyze the trends and safely assume these trends apply to the entire audience.

To do so you will want to record a random sample of the stories being posted and all the engagements on these stories. You can modify your filter to be as follows:

(fb.topics.category in "Movie,Film" 
OR fb.parent.topics.category in "Movie,Film") 
AND interaction.sample <= 5

Because engagements are given the same sample value as their parent stories this filter will match 5 percent of stories which mention movies, and match all the engagements on these selected stories.

Note that the example intentionally uses the less-than-or-equal-to (<=) operator to select interactions where the target value is up to and including 5.
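To see why this keeps stories and their engagements together, here is a small Python simulation of the sampling clause. It is a sketch under stated assumptions: sample values are modelled as uniform random numbers, every story is given the same hypothetical number of engagements, and all names are illustrative.

```python
import random

def matches_story_level(interaction, threshold=5):
    """Mirror of the sampling clause `interaction.sample <= 5`."""
    return interaction["interaction.sample"] <= threshold

rng = random.Random(7)
ENGAGEMENTS_PER_STORY = 3  # hypothetical; real stories vary widely
TOTAL_STORIES = 20_000
recorded_stories = 0
recorded_engagements = 0

for _story_index in range(TOTAL_STORIES):
    story = {"interaction.sample": rng.uniform(0, 100)}
    if matches_story_level(story):
        recorded_stories += 1
    for _engagement_index in range(ENGAGEMENTS_PER_STORY):
        # Engagements inherit the parent story's value.
        engagement = {"interaction.sample": story["interaction.sample"]}
        if matches_story_level(engagement):
            recorded_engagements += 1

story_rate = recorded_stories / TOTAL_STORIES
print(f"stories recorded: {story_rate:.1%}")
print(recorded_engagements == recorded_stories * ENGAGEMENTS_PER_STORY)
```

Roughly 5 percent of the simulated stories are recorded, and an engagement is recorded exactly when its parent story is.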

Engagement-level sampling

For some use cases it can be helpful to record all stories, but only a sample of engagements for an audience.

[Figure: engagement-level sampling]

For instance, if you are recording an audience for a global brand you will find that engagements (particularly likes) will use up a large portion of your recording allowance, leading you to miss out on recording important stories. You can use engagement-level sampling to ensure that you record all stories, but only a proportion of the engagements on these stories.

For example, if you wanted to record the audience for a major brand you could start with the following filter:

fb.all.content contains_any "Coca-cola, coca cola"

You could change the filter so that only 10 percent of engagements are captured:

fb.content contains_any "Coca-cola, coca cola" 
OR ( fb.parent.content contains_any "Coca-cola, coca cola"  AND fb.sample <= 10 )

Because the fb.sample value is random for each individual engagement, regardless of the parent story, you will record a 10 percent sample of engagements across all matching stories, rather than exactly 10 percent of the engagements on each individual story.
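The difference between sampling across stories and sampling per story can be illustrated with a short Python simulation. This is a sketch only: the engagement counts are hypothetical and fb.sample is modelled as a uniform random number.

```python
import random

rng = random.Random(11)
THRESHOLD = 10  # mirrors `fb.sample <= 10`

# Hypothetical engagement counts for five matching stories.
engagement_counts = [2, 5, 20, 500, 20_000]

recorded_per_story = []
for count in engagement_counts:
    # Each engagement draws its own value, independent of the story.
    recorded = sum(1 for _ in range(count) if rng.uniform(0, 100) <= THRESHOLD)
    recorded_per_story.append(recorded)

overall_rate = sum(recorded_per_story) / sum(engagement_counts)
for count, recorded in zip(engagement_counts, recorded_per_story):
    print(f"{count:>6} engagements -> {recorded} recorded")
print(f"overall: {overall_rate:.1%}")
```

The overall rate sits close to 10 percent, but a story with only a handful of engagements may have none of them recorded.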

You might choose to sample different types of engagements at different rates. For instance, you could capture all stories, 50 percent of comments and reshares, and 10 percent of likes:

fb.all.content contains_any "Coca-cola, coca cola" 
AND 
( 
   fb.type == "story" 
   OR ( fb.type IN "comment,reshare" AND fb.sample <= 50 ) 
   OR ( fb.type IN "like" AND fb.sample <= 10 ) 
)
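The per-type logic of this filter can be expressed as a small Python predicate, which may help when reasoning about which interactions will be kept. The function and dictionary names are hypothetical, and only the sampling part of the filter is modelled, not the content match.

```python
# Percentage of each engagement type to keep, taken from the filter above.
RATES = {"comment": 50, "reshare": 50, "like": 10}

def is_recorded(fb_type, fb_sample):
    """Return True if the sampling part of the filter would match."""
    if fb_type == "story":
        return True  # stories match regardless of their sample value
    if fb_type not in RATES:
        return False  # other types are not listed in the filter
    return fb_sample <= RATES[fb_type]

print(is_recorded("story", 99.0))    # True
print(is_recorded("comment", 42.0))  # True
print(is_recorded("like", 42.0))     # False
```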

Examples

Common sampling scenarios are discussed in detail below. This table summarizes the scenarios.

Scenario | Strategy | Sampling target | Explanation
-------- | -------- | --------------- | -----------
Studying an industry-level audience | Story-level | interaction.sample | Capturing a proportion of stories posted and all engagements on these stories will give you a good representation of the entire audience.
Monitoring health of a global brand | Engagement-level | fb.sample | You'll want to record all stories, as any one story could impact the brand. You can sample down engagements using fb.sample, because even a sample of engagements shows which stories are receiving a large volume of engagement.
Identifying viral content | Engagement-level | fb.sample | You'll want to record all stories, as any story could go viral. You can sample down engagements using fb.sample, because even a sample of engagements shows which stories are receiving a large volume of engagement.
Balancing an audience across countries | Story-level | interaction.sample | Capturing a proportion of stories posted and all engagements on these stories in each country will give you a good representation of the audience in the respective countries.
Recording a demographic audience baseline | Mixed | interaction.sample | The number of stories alone will likely exceed your limits, so you will need to sample both stories and engagements. In most cases you will need to sample engagements at a lower rate than stories.

Studying an industry-level audience

Imagine you're looking to study a wide audience, such as everyone discussing and engaging with automotive topics, to find out which demographic groups are posting stories about and engaging with well-known brands.

If you attempt to record everyone discussing cars you would quickly use up your recording allowance. In this case capturing a proportion of stories posted and all engagements on these stories will give you a good representation of the entire audience to use for your analysis.

You can record a representative audience using story-level sampling:

(fb.topics.category in "Cars, Automotive" 
OR fb.parent.topics.category in "Cars, Automotive") 
AND interaction.sample <= 0.5

Here the interaction.sample target is used to capture 0.5 percent of stories that mention cars and all engagements on these stories.

You can perform a typical age-gender breakdown analysis on both authors posting stories, and authors engaging with these stories.

[Figure: age-gender breakdown]

If you want to record a good proportion of stories but sample the engagements on these stories more aggressively, you could consider the following CSDL for your filter:

( 
   fb.topics.category in "Cars, Automotive" 
   OR fb.parent.topics.category in "Cars, Automotive" 
) 
AND 
( 
   interaction.sample <= 0.5 
   AND (fb.type == "story" OR fb.sample <= 50) 
)

Here the interaction.sample target restricts the recording to 0.5 percent of stories and the engagements on those stories. The fb.sample target then selects only 50 percent of the engagements that pass the first condition, so 0.25 percent of all automotive-related engagements are recorded. This works because an engagement has to satisfy both the interaction.sample condition and the fb.sample condition, whereas a story only has to satisfy the interaction.sample condition.
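The combined engagement rate is simply the product of the two sampling rates. A quick check in Python, using exact fractions to avoid rounding noise (the variable names are illustrative only):

```python
from fractions import Fraction

story_rate = Fraction(5, 1000)       # interaction.sample <= 0.5 -> 0.5 percent
engagement_rate = Fraction(50, 100)  # fb.sample <= 50           -> 50 percent

# Engagements must pass both conditions, so the rates multiply.
combined = story_rate * engagement_rate
combined_percent = float(combined * 100)
print(combined_percent)  # 0.25
```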

Monitoring health of a global brand

Imagine you're working with a global brand identifying stories which are receiving a lot of engagement and are therefore impacting the brand's reputation.

If you attempt to record every story and engagement relating to the brand you could exceed your recording limit, particularly if a story suddenly goes viral. In this case it is critical that you record every story, but you only need to record a proportion of engagements across these stories to identify which are receiving a large amount of engagement.

This filter will capture all stories that mention the brand and uses the fb.sample target to capture only 10 percent of the related engagements:

fb.content contains_any "BMW" 
OR ( fb.parent.content contains_any "BMW" AND fb.sample <= 10 )

With this recording running you will be able to analyze, for example, the top domains of links that are being shared, based on the engagement they receive.

[Figure: top domains of shared links]

Identifying viral content

Imagine you are looking to identify content relating to a topic or industry that is going viral.

If you attempt to record every story and engagement for your topic you could exceed your recording limit. Just like the brand health use case, it is critical that you record every story as any could go viral, but you only need to record a proportion of engagements across these stories to identify which are receiving a large amount of engagement.

The identifying influential content design pattern suggests this filter to capture movie-related content that is being shared:

( 
    fb.topics.category in "Movie,Film,TV/Movie Award" 
    OR fb.parent.topics.category in "Movie,Film,TV/Movie Award" 
) 
AND links.url exists

You can modify this filter to record all stories, but only 5 percent of the related engagements across these stories:

(
   fb.topics.category in "Movie,Film,TV/Movie Award" 
   OR fb.parent.topics.category in "Movie,Film,TV/Movie Award" 
) 
AND links.url exists 
AND ( fb.type == "story" OR fb.sample <= 5)

With this recording running you will be able to analyze, for example, the top links that are being shared, based on the engagement they receive.

[Figure: top shared links]

Balancing an audience across countries

Imagine you are working with a global brand looking to identify trends in a number of target markets, for example in South American countries. You run an initial recording and notice that you hit your recording limit and that your recorded data is dominated by authors from Brazil. You want to ensure you are recording adequate data from each country so you can analyze trends in each market.

You can use story-level sampling to record a different proportion of interactions from each country:

fb.all.content contains_any "BMW" 
AND 
( 
   (fb.author.country == "Brazil" AND interaction.sample <= 1) 
   OR (fb.author.country == "Argentina" AND interaction.sample <= 2) 
   OR (fb.author.country == "Chile" AND interaction.sample <= 10) 
)

This filter records 1 percent of stories from Brazil, 2 percent of stories from Argentina and 10 percent of stories from Chile, along with all the related engagements for the captured stories.

You can now analyze the topics and content that are trending in each country.
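One way to arrive at per-country rates like these is to work back from the volume seen in an initial recording. The Python sketch below does this; the volumes and the target are hypothetical figures chosen for illustration, and the helper is not part of the platform.

```python
def rate_for_target(volume, target):
    """Percentage of stories to keep so that roughly `target` are recorded."""
    return min(100.0, 100 * target / volume)

# Hypothetical story volumes per country for the recording period.
volumes = {"Brazil": 1_000_000, "Argentina": 500_000, "Chile": 100_000}
TARGET_STORIES = 10_000  # hypothetical number of stories wanted per country

rates = {
    country: rate_for_target(volume, TARGET_STORIES)
    for country, volume in volumes.items()
}
print(rates)  # {'Brazil': 1.0, 'Argentina': 2.0, 'Chile': 10.0}
```

Each resulting value would be used as the interaction.sample threshold for that country's clause.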

Recording a demographic audience baseline

Baselining your analysis results against a wider population helps you show what is unique about your audience. For most baselining scenarios you will want to record a large audience to compare against, for example the entire Facebook population. Of course recording such a large audience would exceed your limits, so sampling is essential.

The baseline audience you want to record will depend on the analysis you are looking to compare. A typical example is comparing the age-gender breakdown of your audience to the Facebook population as a whole.

To record such a baseline you will need to sample both stories and engagements, as the number of stories alone will likely exceed your limits. Because the volume of engagements far exceeds the volume of stories, it's good practice to sample engagements at a lower rate than stories:

(fb.type == "story" AND interaction.sample < 0.04) 
OR (fb.type != "story" AND interaction.sample < 0.008)

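The first line of this filter records 0.04 percent of stories, and the second line records 0.008 percent of engagements. Because the filter uses the interaction.sample target, each engagement has the same value as its parent story, and because the story threshold is higher than the engagement threshold, the parent story of every recorded engagement is also recorded. The parent is therefore present in the context cache, so all recorded engagements will be hydrated.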
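The reasoning behind this can be checked with a short Python sketch: for any sample value, an engagement that passes the lower threshold has a parent story that passes the higher one. The function names are hypothetical and the check runs over a grid of candidate values rather than real data.

```python
STORY_THRESHOLD = 0.04
ENGAGEMENT_THRESHOLD = 0.008

def story_recorded(sample):
    return sample < STORY_THRESHOLD

def engagement_recorded(parent_sample):
    # Engagements inherit interaction.sample from their parent story.
    return parent_sample < ENGAGEMENT_THRESHOLD

# Candidate interaction.sample values from 0.000 to 0.099.
samples = [i / 1000 for i in range(100)]

# Recorded engagements whose parent story would NOT be recorded.
orphans = [s for s in samples if engagement_recorded(s) and not story_recorded(s)]
print(len(orphans))  # 0
```

If the thresholds were swapped, so that engagements were sampled at a higher rate than stories, the list of orphans would no longer be empty.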

You can now use this baseline in your analysis results, for example baselining the age-gender breakdown of people posting stories.

[Figure: baselined age-gender breakdown]

Avoiding confusing results

Sampling can become confusing if you mix and match sampling rates in your filter. This is because sampling of engagements relies on how stories are stored in the context cache.

To avoid confusion we recommend you keep to one of these three approaches regardless of your use case:

  • record a sample of stories and all related engagements (story-level sampling).
  • record all stories and a sample of engagements on these stories (engagement-level sampling).
  • record a greater proportion of stories than engagements, ensuring you use the interaction.sample target, as shown in the baseline example above.

To illustrate how sampling can be confusing, consider this filter:

(fb.content contains_any "Coca-cola, coca cola" AND fb.sample <= 10) 
OR (fb.parent.content contains_any "Coca-cola, coca cola"  AND fb.sample <= 10 )

Here 10 percent of stories are recorded and are therefore stored in the context cache. As engagements arrive only 10 percent of parent stories are present in the context cache, so only 10 percent of the engagements are hydrated with values from their parent stories. The fb.parent.content clause can only select 10 percent of the engagements as 90 percent of engagements will have a missing value for this target. The sampling clause then selects only 10 percent of the remaining engagements, so in fact only 1 percent of engagements are recorded.
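The compounding effect can be reproduced with a small Python simulation. It is only a sketch: sample values are modelled as uniform random numbers, each story is given the same hypothetical number of engagements, and an engagement is treated as unmatched whenever its parent story was not recorded.

```python
import random

rng = random.Random(3)
THRESHOLD = 10              # both clauses use `fb.sample <= 10`
ENGAGEMENTS_PER_STORY = 10  # hypothetical

total = 0
recorded = 0
for _story_index in range(20_000):
    # The story is recorded, and so cached, only if its own value passes.
    story_cached = rng.uniform(0, 100) <= THRESHOLD
    for _engagement_index in range(ENGAGEMENTS_PER_STORY):
        total += 1
        own_sample = rng.uniform(0, 100)
        # Without a cached parent, fb.parent.content has no value to match.
        if story_cached and own_sample <= THRESHOLD:
            recorded += 1

rate = recorded / total
print(f"engagements recorded: {rate:.2%}")
```

The printed rate is close to 1 percent rather than the 10 percent the filter appears to ask for.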

From this example you can see that it is important that the matching stories for the engagements you are trying to record are present in the context cache.

If you use the interaction.sample target for story-level sampling the platform ensures that the value for this target is the same for stories and the related engagements, so the stories required for hydrating engagements will be present in the cache.

If you use the fb.sample target to sample engagements make sure the related stories are also recorded so that they will be available in the context cache to hydrate the engagements.

Does sampling distort my results?

When you apply sampling to your recordings you may be concerned you will distort your results or lose signal.

The only way to check distortion is to compare a sampled audience against the entire audience. Our data science team regularly performs this check for sampled audiences, taking slices of the audience and confirming a representative audience is recorded.

For instance this chart shows the result of comparing a 1 percent sample of an audience to the entire audience from which the sample was taken:

[Figure: 1 percent sample compared with the entire audience]

The grey bars represent the audience that has not been sampled. You can see the sampled audience shown in the foreground closely matches the entire audience, demonstrating that sampling does provide a representative audience.

To ensure you do not lose signal you need to ensure you are employing the correct sampling technique for your use case. For example, in the brand health example above it is important that all stories are recorded, whereas only a proportion of engagements across all stories are required to capture the key signals for the use case. The use of fb.sample reflects the priorities of the use case.

Analyzing sampled audiences

If you include sampling rules in your interaction filters, when you come to analyze the recorded interactions you need to remember to allow for sampling when you interpret the results.

Where possible we recommend you normalize your results against the size of your audience so that you interpret the distribution of results, rather than the individual values.

For instance, continuing the global audience example above, where audiences are sampled at different rates for each country, we can analyze the age-gender breakdown of the audience in each country and plot the normalized results for easy comparison.

[Figure: normalized age-gender breakdown for two countries]

You can read more about normalization in our baselining design pattern.
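As a minimal illustration of normalization, the Python sketch below converts raw counts into shares of each audience's total. The counts are hypothetical and the helper name is illustrative; the point is that audiences recorded at different sampling rates become directly comparable once expressed as proportions.

```python
def normalize(counts):
    """Convert raw counts into shares of the total."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}

# Hypothetical unique-author counts per age group.
brazil = {"18-24": 300, "25-34": 500, "35-44": 200}  # sampled at 1 percent
chile = {"18-24": 30, "25-34": 50, "35-44": 20}      # sampled at 10 percent

print(normalize(brazil))
print(normalize(chile))
```

Although the raw counts differ by a factor of ten, both audiences show the same distribution across age groups.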

If normalization is not an option you need to consider how you will scale the results of your analysis to reflect sampling rates.

For instance, suppose you included a rule which specified a 10 percent sample using the interaction.sample target, and a frequency distribution analysis using the fb.language target returned the following result:

{
    "unique_authors": 2300, 
    "analysis": {
        "analysis_type": "freqDist", 
        "redacted": false, 
        "results": [
            {
                "unique_authors": 2900, 
                "key": "es", 
                "interactions": 2400
            },
            ...
        ], 
        "parameters": {
            "threshold": 4, 
            "target": "fb.language"
        }
    }, 
    "interactions": 3500
}

The fb.language target is only applicable to stories, so only stories will be included in the analysis. You also know that you only recorded a 10 percent sample of the stories. The number of unique authors reported for Spanish is 2,900. Adjusting this value for the sample rate gives 29,000 assuming the sampled audience is representative.
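The adjustment is a simple division by the sampling rate. A hypothetical Python helper (not part of any client library) makes the calculation explicit:

```python
def scale_for_sampling(observed, sample_percent):
    """Estimate the full-audience value from a sampled value.

    Assumes the sampled audience is representative of the whole.
    """
    if not 0 < sample_percent <= 100:
        raise ValueError("sample_percent must be greater than 0 and at most 100")
    return observed * 100 / sample_percent

estimate = scale_for_sampling(2900, 10)
print(estimate)  # 29000.0
```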

Sampling super public text samples

When you use sampling targets in an interaction filter the same sampling will be applied to super public text samples that are cached. For example, if you include a rule which specifies a 10 percent sample using the interaction.sample target, then 10 percent of applicable super public posts will be cached.