Using Sampling to Monitor Share of Voice for Large Brands

Measuring the strength of a brand against competitor brands is a common form of brand analysis. If you are looking to monitor share of voice for a number of large brands you may find that recording every brand mention exceeds your recording limits. In this pattern we'll look at how you can use sampling to record a representative sample of mentions for each brand and still perform accurate share of voice analysis as shown in the share of voice analysis pattern.

Using sampling to study large audiences

If you attempt to record every mention of a global brand in stories and the related engagements you will quickly hit recording limits. This is particularly true if you are looking to serve many end customers, and have created identity limits to portion your allowance accordingly. A global brand may receive hundreds of thousands of engagements in one day, and yet you may have set an identity limit of 100,000 interactions a day to serve your end customer. In such a scenario PYLON's sampling features allow you to tackle this challenge.

Imagine you are studying share of voice for three global brands, such as BMW, Ford and Honda. These brands receive hundreds of thousands of engagements each day on Facebook.

To analyze the share of voice each brand achieves it is not necessary to record every story and engagement. Recording a fair proportion of stories and engagements across the three brands using sampling still allows you to accurately analyze the share of voice for each.

Choosing a sampling strategy

Before you start working on your solution, it is important to consider what analysis you are looking to perform so that you select an appropriate sampling strategy.

In this case you are looking to perform share of voice analysis, and so need to be able to answer the following questions with your analysis:

  • What is the share of voice for each brand across our entire audience?
  • How has the share of voice for each brand changed over time?
  • How does brand share of voice vary across location?
  • How does brand share of voice vary across demographic segments?

So you will need a strategy that records a fair proportion of stories and engagements across the three brands, across all locations you want to include, and across all demographic segments.

In the majority of use cases, if your limits allow, we recommend you record all stories and a sample of engagements. This gives you the broadest, most representative dataset for analysis. In this case you should aim to record all stories, but if you do not have capacity to do so you can sample both stories and engagements and still record enough data for accurate analysis.

Also consider the impact on your analysis if you do hit your limit. For some use cases, such as brand monitoring, it is essential that you capture every story; for this use case, however, missing a small amount of data would not have a significant impact. So when you choose your final sampling rate you will want to allow plenty of headroom for peaks in activity, but if you do happen to hit your limit your analysis will still be valid.

Read more about sampling strategies in our best practice guide.

Note: Sampling is not always the answer if you are hitting your capacity limits. Before you consider sampling, make sure that you are recording only the interactions you require. For instance, if you are looking to analyze a US audience but have not specified a country condition in your interaction filter, you will be recording authors from all countries. Simply tightening up your interaction filter may remove the need for you to apply sampling.

When is this pattern applicable?

Share of voice analysis is a common form of analysis typically used for brand monitoring (or brand health) and market research scenarios. This form of analysis is shown in the share of voice analysis pattern.

This pattern specifically applies when you are looking to perform analysis for a set of brands which receive a large volume of mentions or engagement, and recording these interactions will exceed your recording limits.

This pattern shows you how to work within the bounds of recording limits, applying sampling techniques and still producing accurate analysis results.


To see how you can apply sampling to perform share of voice analysis we'll continue to work with the automotive example we discussed in the share of voice analysis pattern.

Step 1: Create an identity limit

As you are working with large volumes of data, but potentially serving multiple customers with your account, we recommend that you use identity limits to protect recording capacity.

For this example, imagine you are onboarding a new end customer who you want to provide ongoing share of voice analysis to. You want to be able to serve the customer within a recording allowance of 100,000 interactions per day.

You can create an identity for the customer and apply the recording limit to the new identity. For example in Python:

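from datasift import Client
# Authenticate with your account-level credentials
datasift = Client("your username", "your account API key")
# Create an identity for the end customer
identity = datasift.account.identity.create('End customer name')
# Attach the customer's Facebook token to the identity
token = datasift.account.identity.token.create(identity['id'], 'facebook', '<Facebook Token>')
# Cap the identity at 100,000 interactions per day
datasift.account.identity.limit.create(identity['id'], 'facebook', 100000)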

You can use this identity to run exploratory recordings to calculate a suitable sampling rate, and then use the identity to run your ongoing recording to serve the customer once they have been onboarded.

Tip: The managing identities guide explains how you can create identities and assign service limits to protect your account capacity.

Step 2: Run an exploratory recording

Before you decide on a sampling strategy your next step is to record an initial set of data so that you can understand the options you can consider.

Define an interaction filter for your exploratory recording that includes all the conditions you require, but at this point excluding any sampling conditions.

Continuing the automotive example you could define a filter to record stories and engagements for three car brands:

tag.automotive.brand "Ford" { fb.all.content contains_any "Ford" } 
tag.automotive.brand "BMW" { fb.all.content contains_any "BMW" } 
tag.automotive.brand "Honda" { fb.all.content contains_any "Honda" } 
return { 
            fb.all.content contains_any "Ford, BMW, Honda" 
        AND ( 
            fb.topics.category in "Cars, Automotive" 
            OR fb.parent.topics.category in "Cars, Automotive" 
        ) 
}

Note that this filter will record interactions from any country. You may only be interested in interactions from authors in particular locations, or conversing in certain languages. Make sure you include any such conditions at this point so that you can record a realistic benchmark before you apply sampling. It may turn out that once you have the correct conditions in place sampling is not required.

Start a recording using the identity you created in step 1 so that your exploration does not impact any other customers you are serving.

Step 3: Assess your exploratory recording

If you use the CSDL above for your interaction filter you will soon hit the recording limit you assigned to your identity, but you can use the data from your first recording to work towards a sampling rate that you can use for your final filter definition.

You can see the rate at which you are consuming your allowance by running a time series analysis query:

analyze_parameters = {
  "analysis_type": "timeSeries",
  "parameters": {
    "interval": "hour",
    "span": 1
  }
}

# 'client' is a Client created with the identity's API key; 'id' identifies your recording
total = client.pylon.analyze(id, analyze_parameters)

For example you may see a result like so:

This chart shows interactions are being recorded for a couple of hours each day until the recording limit is hit. Once the limit is hit no more interactions are recorded until the daily limit reset takes place.

You can repeat the analysis with query filters to see what proportion of the interactions are stories and engagements:

stories = client.pylon.analyze(id, analyze_parameters, filter='fb.type == "story"')
engagements = client.pylon.analyze(id, analyze_parameters, filter='fb.type != "story"')

The analysis shows that the vast majority of interactions are engagements.

Looking at the data returned from the analysis queries in tabular form shows you what volumes are being recorded:

You can see that, before the recording limit was breached, a peak of approximately 21,000 interactions was being recorded per hour, so over the 24 hours of a day the recording limit of 100,000 was clearly going to be breached. You can also see that far fewer stories are recorded than engagements: 1,100 during the peak hour.

Your next step is to use these initial results to calculate a sample rate to try to ensure you can record a whole day of data. In the majority of use cases, if you can avoid sampling stories then we recommend you sample engagements only, this gives you the broadest set of data for analysis. In this example you have capacity to record all stories and a sample of engagements.

Looking at the day with most activity, in the 5 hours before limits were hit 10,500 stories were recorded (an average of 2,100 stories per hour) and 89,200 engagements were recorded (an average of 17,840 engagements per hour).
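The hourly averages above, and the volume a full day would produce without sampling, can be checked with a short calculation. This is a minimal sketch using only the figures quoted in this example; the function and variable names are illustrative and are not part of the DataSift client library.

```python
def hourly_average(total_recorded, hours_recorded):
    """Average number of interactions recorded per hour."""
    return total_recorded / hours_recorded


# Figures from the busiest day of the exploratory recording
# (the 5 hours before the limit was hit).
HOURS_RECORDED = 5
stories_per_hour = hourly_average(10_500, HOURS_RECORDED)
engagements_per_hour = hourly_average(89_200, HOURS_RECORDED)

# Projected volume over a full day if nothing were sampled.
projected_daily_total = (stories_per_hour + engagements_per_hour) * 24

print(f"Stories per hour:      {stories_per_hour:,.0f}")
print(f"Engagements per hour:  {engagements_per_hour:,.0f}")
print(f"Projected daily total: {projected_daily_total:,.0f}")
```

At this rate the unsampled filter would produce several times the 100,000 interaction limit in a day, which is why some sampling is needed.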

Assuming this is a peak day you can estimate:

Stories = 2,100 interactions * 24 hours = 50,400 interactions / day

This leaves 49,600 interactions / day to record as many engagements as possible. Allowing for headroom above the observed average of 17,840, you can assume up to 20,000 engagements per hour and calculate a sampling rate accordingly:

Sample rate = 49,600 / (20,000 engagements * 24 hours) = 10.33%

Therefore if you apply a sampling rate of 10% to engagements you should be able to record a day of activity within your 100,000 interaction limit.
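The same calculation can be wrapped in a small helper so that you can repeat it with different limits or volumes. This is a sketch under the assumptions used above (all stories are recorded, and engagements arrive at a steady peak rate for 24 hours); the function name is hypothetical.

```python
def engagement_sample_rate(daily_limit, stories_per_hour,
                           peak_engagements_per_hour, hours=24):
    """Fraction of engagements that fits in the allowance left over
    after every story has been recorded."""
    stories_per_day = stories_per_hour * hours
    remaining = daily_limit - stories_per_day
    if remaining <= 0:
        raise ValueError(
            "Stories alone exceed the daily limit; "
            "consider story-level sampling instead."
        )
    engagements_per_day = peak_engagements_per_hour * hours
    # Never return more than 100%: no sampling is needed in that case.
    return min(1.0, remaining / engagements_per_day)


rate = engagement_sample_rate(
    daily_limit=100_000,
    stories_per_hour=2_100,
    peak_engagements_per_hour=20_000,
)
print(f"Maximum engagement sampling rate: {rate:.2%}")
```

Rounding the result down to a whole percentage, rather than up, keeps the recording inside the allowance.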

Step 4: Introduce sampling to your interaction filter

At this point you need to update your original interaction filter to include your sampling rate.

Applying 10 percent sampling to engagements will give you:

tag.automotive.brand "Ford" { fb.all.content contains_any "Ford" } 
tag.automotive.brand "BMW" { fb.all.content contains_any "BMW" } 
tag.automotive.brand "Honda" { fb.all.content contains_any "Honda" } 

return { 
         fb.all.content contains_any "Ford, BMW, Honda" 
      AND ( 
         fb.topics.category in "Cars, Automotive" 
         OR fb.parent.topics.category in "Cars, Automotive" 
      ) 
   AND (fb.type == "story" OR fb.sample <= 10) 
}

The new condition added at the end of the filter records every story, but for interactions that are not stories (that is, engagements: likes, comments and reshares) it records only a random 10 percent sample across all stories.
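To get a feel for what this condition does to recorded volumes, the following Python sketch mimics its logic on made-up data. It assumes the sample value behaves like a uniform random number between 0 and 100 assigned to each interaction; the function name and the volumes are hypothetical, and the code does not call PYLON.

```python
import random


def should_record(interaction_type, sample_percent=10, rng=random):
    """Mimic the sampling condition: keep every story, and keep a
    random sample of everything else."""
    if interaction_type == "story":
        return True
    # Assumption: the sample value is a uniform draw between 0 and 100.
    return rng.uniform(0, 100) < sample_percent


rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Hypothetical volumes: 1,000 stories and 10,000 engagements.
interactions = ["story"] * 1_000 + ["like"] * 10_000

recorded = [t for t in interactions if should_record(t, 10, rng)]
stories_kept = recorded.count("story")
engagements_kept = len(recorded) - stories_kept

print(f"Stories recorded:     {stories_kept}")
print(f"Engagements recorded: {engagements_kept}")
```

Every story is kept, while the number of engagements kept varies around 10 percent of the total from run to run.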

Update your filter, noting when you made the change, and continue to run your recording with the sampling rate now introduced.

Tip: If your exploratory recording reveals you do not have enough capacity to record all stories, then you could instead apply 'story-level' sampling to your recording.

Step 5: Assess your sampled recording

To see if your sampling has been successful you can repeat the analysis you performed above.

Make sure you apply the correct timespan to your analysis queries, setting the start time as when you updated your filter to include sampling.
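One way to keep the timespan consistent across queries is to compute the start and end of the window once and reuse them. The helper below is a sketch only: the function name and the example dates are hypothetical, and it assumes your analysis queries take the window as Unix timestamps in seconds.

```python
from datetime import datetime, timezone


def analysis_window(filter_updated_at, end_at):
    """Return (start, end) as Unix timestamps in seconds."""
    if filter_updated_at.tzinfo is None or end_at.tzinfo is None:
        raise ValueError("Use timezone-aware datetimes to avoid ambiguity.")
    if end_at <= filter_updated_at:
        raise ValueError("The window must end after the filter update.")
    return int(filter_updated_at.timestamp()), int(end_at.timestamp())


# Hypothetical times: the filter was updated at 09:00 UTC and the
# analysis covers the six days that follow.
start, end = analysis_window(
    datetime(2016, 5, 10, 9, 0, tzinfo=timezone.utc),
    datetime(2016, 5, 16, 9, 0, tzinfo=timezone.utc),
)
print(start, end)
```

Check the reference for your client library version to confirm how the start and end of the window are passed to an analysis query.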

You may see a result like so:

Here you can see that limits were not breached until the morning of May 15th. This suggests the sampling applied to the filter was successful, except on the day of peak activity in this period.

Looking at the numbers in detail:

You can see that the recording limit was hit before 6am GMT, which is over 2 hours before the limit reset time of midnight Pacific Standard Time. In the course of 21 hours on this day before the limit was reached 33,900 stories and 64,700 engagements were recorded.

Step 6: Iterate to a final sampling rate

At this point you need to decide how aggressively you want to sample engagements.

Changing the sampling rate in the filter to 7% would avoid hitting the recording limit on a day of similar peak activity. However, because there tend to be around ten times more engagements than stories on Facebook, as a rule of thumb we suggest that you don't sample engagements at less than 10%. Otherwise you will record more stories than engagements and lose quality in your analysis.
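The 7% figure can be sanity-checked against the numbers from the sampled recording. The sketch below assumes that activity continues for the rest of the day at the average rate seen in the 21 hours before the limit was hit, and that all stories are still recorded; the function name is hypothetical.

```python
def revised_sample_rate(daily_limit, stories, sampled_engagements,
                        hours_observed, current_rate, hours=24):
    """Estimate the highest engagement sampling rate that fits a full
    day at the observed level of activity."""
    stories_per_day = stories / hours_observed * hours
    # Scale the sampled engagements back up to the unsampled volume.
    engagements_per_day = (
        sampled_engagements / current_rate / hours_observed * hours
    )
    remaining = daily_limit - stories_per_day
    if remaining <= 0:
        raise ValueError("Stories alone exceed the daily limit.")
    return min(1.0, remaining / engagements_per_day)


rate = revised_sample_rate(
    daily_limit=100_000,
    stories=33_900,
    sampled_engagements=64_700,
    hours_observed=21,
    current_rate=0.10,
)
print(f"Highest rate that fits the peak day: {rate:.2%}")
```

Under these assumptions the break-even rate is a little over 8%, so 7% leaves a small margin for a slightly busier day.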

Depending on your exact use case you may want to let your filter run for a long period of time to ensure that you understand what level of activity can occur at peak times.

Whatever you decide you can use the process explained above to iterate to your final filter.

For further explanation of sampling options and strategies read our best practice guide.

Step 7: Analyzing your recorded data

Now that you have data being recorded you can start your analysis. You can perform the same analysis explained in the analyzing share of voice pattern, except that when you add sampling rules to your interaction filters you need to remember to allow for this when you interpret the results of your analysis.

For example you can perform a frequency distribution analysis using the brand tags in the filter:

from datasift import Client

client = Client("DataSift username", "Identity API key")

analyze_parameters = {
  "analysis_type": "freqDist",
  "parameters": {
    "target": "interaction.tag_tree.automotive.brand",
    "threshold": 3
  }
}

# The query filter restricts the analysis to engagements (everything that is not a story)
result = client.pylon.analyze('index id', analyze_parameters, filter='fb.type != "story"')

Notice that here a query filter is used to analyze engagements only. Because you're sampling engagements but not stories it usually makes sense to analyze stories and engagements independently.

If you leave your recording for a number of days you might see results such as this:

As you are sampling engagements at 10 percent, you can multiply the number of interactions by 10 to estimate the non-sampled result. The true number of unique authors isn't quite so straightforward, as authors are deduplicated across the interactions recorded. You can multiply the number of unique authors by 10 to get a representative result, but you should keep this subtlety in mind.

Where possible we recommend you normalize your results against the size of your audience so that you interpret the distribution of results, rather than the individual values. This approach works well for share of voice analysis as you are interested in the relative size of audiences for each brand.
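As an illustration of both approaches, the sketch below scales sampled engagement counts back up and also normalizes them into shares. The brand counts are invented for the example and are not real results; the function name is hypothetical.

```python
def share_of_voice(sampled_counts, sample_rate):
    """Return {brand: (estimated_unsampled_count, share_of_total)}."""
    total = sum(sampled_counts.values())
    if total == 0:
        raise ValueError("No interactions to analyze.")
    return {
        brand: (round(count / sample_rate), count / total)
        for brand, count in sampled_counts.items()
    }


# Hypothetical sampled engagement counts per brand tag (not real data).
sampled = {"Ford": 5_000, "BMW": 3_000, "Honda": 2_000}

results = share_of_voice(sampled, sample_rate=0.10)
for brand, (estimate, share) in results.items():
    print(f"{brand}: ~{estimate:,} engagements, {share:.0%} share of voice")
```

Because every brand is sampled at the same rate, the rate cancels out of the share calculation, which is why the normalized shares can be read directly from the sampled counts. The same scaling should not be applied blindly to unique author counts, for the deduplication reason described above.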

For this example plotting the analysis results shows the relative share of voice and audience for each brand.

The concept of normalizing analysis results is shown in detail in our baselining example workbook.

You can repeat the same analysis as shown in the analyzing share of voice pattern to complete your solution.


When applying sampling for a share of voice use case keep in mind:

  • Consider up front what analysis you are looking to perform.
  • Base your sampling strategy on your required analysis.
  • Is it absolutely necessary to use sampling? It may be that tightening your filter conditions will allow you to work within your limits without sampling being required.
  • Consider allowing your recording to hit limits on occasion. For this use case it is not essential you capture every piece of data.
  • If you do not have capacity to record every story then you can apply story-level sampling and still achieve accurate results.
  • Take into account the sampling you have applied when interpreting your analysis results.


For further reading, take a look at the following resources: