Baselining Analysis Results

When you receive analysis results from PYLON, these results represent the audience captured by your recording. Baselining adds to your analysis by surfacing how your audience differs from a comparison audience. For example, if you're looking at the audience engaging with a brand, how does it compare to audiences engaging with other brands in the same market?

When describing this pattern we'll refer to the example automotive recording which is available in the analysis sandbox. This recording captures audiences discussing three car brands and uses CSDL similar to the interaction filter shown in step 1 of the solution.

Why baseline analysis results?

If we perform a nested age-gender analysis against the example automotive recording for one of the three brands, in this case Honda, we receive the following result:

baselining-audience-age-gender

This chart tells us that 25-34 year old males are the largest group in the audience. However, common sense tells us that this group is likely to be engaged with automotive topics in general, so it is not clear whether this result is a feature of the individual brand or of the automotive audience as a whole.

To understand how the brand's audience compares to the wider automotive audience we can use a technique called baselining.

With baselining we repeat the same analysis query on a recording capturing the wider audience. We can then draw a chart comparing the two results:

baselining-baselined-age-gender

On the chart the grey bars represent the wider audience we're using for the comparison. The results have been normalized before plotting for easier comparison.

Baselining reveals that although 25-34 males are the largest group in the brand's audience, this group is underrepresented relative to the general automotive audience. The result also tells us that females in general over-index for this brand. These insights suggest an advertising campaign targeted at females between 45 and 65 might be very effective.

So baselining has revealed insights that were simply not visible with our first analysis result.

When is baselining applicable?

Baselining is applicable whenever you'd like to compare a focused audience to a wider reference audience.

In particular baselining can be used to:

  • reveal demographic differences between your audience and the baseline audience
  • compare when your audience is active, as opposed to the reference audience
  • show how your audience breaks down across geographies in comparison to the reference audience

In the following solution we'll focus on comparing a brand audience in the example automotive index with a wider automotive audience regardless of brand. However, you could choose to baseline against any other audience you choose to record.

For example you may want to baseline your results for one country against a global audience, or compare one brand's audience to the audience for a set of luxury brands. You simply need to decide what audience to record in your comparison index and apply the same solution. In fact you could carry out baselining within one index, for example comparing one of the automotive brands to the remaining audience by making use of query filters in your analysis requests.
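As a minimal sketch of the single-index approach, the two query filters could be built as shown here. The helper name `brand_filters` is hypothetical, and using the `in` operator against the brand tag is an assumption, so check the filter against your own index before relying on it.

```python
# Hypothetical helper: build query filters for baselining inside one index.
# Assumes the index tags interactions with tag.automotive.brand for these brands.
BRANDS = ["Ford", "BMW", "Honda"]

def brand_filters(focus, brands=BRANDS):
    """Return (audience_filter, baseline_filter) for one brand versus the rest."""
    if focus not in brands:
        raise ValueError("unknown brand: %s" % focus)
    others = [b for b in brands if b != focus]
    audience = 'interaction.tag_tree.automotive.brand == "%s"' % focus
    # Assumption: `in` with a comma-separated list is accepted for tag values.
    baseline = 'interaction.tag_tree.automotive.brand in "%s"' % ", ".join(others)
    return audience, baseline

audience_filter, baseline_filter = brand_filters("Honda")
```

Note that an interaction mentioning more than one brand would match both filters, so the two groups are not strictly disjoint.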

Solution

Now that we've established the concept of baselining let's take a look at a worked example. We'll record interactions for the audience we are studying to an index, then additionally record interactions for our baseline audience to a separate index.

Step 1: Recording data for your audience

Naturally the first step is to record data for the audience you're looking to analyze. Here we'll stick with the well-known automotive example and use the following CSDL for our interaction filter:

tag.automotive.brand "Ford" { fb.parent.content contains_any "Ford" or fb.content contains_any "Ford" } 
tag.automotive.brand "BMW" { fb.parent.content contains_any "BMW" or fb.content contains_any "BMW" } 
tag.automotive.brand "Honda" { fb.parent.content contains_any "Honda" or fb.content contains_any "Honda" } 

return { 
    ( 
        fb.parent.content contains_any "Ford, BMW, Honda" 
        OR fb.content contains_any "Ford, BMW, Honda" 
    ) 
    AND 
    ( 
        fb.topics.category in "Cars, Automotive" 
        OR fb.parent.topics.category in "Cars, Automotive" 
    ) 
    AND 
    ( 
        fb.author.country_code in "US" 
    ) 
}

This CSDL specifies that we want to record stories and engagements from US-based users that mention any of three well-known brands and that relate to car or automotive discussions.

We can use a query filter in our analysis queries to analyze only interactions that relate to Honda by making use of the tags we included:

interaction.tag_tree.automotive.brand == "Honda"

Step 2: Recording data for your baseline audience

The next step is to start a recording that can be used as the baseline audience.

For this example filter we'll record a broader audience engaging with automotive topics, regardless of the brand, again from US-based users:

return { 
    ( 
        fb.topics.category in "Cars, Automotive" 
        OR fb.parent.topics.category in "Cars, Automotive" 
    ) 
    AND 
    ( 
        fb.author.country_code in "US" 
    ) 
}

You might find that recording a broader audience quickly consumes your recording allowance.

Depending on the analysis you'd like to perform you can consider the following options:

  • Tune your recording to capture a smaller, but still representative, audience. For instance, in this case you could restrict the filter to a longer list of brands, perhaps 20, rather than recording all automotive discussion.
  • If you do not need to baseline time series analysis results you could create a new identity for the baseline recording with its own recording limit. The new identity's limit will protect your overall account recording limit, and even though the baseline recording may hit the limit, you will still have data in your index you can use to baseline frequency distribution analysis.

Step 3: Comparing absolute result values

Once you have data recorded to both indexes you can now start to submit analysis queries and perform your baselining.

Let's start by comparing results by their absolute values. In the next step we'll look at comparing normalized results.

The chart we began with showed an age-gender breakdown. We can draw this chart for each index and compare the shape of the distributions by eye.

analyze_parameters = {
    'analysis_type': 'freqDist',
    'parameters':
    {
        'threshold': 2,
        'target': 'fb.author.gender'
    },
    'child':
    {
        'analysis_type': 'freqDist',
        'parameters':
        {
            'threshold': 5,
            'target': 'fb.author.age'
        }
    }
}

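# `datasift` is assumed to be an authenticated client from the DataSift Python library.
# The same parameters are sent to both indexes; only the audience query is narrowed to Honda.
audience = datasift.pylon.analyze('audience index id', analyze_parameters, filter='interaction.tag_tree.automotive.brand == "Honda"')

baseline = datasift.pylon.analyze('baseline index id', analyze_parameters)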
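Before charting a nested result it usually helps to flatten it into one value per gender and age pair. The following is a minimal sketch only: the function name `flatten_nested` is hypothetical, and it assumes each parent result carries its breakdown under `child` and then `results`, which you should confirm against the response you actually receive. The sample values are made up.

```python
def flatten_nested(results):
    """Flatten gender -> age nested results into {(gender, age): unique_authors}."""
    flat = {}
    for parent in results:
        # Assumed shape: the child breakdown sits under parent["child"]["results"].
        children = parent.get("child", {}).get("results", [])
        for child in children:
            flat[(parent["key"], child["key"])] = child["unique_authors"]
    return flat

# Made-up values in the assumed response shape, for illustration only.
sample = [
    {"key": "male", "interactions": 900, "unique_authors": 700,
     "child": {"results": [
         {"key": "25-34", "interactions": 500, "unique_authors": 400},
         {"key": "35-44", "interactions": 400, "unique_authors": 300}]}},
    {"key": "female", "interactions": 600, "unique_authors": 500,
     "child": {"results": [
         {"key": "25-34", "interactions": 600, "unique_authors": 500}]}},
]
flat = flatten_nested(sample)
```

Each flattened entry can then be plotted as one bar, grouped by gender.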

The resulting chart for the Honda audience:

baselining-audience-age-gender

The resulting chart for the baseline index:

baselining-baseline-age-gender

Notice that we've repeated the same query for each index. It is important that your queries are identical, or at least equivalent for the purposes of your comparison. For instance, use the same start and end values so that you compare the same time period.
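One way to guarantee a matching time period is to compute the window once and reuse it for both queries. This is a sketch under assumptions: `shared_window` is a hypothetical helper, and passing `start` and `end` as keyword arguments to `analyze` is assumed rather than confirmed here, so the calls are left as comments.

```python
import time

def shared_window(days, now=None):
    """Return start/end Unix timestamps covering the last `days` days."""
    end = int(now if now is not None else time.time())
    start = end - days * 24 * 60 * 60
    return {"start": start, "end": end}

# A fixed, arbitrary "now" keeps the example repeatable.
window = shared_window(7, now=1450396800)

# Both queries would then receive exactly the same window (assumed keyword names):
# audience = datasift.pylon.analyze('audience index id', analyze_parameters,
#                                   filter=honda_filter, **window)
# baseline = datasift.pylon.analyze('baseline index id', analyze_parameters, **window)
```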

These charts are useful, but to really understand the results we need to normalize the distributions. This will also allow us to plot both results on the same chart. We'll look at this next.

When you begin to baseline your analysis you'll learn more about audiences on Facebook. For instance, the profile of the audience that posts stories is usually very different from that of the audience that engages. You may choose to carry out analysis separately for each of these groups, which you can do using query filters on your queries.

You can read more about charting results from PYLON and see example code in our charting PYLON analysis results iPython notebook.

Step 4: Normalizing analysis results for comparison

We've looked at comparing analysis results using the absolute values that are returned from the API. However, for most use cases you'll want to normalize your results as the audiences you are comparing will differ in size.

Normalization allows you to focus on the differences between the audiences. For frequency distributions you can more easily identify underrepresented and overrepresented groups. For time series results you can more easily identify unusual peaks and troughs of activity.

There are many options for normalizing results but for most use cases we recommend you normalize against the sum of the results for a query.

So for example if you received the following result for a frequency distribution:

[
    {
        "interactions": 312200,
        "key": "18-24",
        "unique_authors": 259900
    },
    {
        "interactions": 539700,
        "key": "25-34",
        "unique_authors": 414100
    },
    {
        "interactions": 578600,
        "key": "35-44",
        "unique_authors": 475600
    },
    ...
]

Assuming you're analyzing the number of unique authors (which makes sense for most use cases), you would sum the unique_authors values across all categories, then divide each category's value by this total. The result would look as follows:

key      unique_authors  normalized
18-24    259900          0.100678
25-34    414100          0.160411
35-44    475600          0.184234
...

The normalized values add to 1, so effectively here we have a probability distribution for the categories.
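The calculation can be sketched in a few lines of Python. The function name `normalize` is hypothetical. The example uses only the three rows shown in the result above, so its shares are larger than those in the table, where the total also includes the categories elided by "...".

```python
def normalize(results, field="unique_authors"):
    """Divide each category's value by the total across all categories."""
    total = sum(r[field] for r in results)
    if total == 0:
        # Nothing to normalize against; report zero shares.
        return {r["key"]: 0.0 for r in results}
    return {r["key"]: r[field] / total for r in results}

# The three rows visible in the example result; the full result has more.
rows = [
    {"key": "18-24", "interactions": 312200, "unique_authors": 259900},
    {"key": "25-34", "interactions": 539700, "unique_authors": 414100},
    {"key": "35-44", "interactions": 578600, "unique_authors": 475600},
]
shares = normalize(rows)
```

Passing `field="interactions"` gives the same calculation for interaction counts.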

By normalizing the results for both your audience and baseline audiences you can now plot charts using the normalized values which compare the two results more clearly. Our age-gender breakdown now becomes:

baselining-baselined-age-gender

This brings us back to our original analysis that 25-34 males are underrepresented and females are generally overrepresented.
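If you want a number rather than a visual comparison, one common option is the ratio of the audience share to the baseline share for each group. This is a sketch, not the method used to draw the chart: `index_scores` is a hypothetical name and the shares below are made up for illustration.

```python
def index_scores(audience_shares, baseline_shares):
    """Ratio of audience share to baseline share for every key found in both."""
    scores = {}
    for key, share in audience_shares.items():
        base = baseline_shares.get(key)
        if base:  # skip keys that are missing from, or zero in, the baseline
            scores[key] = share / base
    return scores

# Made-up normalized shares, for illustration only.
audience_shares = {"male 25-34": 0.20, "female 45-54": 0.15}
baseline_shares = {"male 25-34": 0.25, "female 45-54": 0.10}
scores = index_scores(audience_shares, baseline_shares)
```

A score above 1 means the group over-indexes in your audience, and a score below 1 means it is underrepresented.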

Take a look at the Additional Examples section below for examples of baselining time series and geographic analysis results.

You can read more about normalization and see example code in our example baselining PYLON analysis results iPython notebook.

Considerations

When you carry out baselining it's important to keep in mind:

  • It is worth spending time tuning recordings of baseline audiences that you can use across your projects.
  • Consider comparing to a variety of audiences, for example those engaging with different brands or topics, or those located in different geographies.
  • Make sure that when you submit analysis queries to each index you are comparing the same time period and are using the same query filters (if applicable).
  • Audiences often differ greatly by country. If you are analyzing a US audience in most cases you would want to use a broader US-only audience for your baselining.
  • Audiences who post stories and those who engage usually differ significantly. You might want to analyze these two groups separately.

Additional examples

In the solution above we focused on baselining the results of an age-gender breakdown performed using a nested query. Here we'll look at examples of other analysis results you might want to baseline.

Baselining time series analysis results

Using a time series analysis you can see audience activity over time. Baselining time series results is powerful as it reveals when your audience is more or less active than a broader audience.

Continuing with the automotive example you may find that the brand's audience is more active when a new model is released or a large recall takes place. You can investigate these peaks and troughs using baselining.

Just like the demographics example we can perform the same analysis query on both indexes:

analyze_parameters = {
    'analysis_type': 'timeSeries',
    'parameters':
    {
        'interval': 'hour',
        'span': 1
    }
}

audience_ts = datasift.pylon.analyze('audience index id', analyze_parameters, filter='interaction.tag_tree.automotive.brand == "Honda"')
baseline_ts = datasift.pylon.analyze('baseline index id', analyze_parameters)

We can easily display the results on a chart, for example comparing the number of interactions created over time:

baselining-timeseries-not-normalised

You can perform the same normalization on time series results. For most use cases you will want to normalize the interactions values using the same method ready for plotting.

With normalization the time series chart now becomes:

baselining-timeseries-normalised

Now we can clearly see a peak of activity for our audience around the 18th December.
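A peak like this can also be located programmatically by normalizing both series and looking for the interval where the audience's share most exceeds the baseline's. The sketch below assumes each time series result is a list of entries with a Unix timestamp under `key` and a count under `interactions`. The function names and the sample counts are hypothetical.

```python
def normalized_series(results, field="interactions"):
    """Map each timestamp to its share of the series total."""
    total = sum(r[field] for r in results)
    return {r["key"]: (r[field] / total if total else 0.0) for r in results}

def largest_excess(audience_results, baseline_results):
    """Return (timestamp, excess) where the audience share most exceeds the baseline."""
    aud = normalized_series(audience_results)
    base = normalized_series(baseline_results)
    # Intervals missing from the baseline are treated as a zero share.
    diffs = {ts: share - base.get(ts, 0.0) for ts, share in aud.items()}
    peak = max(diffs, key=diffs.get)
    return peak, diffs[peak]

# Made-up hourly counts: the audience spikes in the second hour, the baseline is flat.
audience_sample = [{"key": 0, "interactions": 10},
                   {"key": 3600, "interactions": 30},
                   {"key": 7200, "interactions": 10}]
baseline_sample = [{"key": 0, "interactions": 100},
                   {"key": 3600, "interactions": 100},
                   {"key": 7200, "interactions": 100}]
peak_ts, excess = largest_excess(audience_sample, baseline_sample)
```

The function expects a non-empty audience series, so check for empty results before calling it.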

Baselining geographic analysis results

You'll often want to investigate how your audience varies across geographies, and baselining is important here too. For example, when you analyze audiences in the USA, California and Texas often come out on top because there are very active Facebook audiences in these states.

We can take a look at US states represented in our index like so:

analyze_parameters = {
  "analysis_type": "freqDist",
  "parameters": {
    "target": "fb.author.region",
    "threshold": 5
  }
}

region_audience = audience_client.pylon.analyze('audience index id', analyze_parameters)
region_baseline = baseline_client.pylon.analyze('baseline index id', analyze_parameters)

Plotting the results with normalization reveals that California is overrepresented in our audience:

baselining-geography-normalised
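Because each query returns only its top regions (five with the threshold used here), the two results may not contain the same keys. The sketch below aligns them before comparing. `share_differences` is a hypothetical name, the counts are made up, and treating a missing region as a zero share is an approximation, since the region may simply have fallen below the threshold.

```python
def share_differences(audience_results, baseline_results, field="unique_authors"):
    """Percentage-point gap between audience and baseline share, largest first."""
    def shares(results):
        total = sum(r[field] for r in results)
        return {r["key"]: r[field] / total for r in results} if total else {}

    aud, base = shares(audience_results), shares(baseline_results)
    regions = set(aud) | set(base)
    # A region absent from one result is approximated as a zero share there.
    gaps = {reg: 100 * (aud.get(reg, 0.0) - base.get(reg, 0.0)) for reg in regions}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)

# Made-up counts, for illustration only.
audience_regions = [{"key": "California", "unique_authors": 600},
                    {"key": "Texas", "unique_authors": 300},
                    {"key": "Florida", "unique_authors": 100}]
baseline_regions = [{"key": "California", "unique_authors": 400},
                    {"key": "Texas", "unique_authors": 400},
                    {"key": "New York", "unique_authors": 200}]
ranked = share_differences(audience_regions, baseline_regions)
```

Raising the threshold on both queries reduces the number of regions that have to be approximated.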

Baselining temporal analysis results

Building on the time series analysis example, for some use cases it can be interesting to see at what times of day activity peaks. For example, do people discuss buying cars more frequently at certain times of the day? Again this is more easily revealed through baselining.

To perform this analysis you would need to record a general audience not simply focused on automotive topics. For example you could use Facebook engagement from users in California as your baseline index, then compare people in California discussing buying cars from the automotive index with your wider California index.

analyze_parameters = {
    'analysis_type': 'timeSeries',
    'parameters':
    {
        'span': 1,
        'interval': 'hour'
    }
}

filter_buy = ' fb.author.region == "California" AND fb.all.content contains_any "buy,new,purchase,deal,offer,offers" '
hours_audience = audience_client.pylon.analyze('audience index id', analyze_parameters, filter=filter_buy)
hours_baseline = baseline_client.pylon.analyze('california baseline index id', analyze_parameters)

In this case, before we plot the chart we group interaction counts into hourly buckets to get total counts for each hour of the day. We also normalize the counts before plotting, as described in the solution above.

baselining-temporal-normalised_0

The x-axis shows the hour of the day in the GMT timezone. You can see there is a peak at 2am, which is 6pm in local time. We can conclude that the audience discusses buying cars in the early evening.
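The hourly grouping described above can be sketched as follows, assuming each time series entry has a Unix timestamp under `key` and a count under `interactions`. The function name `hour_of_day_shares` and the sample counts are hypothetical. Hours are taken in UTC, which for this purpose is the same as GMT.

```python
from collections import defaultdict
from datetime import datetime, timezone

def hour_of_day_shares(results, field="interactions"):
    """Sum hourly time series counts by hour of day (UTC), then normalize."""
    buckets = defaultdict(int)
    for r in results:
        hour = datetime.fromtimestamp(r["key"], tz=timezone.utc).hour
        buckets[hour] += r[field]
    total = sum(buckets.values())
    # Always return all 24 hours so audience and baseline line up on a chart.
    return {h: (buckets[h] / total if total else 0.0) for h in range(24)}

# Made-up counts: two entries at 02:00 UTC on consecutive days, one at 14:00 UTC.
sample = [{"key": 1450404000, "interactions": 30},
          {"key": 1450490400, "interactions": 50},
          {"key": 1450447200, "interactions": 20}]
hourly = hour_of_day_shares(sample)
```

Apply the same function to the baseline result and plot the two sets of shares side by side.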

Resources

For further reading, take a look at the following resources:

You may also find it useful to test your analysis queries with the example automotive recording referenced in this pattern. Contact your sales representative or account manager for access to the 'analysis sandbox'.