Building an Archive of Analysis Results

A key feature of PYLON's privacy model is that data is removed from your index after 32 days. If you want to provide your customers with analysis results from further in the past, you will need to consider how to work within this limitation. In this article we'll look at how you can build an analysis archive so that you can store and serve analysis results over long periods of time.

Why create an analysis archive?

You'll be aware that PYLON automatically deletes interactions from your recordings after 32 days. You can in fact think of a recording as a moving conveyor belt that holds 32 days of data but no more. When you submit an analysis query you can analyze any time period within the last 32 days but no further in the past.

If you are building a product for your customers this presents a challenge: how can you show results for periods further back than 32 days?

The answer is to build an analysis archive. The idea is that you regularly submit a set of analysis queries and store their results in your archive. You then draw your charts from the data in the archive.

[Figure: archive architecture]

Although building an archive does require some work it does bring a number of advantages:

  • Performance - Analysis queries are computationally expensive and can take a number of seconds to return from the platform. You can improve the performance of your application by serving results from your archive.
  • Rate limit conservation - Repeatedly hitting pylon/analyze will use up your rate limit. You can serve analysis results you frequently need from your archive and reduce your API calls.
  • Future proofing - You may not need an archive of results right now, but it's very likely you will in the future. Starting an archive now gives you more options for improving your product going forward.

In fact, as best practice we recommend you serve results in your application from a cache or archive to give the best experience to your customers.

Solution

Let's look at a worked example of how you can implement an analysis archive.

In this example we'll look at how you can create a time series which shows data from more than 32 days in the past, and how you can allow users to 'zoom in' on periods to look at results in detail.

Of course there are many visualizations you might want to create, but you'll see that it is critical to decide up front what results you will need to display, then design your analysis queries accordingly to fill your archive with results.

Let's assume for this example that you are serving reports for a brand and that you are recording brand mentions to your index using an interaction filter such as:

fb.all.content contains "BMW" AND fb.author.country == "United States"

Step 1 - Design your required visualizations

As stated above the first step is to decide what data you need to present to your user.

For this example let's imagine you'd like to create a time series which shows the volume of mentions of a brand over a long period of time. You will plot the volume of interactions recorded into your index each day, and will display the past six months of data.

You also want your users to be able to click on any day on the time series to 'zoom in' and see a breakdown of the audience during this period. You will plot the number of unique_authors in each demographic group on your chart.

[Figure: time series chart with zoomed-in audience breakdown]

To be able to draw the main chart you will need to store the volume of interactions (mentions) each day in your archive. To be able to draw the zoomed-in chart you will need to store the results of an age-gender analysis each day in your archive. You will need to do this on an ongoing basis so that when a user looks at the chart in your application there is always the last 6 months of data available for viewing.

Notice that we're looking to store the daily number of mentions, and display a detailed analysis for any given day. Keeping the resolution of the time series and the detailed result the same makes our job much easier.

Step 2 - Design your database schema

Now that you've decided on the charts you want to display you can design your database schema to match. You may be using a NoSQL database, but for now let's assume you are using a SQL database schema to illustrate the principles.

For your main time series you will need a table in your schema like so:

Date         interactions   unique_authors
2016/07/01   13200          12500
2016/07/02   14100          13800
...          ...            ...

You can then query this table for a range of dates and return the interaction count for each date to plot on the chart. It's not necessary to store the unique_author count, but here we've included it as you may find it useful in future.

For your zoomed in chart you will need a table like so:

Date         gender   age     interactions   unique_authors
2016/07/01   male     18-24   2100           2000
2016/07/01   female   18-24   1400           1200
...          ...      ...     ...            ...

In this table there will be a row for each demographic group for each day. So each day will require 12 rows (6 age groups multiplied by 2 genders).

When your user selects a date you can query this table for the date and return the 12 unique author counts to plot on your chart.
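As an illustration, here is a minimal sketch of this schema using Python's built-in sqlite3 module; the table and column names (daily_mentions, daily_demographics) are illustrative only, and any relational or NoSQL store would work equally well.

import sqlite3

# A minimal sketch of the archive schema; table and column names are illustrative only.
conn = sqlite3.connect('archive.db')

conn.execute("""
    CREATE TABLE IF NOT EXISTS daily_mentions (
        date            TEXT PRIMARY KEY,   -- e.g. '2016/07/01'
        interactions    INTEGER,
        unique_authors  INTEGER
    )
""")

conn.execute("""
    CREATE TABLE IF NOT EXISTS daily_demographics (
        date            TEXT,               -- e.g. '2016/07/01'
        gender          TEXT,               -- e.g. 'male'
        age             TEXT,               -- e.g. '18-24'
        interactions    INTEGER,
        unique_authors  INTEGER,
        PRIMARY KEY (date, gender, age)
    )
""")

conn.commit()

Drawing the main chart is then a simple range query, for example SELECT date, interactions FROM daily_mentions WHERE date BETWEEN ? AND ?, and the zoomed-in chart is a lookup of the 12 rows for the selected date.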

Step 3 - Design your analysis queries

Now that you have your schema designed you can consider your analysis queries.

In this case you can retrieve the necessary data for both charts by submitting one query each day: a nested age-gender breakdown. Results for the query will be in the following format:

{
    "analysis": {
        "analysis_type": "freqDist",
        "parameters": {
            "target": "fb.author.gender",
            "threshold": 2
        },
        "results": [...]
    },
    "interactions": 215600,
    "unique_authors": 161000
}

The top-level count of interactions can be used for your main time series, and the detailed analysis results for your zoomed-in chart.

It's important that when you submit your query each day you correctly set the start and end parameters for your pylon/analyze call. Your start and end parameters need to express a time period that is 24 hours long, to equal the day you are querying for.

For example, if you submit a query on the 5th of July 2016 at 09:00, your start time would be 2016/07/04 00:00:00 and your end time would be 2016/07/05 00:00:00. Your analysis would then cover the data captured on the previous day. Of course, you also need to factor timezones into your calculations.

So for example in Python:

import calendar
from datetime import datetime, timedelta

from datasift import Client

datasift = Client("your username", "identity API key")

# Analyze the previous full day: midnight to midnight (UTC here for simplicity)
today = datetime.utcnow().replace(hour=0, minute=0, second=0, microsecond=0)
yesterday = today - timedelta(days=1)

start = calendar.timegm(yesterday.timetuple())  # yesterday 00:00:00 as a Unix timestamp
end = calendar.timegm(today.timetuple())        # today 00:00:00 as a Unix timestamp

analyze_parameters = {
    'analysis_type': 'freqDist',
    'parameters': {
        'threshold': 2,
        'target': 'fb.author.gender'
    },
    'child': {
        'analysis_type': 'freqDist',
        'parameters': {
            'threshold': 6,
            'target': 'fb.author.age'
        }
    }
}

daily_result = datasift.pylon.analyze('recording id', analyze_parameters, start=start, end=end)

# save daily_result to the archive database
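To show how the response maps onto the two tables, here is a minimal sketch of the save step. It assumes the sqlite3 connection and tables from the schema example above, that the client returns the JSON response as a dict, and that each nested freqDist entry carries key, interactions, unique_authors and a child analysis.

# Store the top-level counts for the main time series chart.
date = yesterday.strftime('%Y/%m/%d')
conn.execute(
    "INSERT INTO daily_mentions (date, interactions, unique_authors) VALUES (?, ?, ?)",
    (date, daily_result['interactions'], daily_result['unique_authors'])
)

# Store one row per gender/age group for the zoomed-in chart.
for gender_entry in daily_result['analysis']['results']:
    for age_entry in gender_entry['child']['results']:
        conn.execute(
            "INSERT INTO daily_demographics (date, gender, age, interactions, unique_authors) "
            "VALUES (?, ?, ?, ?, ?)",
            (date, gender_entry['key'], age_entry['key'],
             age_entry['interactions'], age_entry['unique_authors'])
        )

conn.commit()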

Step 4 - Fill your archive with data

All that is left to do now is to run the analysis query regularly, once a day.

Over time your archive will be filled with daily results and you can begin to serve charts with the data in your product or reports.

You might consider at this point running your queries in a batch for the last 30 days. This would populate your archive with a backfilled set of data. Otherwise you would need to wait a number of days before your charts show meaningful results.
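As an illustration, a backfill could look something like the sketch below, which reuses analyze_parameters and the start/end calculation from Step 3; save_daily_result is a hypothetical helper wrapping the insert logic shown earlier. Remember that each call counts against your rate limit, so you may want to spread the backfill out.

# Backfill the archive with daily results for the last 30 days.
for days_ago in range(30, 0, -1):
    day_start = today - timedelta(days=days_ago)
    day_end = day_start + timedelta(days=1)

    result = datasift.pylon.analyze(
        'recording id',
        analyze_parameters,
        start=calendar.timegm(day_start.timetuple()),
        end=calendar.timegm(day_end.timetuple())
    )

    # save_daily_result is a hypothetical helper wrapping the inserts shown above
    save_daily_result(day_start.strftime('%Y/%m/%d'), result)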

Considerations

When you implement your archive in a production application keep in mind:

  • It is likely that you will update the filter for your recording over time. When you do so, make sure you record the time and date of each update so that you can keep this in mind when using results in your archive. Clearly if you change your filter then results before and after have a different meaning.
  • Even if you do not plan to serve long term results, consider building an archive to reduce your API calls and provide fast performance for your users.
  • You will need to decide on a strategy to clean up your archive over time. The amount of data stored in the example is minimal, but if you have many customers and many charts to serve your archive could grow significantly. Think about removing results which are no longer needed by your customers; a minimal clean-up sketch follows this list.
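For example, a scheduled clean-up job could be as simple as the following sketch, assuming the sqlite3 tables from the earlier schema example and a hypothetical retention period of one year:

# Remove archived results older than a (hypothetical) one-year retention period.
cutoff = (today - timedelta(days=365)).strftime('%Y/%m/%d')

conn.execute("DELETE FROM daily_mentions WHERE date < ?", (cutoff,))
conn.execute("DELETE FROM daily_demographics WHERE date < ?", (cutoff,))
conn.commit()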

Additional examples

Avoiding double-counting of authors with roll-up analysis queries

In the example above we use daily queries to fill our archive. What happens though if you want to provide a summary to your customer of a different time period? For example, perhaps you want to show the audience breakdown for a week not a day.

You could simply sum the unique authors across the results for 7 different days. However, if you do there is a risk you will double-count authors, as the same author might be part of two distinct daily analysis results.

You might consider this a small enough risk to ignore. Another approach, though, would be to introduce a new query to your schedule: a weekly age-gender breakdown which you also store in your archive.

start = calendar.timegm((today - timedelta(days=7)).timetuple())  # 7 days ago, 00:00:00
end = calendar.timegm(today.timetuple())                          # today 00:00:00

You could also submit this query daily so that you have a weekly summary you can retrieve on any day.
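The roll-up itself is just another analyze call over the wider window, reusing the same analyze_parameters; where you store the result (for example a separate weekly table keyed by the end date of the window) is up to you.

# Weekly roll-up: a single query covering the last 7 full days, submitted daily.
weekly_result = datasift.pylon.analyze('recording id', analyze_parameters, start=start, end=end)

# store weekly_result against the end date of the window, e.g. in a separate weekly table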

You could also consider adding a query for monthly roll-ups as this is a typical time span your customers might request.

Multi-dimensional analysis

The example used in the solution above was straightforward. Let's quickly consider a more complex example.

Let's imagine you are in fact comparing three brands. Your time series needs to show a line for each brand, and when a user clicks on a day you will show an audience breakdown for each brand.

The table for your main time series chart will now be:

Date         Brand     interactions   unique_authors
2016/07/01   brand 1   13200          12500
2016/07/01   brand 2   14100          13800
2016/07/01   brand 3   11000          10800
...          ...       ...            ...

And for your zoomed-in chart:

Date         Brand     gender   age     interactions   unique_authors
2016/07/01   brand 1   male     18-24   2100           2000
2016/07/01   brand 2   female   18-24   1400           1200
...          ...       ...      ...     ...            ...

Essentially for each table we've added an additional dimension for the brand the result relates to.

To fill these tables you would submit the same analysis query used in the solution, except that you would submit it 3 times a day, once for each brand. You would specify the brand in the filter parameter for the query.

import calendar
from datetime import datetime, timedelta

from datasift import Client

datasift = Client("your username", "identity API key")

# Analyze the previous full day: midnight to midnight (UTC here for simplicity)
today = datetime.utcnow().replace(hour=0, minute=0, second=0, microsecond=0)
yesterday = today - timedelta(days=1)

start = calendar.timegm(yesterday.timetuple())  # yesterday 00:00:00 as a Unix timestamp
end = calendar.timegm(today.timetuple())        # today 00:00:00 as a Unix timestamp

analyze_parameters = {
    'analysis_type': 'freqDist',
    'parameters': {
        'threshold': 2,
        'target': 'fb.author.gender'
    },
    'child': {
        'analysis_type': 'freqDist',
        'parameters': {
            'threshold': 6,
            'target': 'fb.author.age'
        }
    }
}

brand1_result = datasift.pylon.analyze('recording id', analyze_parameters, start=start, end=end, filter='interaction.tag_tree.brand == "brand 1"')

brand2_result = datasift.pylon.analyze('recording id', analyze_parameters, start=start, end=end, filter='interaction.tag_tree.brand == "brand 2"')

brand3_result = datasift.pylon.analyze('recording id', analyze_parameters, start=start, end=end, filter='interaction.tag_tree.brand == "brand 3"')

Again the top level interaction and unique author counts returned by the queries could be used for the main chart, and the detailed nested results for the zoomed-in charts.

This solution is a little more complex, but in principle it is very much the same.

Resources

For further reading, take a look at the following resources: