Analyzing Data

PYLON supports many options for analyzing your recorded data. In this guide you'll learn the options available to you.

Analyzing data in PYLON

When you work with PYLON you start by recording data you'd like to analyze into an index. Once you have data recorded into an index you use analysis queries to retrieve aggregated results from your index.

PYLON supports time series and frequency distribution analysis of data.

Analysis can be performed on your entire index or subsets of data. You can use query filters written in CSDL to filter interactions before analysis takes place. You can also specify a time span for your analysis.

As you'll often want to perform multidimensional analysis of your data, you can make use of nested queries. Nested queries provide an efficient way to analyze a number of dimensions in a single API call.

Submitting analysis queries

All analysis queries are submitted using the pylon/analyze endpoint.

For each query you specify:

  • id - the id represents the index you are querying
  • parameters - specifies the type of the analysis and any related parameters
  • filter - an optional query filter to run before analysis
  • start, end - an optional time span for the analysis

Results are returned from the endpoint in JSON format. For each analysis result you receive counts for:

  • interactions - the count of interactions that are applicable
  • unique authors - the count of unique authors across the interactions

We'll look at the filter, start and end parameters later. Let's start by looking at each type of analysis.

Time series

A time series analysis allows you to see how the volume of interactions and unique authors varies over time across your recorded data.

To perform a time series analysis you call the analyze endpoint and specify "timeSeries" as the analysis type. For example using the Python client library:

from datasift import Client
datasift = Client("your username", "identity API key")
analyze_parameters = {
    'analysis_type': 'timeSeries',
    'parameters':
    {
        'interval': 'hour',
        'span': 1
    }
}
print(datasift.pylon.analyze('recording id', analyze_parameters))

A time series analysis accepts two parameters which control the size of the intervals the time series will use:

  • interval - the units of the intervals to use
  • span - the number of units for the interval size

The parameters default to 'hour' for interval, and 1 for span. This means counts will be returned in one-hour intervals. This works nicely for a short period of time, for example one day. If you changed these to 'day' for interval and 2 for span, you'd receive data counted in 2-day intervals.
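
For example, a minimal sketch of the parameters for 2-day intervals might look like this (only the parameters change; the client setup and recording id are the same as in the example above):

analyze_parameters = {
    'analysis_type': 'timeSeries',
    'parameters':
    {
        'interval': 'day',  # count in day-sized units
        'span': 2           # two days per data point
    }
}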

You can combine these parameters however you like, but keep in mind the smaller the interval size, the less data you will have to analyze, so you may hit audience-size gate limits. Start with larger intervals then reduce the interval size until you've reached your required resolution.

When you submit a query the results will be formatted like this example:

{
   "unique_authors":520500,
   "interactions":766400,
   "analysis":{
      "analysis_type":"timeSeries",
      "redacted":false,
      "results":[
         {
            "unique_authors":25900,
            "key":1431622800,
            "interactions":32320
         },
         {
            "unique_authors":30500,
            "key":1431626400,
            "interactions":37600
         },
         ...
      ],
      "parameters":{
         "interval":"hour",
         "span":1
      }
   }
}

The top-level counts of unique_authors and interactions relate to the entire analysis. They show that your call processed a total of 766,400 interactions from your index, from 520,500 unique authors.

You can use query filters to generate multiple time series for comparison, for example comparing engagement by each gender. In this case you would submit two analysis queries, the first with a query filter specifying males, the second specifying females. You could then plot the results on one chart.
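
As a sketch, reusing the timeSeries parameters from above, the two calls might look like this (the filter argument is covered in more detail later in this guide):

results_male = datasift.pylon.analyze('recording id', analyze_parameters, filter='fb.author.gender == "male"')
results_female = datasift.pylon.analyze('recording id', analyze_parameters, filter='fb.author.gender == "female"')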

Tip: When performing time series analysis with 'day' as your interval setting, the platform will use midnight UTC as the daily boundary. You can use the offset parameter to specify a timezone and override this behaviour. See pylon/analyze for more details.

Frequency distributions

A frequency distribution analysis allows you to count the volume of interactions and unique authors across a number of categories, for example author genders, topics or tags you've added for classification.

To perform a frequency distribution analysis you call the analyze endpoint and specify "freqDist" as the analysis type.

from datasift import Client
datasift = Client("your username", "identity API key")
analyze_parameters = {
    'analysis_type': 'freqDist',
    'parameters':
    {
        'target': 'fb.author.gender',
        'threshold': 3
    }
}
print(datasift.pylon.analyze('recording id', analyze_parameters))

A frequency distribution analysis accepts two parameters:

  • target - the target to analyze
  • threshold - the number of categories you'd like counts for (maximum of 200)

The targets you can use for analysis are listed on the targets page. The query will perform counts for all categories of the target and return the top n results as specified by the threshold parameter.

When you submit a query the results will be formatted like this example:

{
    "interactions": 77631,
    "unique_authors": 44575,
    "analysis": {
        "analysis_type": "freqDist",
        "parameters": {
            "threshold": 3,
            "target": "fb.author.gender"
        },
        "result": [
            {
                "key": "female",
                "interactions": 35288,
                "unique_authors": 17851
            },
            ...
        ],
        "redacted": false
    }
}

Again the top-level counts of unique_authors and interactions tell you how much data has been processed in the analysis.

Tip: When analyzing data it's important you understand exactly what data you are working with. For example, when using Facebook topic data, do you want to analyze stories, engagements or both? Analyzing fb.language will analyze only story interactions, while fb.parent.language will only analyze content from stories that are being engaged with. This is because an engagement does not have the fb.language target and a story does not have the fb.parent.language target, so the two sets are mutually exclusive. However, because both stories and engagements have the fb.author.* targets, using these targets for your analysis will analyze both.
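
As a rough sketch, the two analyses side by side might look like this (the threshold of 5 here is an arbitrary example):

# Language of stories only - engagements do not carry fb.language
story_languages = {
    'analysis_type': 'freqDist',
    'parameters': {'target': 'fb.language', 'threshold': 5}
}
# Language of stories being engaged with - stories do not carry fb.parent.language
engaged_story_languages = {
    'analysis_type': 'freqDist',
    'parameters': {'target': 'fb.parent.language', 'threshold': 5}
}
print(datasift.pylon.analyze('recording id', story_languages))
print(datasift.pylon.analyze('recording id', engaged_story_languages))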

Take a look at our examples and guide on Facebook topic data for more details.

Analyzing subsets of data

One of the key features of PYLON is the ability to split up your index and analyze subsets of data.

For example you could start by analyzing the top links shared by an audience over time. To go further you could analyze the top links shared in an hour by specifying a time span, or analyze the top links shared by each gender using query filters.

Specifying a time span

The examples above used default time periods for the analysis. You can use the start and end parameters to specify a time span you would like to analyze.

We can add to our last example by specifying an hour for the time span:

datasift.pylon.analyze('recording id', analyze_parameters, start=1446562800, end=1446566400)

There are no restrictions on the values you can pass (other than that the dates cannot be in the future), so you can be as granular as you like with the time period to analyze and really dig deep into your index. When you explore your data set initially it's best to start with a wide time period and reduce it gradually, so you avoid hitting the audience-size gate limit.

Start times are inclusive whereas end times are exclusive. Times are specified as UTC unix timestamps. Read our guide on calculating time spans to learn more on calculating these values.
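
As a sketch, one way to calculate these values with Python's standard library (the dates here are arbitrary examples) is:

import calendar
from datetime import datetime, timedelta

# Analyze the 24 hours up to midnight UTC on 4 November 2015
end_dt = datetime(2015, 11, 4)
start_dt = end_dt - timedelta(days=1)

start = calendar.timegm(start_dt.utctimetuple())  # inclusive
end = calendar.timegm(end_dt.utctimetuple())      # exclusive

datasift.pylon.analyze('recording id', analyze_parameters, start=start, end=end)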

Tip: As a best practice we recommend you always specify a time span for your analysis. The parameters are optional, but it's important you understand the exact time period you are analyzing so you can interpret your results accurately.

Using query filters

Query filters allow you to filter data in your index before it is analyzed. Query filters are written in CSDL and are similar to interaction filters, although not all the same targets and operators are available.

We can add to our last example by specifying a query for each gender, each with the applicable query filter:

results_male = datasift.pylon.analyze('recording id', analyze_parameters, filter='fb.author.gender == "male"')
results_female = datasift.pylon.analyze('recording id', analyze_parameters, filter='fb.author.gender == "female"')

You can include up to 30 conditions in the CSDL for your query filter. So for example you could specify males in California with this query filter:

fb.author.gender == "male" AND fb.author.region == "California"

In practice you may end up submitting many queries with query filters, analyzing lots of different subsets of your index; for example, you might perform a query once for each country, as in the sketch below.
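
A rough sketch of that approach, assuming a fb.author.country target is available to your query filters (check the targets page for the exact target names):

countries = ['United States', 'United Kingdom', 'France']
results = {}
for country in countries:
    csdl = 'fb.author.country == "{0}"'.format(country)
    results[country] = datasift.pylon.analyze('recording id', analyze_parameters, filter=csdl)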

Analysis using nested queries

As well as submitting individual queries, the platform also supports nested queries. Nested queries allow you to perform frequency distribution analysis at multiple levels.

If, for instance, you'd like to break down data by gender and then analyze by age within each gender group, a nested query allows you to do this in one call to the API.

To perform a nested query you specify a child argument for the parent analysis:

from datasift import Client
datasift = Client("your username", "identity API key")
analyze_parameters = {
    'analysis_type': 'freqDist',
    'parameters':
    {
        'threshold': 2,
        'target': 'fb.author.gender'
    },
    'child':
    {
        'analysis_type': 'freqDist',
        'parameters':
        {
            'threshold': 5,
            'target': 'fb.author.age'
        }
    }
}
print(datasift.pylon.analyze('recording id', analyze_parameters))

Take a look at our nested analysis queries guide to learn more.

Analysis query limits

Redaction and quantization

Analysis results from PYLON are subject to rules that protect the privacy of authors.

Any result from the API must represent at least 1000 unique authors. If you submit a query and the result would not represent 1000 unique authors then the result will be redacted. If a result is redacted you will see this in the API response:

{
   "analysis": {
        ...
        "redacted": true
    }
}

In addition, each individual data point within an analysis must represent at least 100 unique authors, otherwise it will be omitted from the results. If, for example, you submit a frequency distribution query and the result is not redacted but some categories are missing, those categories will have been omitted for this reason.

All unique author counts are rounded to the nearest 100; this is called quantization.
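
A minimal sketch of checking for redaction before using a result, assuming the client returns the response as a dictionary mirroring the JSON shown above:

response = datasift.pylon.analyze('recording id', analyze_parameters)

if response['analysis']['redacted']:
    # Fewer than 1,000 unique authors - no counts are returned
    print('Result redacted; try a wider time span or a less specific query filter')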

Read our audience-size gate limit guide for more details on how this limit is applied and how to work within it.

API limits

When you start to build production applications with PYLON you'll start submitting many analysis queries. You might need to submit hundreds of queries to build a complex dashboard.

Your account is subject to two API rate limits - the first is a general limit for all API calls, the second is a specific limit for calls to the /pylon/analyze endpoint.

You can read more about limits on our Platform Allowances page.

Solution design considerations

Templating analysis queries

In your solution it's likely you'll want to analyze the same sets of targets repeatedly, but do so for different slices of your data.

For instance, imagine you are analyzing the audience for a brand and its competitors. For each brand you'll want to perform the same analysis and compare the results.

It makes sense to abstract this idea in your code. For example you could regularly call this method (written in pseudo-code) for each brand:

analyze_brand (brand):
  analyze_gender(brand)
  analyze_regions(brand)
  analyze_age(brand)

It's definitely worth designing your code with reusable templates in mind, as in practice for any use case you'll no doubt want to perform many more queries on a regular basis.
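
A minimal sketch of this idea in Python; the template names, targets and thresholds below are placeholders for your own choices:

# Reusable query templates, one per dimension you want to analyze
TEMPLATES = {
    'gender': {'analysis_type': 'freqDist',
               'parameters': {'target': 'fb.author.gender', 'threshold': 2}},
    'age':    {'analysis_type': 'freqDist',
               'parameters': {'target': 'fb.author.age', 'threshold': 6}},
    'region': {'analysis_type': 'freqDist',
               'parameters': {'target': 'fb.author.region', 'threshold': 10}},
}

def analyze_brand(recording_id, brand_filter):
    # Run every template against the same slice of the index
    return {name: datasift.pylon.analyze(recording_id, params, filter=brand_filter)
            for name, params in TEMPLATES.items()}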

Caching analysis results

You should consider caching the results of any analysis queries you submit. If you cache results you can avoid repeating the same queries so lessen the chance of hitting your API limits. You can also use cached results to build a more responsive user interface for your users.

You could choose to submit queries every 5 minutes to provide an up-to-date set of analysis results.
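
As a rough sketch, an in-memory cache with a five-minute expiry might look like this (in production you would more likely use a shared store such as Redis or memcached):

import time

cache = {}        # analysis results keyed by a description of the query
CACHE_TTL = 300   # refresh results after 5 minutes

def cached_analyze(recording_id, params, **kwargs):
    key = repr((recording_id, params, sorted(kwargs.items())))
    entry = cache.get(key)
    if entry is None or time.time() - entry['fetched'] > CACHE_TTL:
        result = datasift.pylon.analyze(recording_id, params, **kwargs)
        entry = {'fetched': time.time(), 'result': result}
        cache[key] = entry
    return entry['result']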

Storing analysis results long term

You'll be aware that data is only retained in indexes for 32 days; effectively an index is a 'rolling window' of data recorded from your filter. If you're looking to provide long-term results to end users then you cannot use your index to do so. You need to regularly store results from analysis queries in your own data store and serve these to your users.
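
A minimal sketch of persisting results, assuming each response is available as a plain dictionary; here SQLite stands in for whatever data store you choose:

import json
import sqlite3
import time

db = sqlite3.connect('analysis_results.db')
db.execute('CREATE TABLE IF NOT EXISTS results '
           '(recorded_at INTEGER, query TEXT, response TEXT)')

def store_result(query_description, response):
    # Keep the raw JSON so it can be served long after the index has rolled over
    db.execute('INSERT INTO results VALUES (?, ?, ?)',
               (int(time.time()), query_description, json.dumps(response)))
    db.commit()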

Next steps…

This guide has given you a good start for understanding analysis in PYLON. There is a huge range of analysis options open to you given the number of targets you can analyze and the number of ways you can split the data in your index.

Learn more about the possibilities through these resources: