Analyzing Data

PYLON supports many options for analyzing LinkedIn Engagement Insights data. This guide introduces each of the options available to you.

Asynchronous analysis tasks

PYLON for LinkedIn Engagement Insights provides a large shared index of data for you to analyze. Due to the scale of the index, analysis requests can take a few seconds to complete. To support these long-running analysis queries, v1.4 of the DataSift API introduces the new Task API.

Using the Task API you run an analysis task by following these steps:

  • Submit an analysis task
  • Wait for the analysis task to complete
  • Fetch the results of the analysis task

In the next section we'll look at each of these steps in detail.

Running analysis tasks

Submitting analysis tasks

First we'll look at how you submit an analysis task.

The most common form of analysis you'll perform is a frequency distribution. A frequency distribution analysis allows you to count the volume of interactions and unique authors across a given target, for example a LinkedIn member's industry.

As an example, let's submit a query to find which industries are most active (performing the most interactions) on LinkedIn. We'll perform the analysis using the li.user.member.employer_industry_names target, which tells us a member's industry, to produce a result similar to this chart:

[Figure: example frequency distribution chart of top industries]

Analysis tasks are submitted using the POST /pylon/linkedin/task endpoint.

The following code uses the Python client library to submit the task:

# include the datasift libraries
import sys
from datasift import Client

# create a client
datasift = Client("ACCOUNT_API_USERNAME", "IDENTITY_API_KEY")

index_id = "INDEX_ID"

task_parameters = {
    'parameters': {
        'analysis_type': 'freqDist',
        'parameters': {
            'threshold': 5,
            'target': 'li.user.member.employer_industry_names'
        }
    }
}

# Start an analysis task based on the above parameters
task = datasift.pylon.task.create(
    subscription_id=index_id,
    name='Top industries',
    parameters=task_parameters,
    service='linkedin'
)

Note that a frequency distribution analysis accepts two parameters:

  • target - the target to analyze
  • threshold - the number of categories you'd like counts for (maximum of 200)

The targets you can use for analysis are listed in the Target Explorer. The query will perform counts for all categories of the target and return the top n results, as specified by the threshold parameter.

The response from the POST /pylon/linkedin/task API call (called by the pylon.task.create() method) will be a JSON object, containing the 40-character ID of the task you have just created.

{
  "id": "ee35db10ad1539a34200772f9090aa98ef25ff4c"
}

Note: You will be subject to rate limits on the number of tasks you can submit in an hour. This limit can vary depending on the package you have purchased. More details about these rate limits can be found in our Understanding Limits guide.

Monitoring analysis tasks

Now that we have a task ID, we can call the GET /pylon/linkedin/task/{id} endpoint to retrieve the details and status of this specific task.

datasift.pylon.task.get(task['id'], service='linkedin')

The API responds with details of the task, for example:

{
  "id": "ee35db10ad1539a34200772f9090aa98ef25ff4c",
  "identity_id": "6b99750ddb3b304a50cc06e7c884f925",
  "subscription_id": "919abc1226465c7baf6d908df7b67a134cf0b9e4",
  "name": "Top industries",
  "type": "analysis",
  "parameters": {
    "start": 1484221585,
    "end": 1484307985,
    "parameters": {
      "analysis_type": "freqDist",
      "parameters": {
        "target": "li.user.member.employer_industry_names",
        "threshold": 5
      }
    }
  },
  "result": null,
  "status": "queued",
  "created_at": 1484307825,
  "updated_at": 1484307825
}

Here we can see that a number of the values we submitted as part of the analysis task have been returned, along with some additional details:

  • parameters.start - The start timestamp of the period we are analyzing. This defaults to 24 hours ago if a start time is not specified.
  • parameters.end - The end timestamp of the period we are analyzing. This defaults to "now" if an end time is not specified.
  • result - The results of this analysis task. This is currently null as the analysis task is still queued.
  • status - The current status of this analysis task. This is currently queued for processing.
  • created_at - The time at which the analysis task was submitted.
  • updated_at - The time at which the analysis task was last updated. At present, this task has received no updates. It will be updated with the task results when the analysis task is completed.

The analysis task's status will change from queued, to running, and finally to completed when processing is complete.
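
For example, a minimal polling sketch using the pylon.task.get() method shown above (the 5-second interval is an arbitrary choice, and production code should also handle failed tasks and rate limits):

import time

# Poll the task until processing finishes; the interval is an arbitrary choice
task_id = task['id']
task = datasift.pylon.task.get(task_id, service='linkedin')
while task['status'] != 'completed':
    time.sleep(5)
    task = datasift.pylon.task.get(task_id, service='linkedin')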

Tip: You can use the GET /pylon/linkedin/task API endpoint to return a list of your recent tasks.
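
With the Python client this might look like the following (a sketch assuming the client exposes a matching list method):

recent_tasks = datasift.pylon.task.list(service='linkedin')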

Retrieving analysis results

When your analysis task has a status of completed, calling the GET /pylon/linkedin/task/{id} endpoint will retrieve the results of the task. Below is an example of what your results may look like:

{
  "id": "ee35db10ad1539a34200772f9090aa98ef25ff4c",
  "identity_id": "6b99750ddb3b304a50cc06e7c884f925",
  "subscription_id": "919abc1226465c7baf6d908df7b67a134cf0b9e4",
  "name": "Top industries",
  "type": "analysis",
  "parameters": {
    "start": 1484221585,
    "end": 1484307985,
    "parameters": {
      "analysis_type": "freqDist",
      "parameters": {
        "target": "li.user.member.employer_industry_names",
        "threshold": 5
      }
    }
  },
  "result": {
    "interactions": 29543000,
    "unique_authors": 8408100,
    "analysis": {
      "analysis_type": "freqDist",
      "parameters": {
        "target": "li.user.member.employer_industry_names",
        "threshold": 5
      },
      "results": [
        {
          "key": "information technology and services",
          "interactions": 2264100,
          "unique_authors": 593900
        },
        {
          "key": "financial services",
          "interactions": 1152800,
          "unique_authors": 363200
        },
        {
          "key": "computer software",
          "interactions": 855400,
          "unique_authors": 230500
        },
        {
          "key": "staffing and recruiting",
          "interactions": 747300,
          "unique_authors": 142900
        },
        {
          "key": "higher education",
          "interactions": 717100,
          "unique_authors": 223700
        }
      ],
      "redacted": false
    }
  },
  "status": "completed",
  "created_at": 1484307985,
  "updated_at": 1484307989
}

The result object provides us with a number of useful fields:

  • interactions - The total number of interactions analyzed in this task
  • unique_authors - The number of unique authors analyzed in this task
  • analysis - Details of the analysis task parameters and results
  • analysis.results - The results of the analysis task

Each item within the analysis.results array has the following properties:

  • key - The identifier for the category (relating to the target used for the analysis)
  • interactions - The number of interactions for the category
  • unique_authors - The number of unique authors for the category
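
Putting this together, once the task has completed you could read the category counts out of the response like this (a minimal sketch based on the response format shown above):

# Fetch the completed task and walk the frequency distribution results
task = datasift.pylon.task.get(task['id'], service='linkedin')
result = task['result']

print("Total: {0} interactions, {1} unique authors".format(
    result['interactions'], result['unique_authors']))

for item in result['analysis']['results']:
    print("{0}: {1} interactions, {2} unique authors".format(
        item['key'], item['interactions'], item['unique_authors']))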

Analyzing subsets of data

One of the key features of PYLON is the ability to split up an index and analyze subsets of data.

For example you could start by analyzing the top articles shared by an audience over time. To go further you could analyze the top links shared in an hour or day by specifying a time span, or analyze the top articles shared by each gender using query filters.

Applying query filters

Query filters allow you to filter data in an index before it is analyzed. Query filters are written in CSDL, although not all fields are available for filtering. A full list of filterable fields can be found in the Target Explorer.

We can add a filter to our previous example so that only interactions from members in the financial, banking and investment industries are analyzed:

task_parameters = {
    'filter': 'li.root.user.member.employer_industry_names in "banking, financial services, investment banking, investment management"',
    'parameters': {
        'analysis_type': 'freqDist',
        'parameters': {
            'threshold': 5,
            'target': 'li.user.member.employer_industry_names'
        }
    }
}

datasift.pylon.task.create(
    subscription_id=index_id,
    name='Top industries - Engaging with "financial industry" members',
    parameters=task_parameters,
    service='linkedin'
)

You can include up to 30 conditions in the CSDL for your query filter. For example, you could specify male members in some major California metro areas with this query filter:

li.user.member.gender == "male" AND li.user.member.metro_area IN "san francisco bay area, greater los angeles area, greater san diego area"

In practice you may end up submitting many queries with different query filters to analyze many subsets of the index; for example, you might run the same analysis once for each country, as sketched below.
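
The sketch below illustrates this pattern, submitting one task per country. The country values are illustrative, and it assumes li.user.member.country is available as a query filter target (check the Target Explorer):

# Submit one frequency distribution task per country using a query filter
countries = ['united states', 'india', 'france']  # illustrative values

tasks = {}
for country in countries:
    task_parameters = {
        'filter': 'li.user.member.country == "{0}"'.format(country),
        'parameters': {
            'analysis_type': 'freqDist',
            'parameters': {
                'threshold': 5,
                'target': 'li.user.member.employer_industry_names'
            }
        }
    }
    tasks[country] = datasift.pylon.task.create(
        subscription_id=index_id,
        name='Top industries - {0}'.format(country),
        parameters=task_parameters,
        service='linkedin'
    )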

Specifying time spans

The examples above used default time periods for the analysis. You can use the start and end parameters to specify a time span you would like to analyze.

We can add to our last example by specifying a week for the time span:

task_parameters = {
    'filter': 'li.root.user.member.employer_industry_names in "banking, financial services, investment banking, investment management"',
    'start': 1483315200,
    'end': 1483920000,
    'parameters': {
        'analysis_type': 'freqDist',
        'parameters': {
            'threshold': 5,
            'target': 'li.user.member.employer_industry_names'
        }
    }
}

The values you pass for start and end must be valid Unix timestamps, and must fall within the period covered by the index recording, that is, between 30 days ago and now.
You can be as granular as you like with the time period you analyze, letting you dig deep into the index. When you first explore a data set it's best to start with a wide time period and narrow it gradually, so that you avoid hitting redaction limits.

Start times are inclusive whereas end times are exclusive. Times are specified as UTC unix timestamps. Read our guide on calculating time spans to learn more on calculating these values.
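
For example, you could calculate the timestamps for the last 7 complete UTC days like this (a minimal sketch using Python's standard library):

import calendar
from datetime import datetime, timedelta

# Midnight UTC today is the exclusive end of the span
today = datetime.utcnow().replace(hour=0, minute=0, second=0, microsecond=0)
end = calendar.timegm(today.timetuple())

# Start 7 days earlier (inclusive)
start = calendar.timegm((today - timedelta(days=7)).timetuple())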

Tip: As a best practice we recommend you always specify a time span for your analysis. The parameters are optional, but understanding the exact time period you are analyzing is essential for interpreting your results accurately.

Time series analysis

A time series analysis allows you to see how the volume of interactions and unique authors varies over time across the index.

[Figure: example time series chart of interaction volumes over time]

To perform a time series analysis you create an analysis task and specify "timeSeries" as the analysis type. For example using the Python client library:

# include the datasift libraries
import sys
from datasift import Client

# create a client
datasift = Client("ACCOUNT_API_USERNAME", "IDENTITY_API_KEY")

index_id = "INDEX_ID"

task_parameters = {
    'filter': 'li.all.articles.domain == "wsj.com"',
    'parameters': {
        'analysis_type': 'timeSeries',
        'parameters': {
            'interval': 'hour',
            'span': 1
        }
    }
}

# Start an analysis task based on the above parameters
task = datasift.pylon.task.create(
    subscription_id=index_id,
    name='Engagements with wsj.com articles',
    parameters=task_parameters,
    service='linkedin'
)

A time series analysis accepts two parameters which control the size of the intervals the time series will use:

  • interval - the units of the intervals to use
  • span - the number of units for the interval size

The parameters default to 'hour' for interval and 1 for span. This means counts will be returned in one-hour intervals, which works nicely for a short period of time such as one day. If you changed these to 'day' for interval and 2 for span, you'd receive counts in 2-day intervals.
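
For instance, the analysis parameters for 2-day intervals would look like this:

task_parameters = {
    'parameters': {
        'analysis_type': 'timeSeries',
        'parameters': {
            'interval': 'day',
            'span': 2
        }
    }
}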

You can combine these parameters however you like, but keep in mind that the smaller the interval size, the less data you will have to analyze, so you may hit redaction limits. Start with larger intervals, then reduce the interval size until you reach your required resolution.

The results of a time series analysis task will be formatted like this example:

{
  "id": "39167cd4e2a6e2582308a9b0ccabb3ba94a90951",
  "identity_id": "6b99750ddb3b304a50cc06e7c884f925",
  "subscription_id": "919abc1226465c7baf6d908df7b67a134cf0b9e4",
  "name": "Engagements with wsj.com articles",
  "type": "analysis",
  "parameters": {
    "filter": "li.all.articles.domain == \"wsj.com\"",
    "start": 1484524800,
    "end": 1484560800,
    "parameters": {
      "analysis_type": "timeSeries",
      "parameters": {
        "interval": "hour",
        "span": 1
      }
    }
  },
  "result": {
    "interactions": 6700,
    "unique_authors": 5500,
    "analysis": {
      "analysis_type": "timeSeries",
      "parameters": {
        "interval": "hour",
        "span": 1
      },
      "results": [
        {
          "key": 1484524800,
          "interactions": 800,
          "unique_authors": 600
        },
        {
          "key": 1484528400,
          "interactions": 800,
          "unique_authors": 700
        },
        ...
      ],
      "redacted": false
    }
  },
  "status": "completed",
  "created_at": 1484561712,
  "updated_at": 1484561723
}

The top-level counts of unique_authors and interactions relate to the entire analysis. They show that your call processed a total of 6,700 interactions from your index, from 5,500 unique authors.

Notice that here the key for each result item is the start of the time period the counts apply to.
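
For example, you could convert the keys into readable timestamps when preparing a chart (a small sketch using Python's standard library):

from datetime import datetime

# Each key is the UTC start of its interval, as a unix timestamp
for point in task['result']['analysis']['results']:
    when = datetime.utcfromtimestamp(point['key'])
    print("{0}: {1} interactions, {2} unique authors".format(
        when.isoformat(), point['interactions'], point['unique_authors']))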

You can use query filters to generate multiple time series for comparison, for example comparing engagement by each seniority. In this case you would submit an analysis task for each seniority, each with a query filter specifying a certain seniority. You could then plot the results on one chart.

Tip: When performing a time series analysis with 'day' as your interval setting, the platform uses midnight UTC as the daily boundary. You can use the offset parameter to specify a timezone and override this behaviour. See POST /pylon/{service}/task for more details.
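
For example, shifting the daily boundaries to US Eastern Standard Time might look like this (a sketch assuming offset is given in hours alongside interval and span; check the endpoint documentation for the exact format):

task_parameters = {
    'parameters': {
        'analysis_type': 'timeSeries',
        'parameters': {
            'interval': 'day',
            'span': 1,
            'offset': -5  # assumed: offset in hours relative to UTC
        }
    }
}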

Nested analysis queries

Nested analysis queries return frequency distribution results for multiple targets in one API call; obtaining the same analysis results without nesting would require many more API calls.

The maximum depth of nesting is three: a parent, a child and a grandchild. A typical use case is to identify the top member genders for a parent target, and the top age groups within each gender. For example, to analyze the age and gender breakdown for a set of countries you would use the following nesting:

  • Top 3 member countries (parent)
    • Top 2 member genders (child)
      • Top 2 member age groups (grandchild)

The JSON object you would submit to create an analysis task with the nesting above would look like this:

{
  "subscription_id": "919abc1226465c7baf6d908df7b67a134cf0b9e4",
  "name": "Top age and gender by country",
  "type": "analysis",
  "parameters": {
    "parameters": {
      "analysis_type": "freqDist",
      "parameters": {
        "target": "li.user.member.country",
        "threshold": 3
      },
      "child": {
        "analysis_type": "freqDist",
        "parameters": {
          "target": "li.user.member.gender",
          "threshold": 2
        },
        "child": {
          "analysis_type": "freqDist",
          "parameters": {
            "target": "li.user.member.age",
            "threshold": 2
          }
        }
      }
    },
    "start": 1483315200,
    "end": 1484560800,
    "filter": "li.all.contents contains_any \"big data\" OR li.all.articles.summary contains_any \"big data\" OR li.all.articles.title contains_any \"big data\""
  }
}

The platform will analyze the top 3 countries, for each country the top 2 genders, and for each country and gender combination the top 2 age groups.

The result is a nested set of counts, for example:

Parent         Child   Grandchild  Interactions  Unique Authors
united states  male    35-54       35,600        22,100
united states  male    25-34       21,300        13,500
united states  female  35-54       10,300        6,800
united states  female  25-34       8,300         5,500
india          male    35-54       11,300        6,700
india          male    25-34       11,000        6,700
india          female  25-34       3,200         1,800
india          female  35-54       2,300         1,300
france         male    35-54       9,700         5,300
france         male    25-34       6,800         4,100
france         female  35-54       3,300         2,000
france         female  25-34       2,900         1,700
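
In the JSON response each parent result carries its own child analysis. The sketch below walks such a structure, assuming each result item nests its child's results under a child key that mirrors the flat format shown earlier:

# Walk the nested freqDist results: country -> gender -> age group
# (the 'child' key is assumed; check a real response for the exact shape)
for country in task['result']['analysis']['results']:
    for gender in country['child']['results']:
        for age in gender['child']['results']:
            print("{0} / {1} / {2}: {3} interactions, {4} unique authors".format(
                country['key'], gender['key'], age['key'],
                age['interactions'], age['unique_authors']))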

Identifying nested analysis targets

Any analysis target may be used as the parent target, but only a subset of low-cardinality targets (those with fewer than 50 unique values) can be used as child targets. The PYLON Target Explorer tool lists all targets; each target's Properties section includes information about where it may be used.

In the example, the li.user.member.age target may be used in Analysis and Child Analysis Queries, and also in Query Filters:

[Screenshot: Target Explorer properties for the li.user.member.age target]

When a call is made to the POST /pylon/linkedin/task endpoint with nested parameters, the rate limit is the same as for a simple analysis query.

Nested frequency distribution analysis limits

For details on limitations of nested analysis queries, please read the Nested frequency distribution results limit section of the Understanding Limits documentation.