Nested Analysis Queries in PYLON

Ed Stenson | 17th December 2015

In this blog I'm going to introduce nested analysis queries in PYLON. These are queries that allow you to delve more than one level deep. The example I'll use will analyze authors by gender and then divide them further into age demographic ranges.

PYLON is DataSift's privacy first technology that allows you to look at Facebook data on an aggregate level without visibility of the details of individual people. Think of it like this: you can hear what a crowd is saying without zooming in on individuals.

If you're new to the PYLON platform you can discover the basics very quickly with our PYLON 101 and Get Started guides. There are two main steps to consider here:

  1. Run an Interaction Filter to grab the Facebook data you want to study and store it in an index.
  2. Run an Analysis Query against your index.

For example, if an index includes content about hit TV shows you could write an Analysis Query to ask, "What are the 10 things people mention most commonly alongside Game of Thrones?"

The Interaction Filter in step 1 is a piece of CSDL code that you provide as a string to the /pylon/compile endpoint. The Analysis Query in step 2 needs to be wrapped in a JSON object and passed to the /pylon/analyze endpoint.

To summarize:

This filter or query:   Does this:             It is in this form:   With this endpoint:
Interaction Filter      Populates your index   String                /pylon/compile
Analysis Query          Analyzes your index    JSON object           /pylon/analyze
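
To make the "JSON object" form concrete, here is a minimal Python sketch that builds the body of a frequency-distribution Analysis Query. The key names are the ones used in the calls in this post; the helper name build_analysis_query is my own invention, and the hash is the example index used throughout.

```python
import json


def build_analysis_query(index_hash, target, threshold):
    """Build the JSON-serializable body for a POST to /pylon/analyze.

    Hypothetical helper: only the key names come from this post.
    """
    return {
        "hash": index_hash,
        "parameters": {
            "analysis_type": "freqDist",
            "parameters": {"threshold": threshold, "target": target},
        },
    }


query = build_analysis_query(
    "521631745ae3368ed63f6764f5a11f8e", "fb.author.gender", 3
)
# Serialize to the string that would be sent as the request body.
print(json.dumps(query, indent=4))
```

Building the body as a dictionary and serializing it at the last moment avoids hand-escaping quotes inside a shell string.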

We have some PYLON Interaction Filters running perpetually to make sure there's always example data available in indexes for internal testing. One of them looks at the automotive sector. By using this index I can skip step 1 entirely.

The fact that I didn't have to start from scratch is important. I'm using an index populated by an Interaction Filter that has been running for a while and has already stored plenty of data. This makes life easy because, if an index doesn't hold enough data, our privacy rules kick in and Analysis Query results are 'redacted'. Queries don't return data unless there's enough content from enough unique authors to make it unrealistic for anyone to attempt to de-anonymize the results.

While redaction is a great idea, it can trip you up at first. Here's my initial call to the /pylon/analyze endpoint:

curl -X POST https://api.datasift.com/v1.2/pylon/analyze \
    -d '{ "hash": "521631745ae3368ed63f6764f5a11f8e", "parameters": { "analysis_type": "freqDist", "parameters": { "threshold": 3, "target": "fb.author.gender" } } }' \
    -H 'Authorization: id:api_key' \
    -H "Content-type: application/json"

The JSON it returned looked like this:

{
    "interactions": 0,
    "unique_authors": 0,
    "analysis": {
            "analysis_type": "freqDist",
            "parameters": {
                "target": "fb.author.gender",
                "threshold": 3
            },
            "results": [],
            "redacted": true
    }
}

It reiterates my original Analysis Query and that final key-value pair

"redacted": true

is unequivocal but, otherwise, this JSON data doesn't tell us anything else. The problem here is that I didn't specify the start parameter in my call so, by default, the Analysis Query looked only at the last 24 hours stored in the index. Evidently there was not enough data in that window, so the results were redacted.
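
A client can spot this situation programmatically and widen the time window. The sketch below is only an illustration: the helper names are hypothetical, and it assumes that start is a Unix timestamp sent as a string, which is how it appears in the calls in this post.

```python
import time


def start_days_ago(days, now=None):
    """Return a Unix timestamp, as a string, for `days` days before `now`."""
    if now is None:
        now = time.time()
    return str(int(now) - days * 24 * 60 * 60)


def is_redacted(response):
    """True when the top-level analysis in a response was redacted."""
    return bool(response.get("analysis", {}).get("redacted", False))


# The redacted response from this post, trimmed to the relevant keys.
redacted_response = {
    "interactions": 0,
    "unique_authors": 0,
    "analysis": {"results": [], "redacted": True},
}

if is_redacted(redacted_response):
    # Widen the window, for example to the last seven days, and ask again.
    print("Redacted; retry with start =", start_days_ago(7))
```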

Here's the same call with a start parameter added, so that the query covers more than the last 24 hours of data:

curl -X POST https://api.datasift.com/v1.2/pylon/analyze \
    -d '{ "hash": "521631745ae3368ed63f6764f5a11f8e", "parameters": { "analysis_type": "freqDist", "parameters": { "threshold": 3, "target": "fb.author.gender" } }, "start": "1449587555" }' \
    -H 'Authorization: id:api_key' \
    -H "Content-type: application/json"

The JSON return object looked like this:

{
    "interactions": 5211000,
    "unique_authors": 3465900,
    "analysis": {
            "analysis_type": "freqDist",
            "parameters": {
                "target": "fb.author.gender",
                "threshold": 3
            },
            "results": [{
                "key": "male",
                "interactions": 3695200,
                "unique_authors": 2390100
            }, {
                "key": "female",
                "interactions": 1403100,
                "unique_authors": 1039600
            }, {
                "key": "unknown",
                "interactions": 42900,
                "unique_authors": 29900
            }],
            "redacted": false
    }
}

This divides the authors into three gender classes: male, female, and unknown (undetected or unavailable). Now that we know the genders, we could make three further API calls, looking at an age breakdown for women first, then for men, and then for the "unknown" group.

However, it's more efficient to take a step back and create a single query that does the whole job in one call, finding the genders first and then breaking them down into age demographic categories.

PYLON's nested queries allow you to do this. Here's the call. I set the threshold to 2 to restrict our results to just male and female, in order to reduce the JSON output to a manageable size. Note the way the child key shows where the top-level query ends and the lower-level query begins:

curl -X POST https://api.datasift.com/v1.2/pylon/analyze \
    -d '{ "hash": "521631745ae3368ed63f6764f5a11f8e", "parameters": { "analysis_type": "freqDist", "parameters": { "threshold": 2, "target": "fb.author.gender" }, "child": { "analysis_type": "freqDist", "parameters": { "threshold": 9, "target": "fb.author.age" } } }, "start": "1447675740" }' \
    -H 'Authorization: id:api_key' \
    -H "Content-type: application/json"

The JSON return object clearly shows the gender/age breakdown:

{
    "interactions": 14240500,
    "unique_authors": 8389600,
    "analysis": {
            "analysis_type": "freqDist",
            "parameters": {
                "target": "fb.author.gender",
                "threshold": 2
            },
            "results": [{
                "key": "male",
                "interactions": 10148600,
                "unique_authors": 5383500,
                "child": {
                    "analysis_type": "freqDist",
                    "parameters": {
                        "target": "fb.author.age",
                        "threshold": 9
                    },
                    "results": [{
                        "key": "18-24",
                        "interactions": 3201700,
                        "unique_authors": 1656100
                    }, {
                        "key": "25-34",
                        "interactions": 3112000,
                        "unique_authors": 1734500
                    }, {
                        "key": "35-44",
                        "interactions": 1958400,
                        "unique_authors": 1018500
                    }, {
                        "key": "45-54",
                        "interactions": 1101700,
                        "unique_authors": 557400
                    }, {
                        "key": "55-64",
                        "interactions": 498600,
                        "unique_authors": 244700
                    }, {
                        "key": "65+",
                        "interactions": 275900,
                        "unique_authors": 145500
                    }],
                    "redacted": false
                }
            }, {
                "key": "female",
                "interactions": 3766800,
                "unique_authors": 2853800,
                "child": {
                    "analysis_type": "freqDist",
                    "parameters": {
                        "target": "fb.author.age",
                        "threshold": 9
                    },
                    "results": [{
                        "key": "25-34",
                        "interactions": 963500,
                        "unique_authors": 722300
                    }, {
                        "key": "18-24",
                        "interactions": 823700,
                        "unique_authors": 635800
                    }, {
                        "key": "35-44",
                        "interactions": 782800,
                        "unique_authors": 599000
                    }, {
                        "key": "45-54",
                        "interactions": 601300,
                        "unique_authors": 480600
                    }, {
                        "key": "55-64",
                        "interactions": 372600,
                        "unique_authors": 260100
                    }, {
                        "key": "65+",
                        "interactions": 222800,
                        "unique_authors": 158300
                    }],
                    "redacted": false
                }
            }],
            "redacted": false
    }
}
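
For charting or export, a nested response like this one is often easier to handle as flat rows. The following Python sketch walks the structure shown above; the function name is hypothetical, the sample is the response trimmed to two age ranges per gender, and skipping a child whose own redacted flag is true is my assumption about how you would want to treat it.

```python
def flatten_nested(response):
    """Yield (parent_key, child_key, interactions, unique_authors) rows."""
    for parent in response["analysis"]["results"]:
        child = parent.get("child", {})
        if child.get("redacted", False):
            # Assumption: a redacted child carries no usable results.
            continue
        for row in child.get("results", []):
            yield (
                parent["key"],
                row["key"],
                row["interactions"],
                row["unique_authors"],
            )


# Trimmed copy of the nested response: two age ranges per gender.
sample = {
    "analysis": {
        "results": [
            {"key": "male", "child": {"redacted": False, "results": [
                {"key": "18-24", "interactions": 3201700, "unique_authors": 1656100},
                {"key": "25-34", "interactions": 3112000, "unique_authors": 1734500},
            ]}},
            {"key": "female", "child": {"redacted": False, "results": [
                {"key": "25-34", "interactions": 963500, "unique_authors": 722300},
                {"key": "18-24", "interactions": 823700, "unique_authors": 635800},
            ]}},
        ],
        "redacted": False,
    }
}

for gender, age, interactions, authors in flatten_nested(sample):
    print(f"{gender:<8}{age:<8}{interactions:>10}{authors:>10}")
```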

There are a couple of things to note here. First, not all targets are available for use as child targets. If a target is "low cardinality" it is likely to be available. For example, fb.author.gender is low cardinality, producing just three output values, so you could use it as a child target in a nested query. You can check whether a target is available as a child in nested queries on our PYLON targets page or in the Resource Information box on each target's own page.

For targets that are not available for use as child targets in nested queries you can always resort to the technique I mentioned above: make the top-level (non-nested) call to /pylon/analyze first, store the results, and then make further calls individually. For example, suppose you wanted a breakdown of topics mentioned in links found in stories. Neither of these targets is available as a child in nested queries:

  • fb.link
  • fb.topics.category

But you could perform a breakdown of categories by link using non-nested queries only:

  1. Call /pylon/analyze using fb.link.
  2. Store all the returned values.
  3. For each returned value, call /pylon/analyze again using fb.topics.category.
  4. Merge the results ready for visualization.
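
The four steps can be sketched in Python as follows. Treat this as an outline under stated assumptions rather than a reference implementation: analyze is a function you supply that POSTs a body to /pylon/analyze and returns the decoded JSON, and restricting each follow-up call with a CSDL "filter" string of the form target == "value" is my assumption about how to narrow the second query to one returned value.

```python
def breakdown_without_nesting(analyze, index_hash, parent_target,
                              child_target, threshold):
    """Emulate a nested query with one extra call per parent value.

    `analyze` is a caller-supplied function: it takes a request body
    (a dict) and returns the decoded JSON response (a dict).
    """
    def body(target, csdl_filter=None):
        request = {
            "hash": index_hash,
            "parameters": {
                "analysis_type": "freqDist",
                "parameters": {"threshold": threshold, "target": target},
            },
        }
        if csdl_filter is not None:
            # Assumption: the endpoint accepts a CSDL "filter" string.
            request["filter"] = csdl_filter
        return request

    # Steps 1 and 2: make the top-level call and keep the returned values.
    parents = analyze(body(parent_target))["analysis"]["results"]

    merged = {}
    for parent in parents:
        # Step 3: one further call per returned value, with quotes escaped.
        value = str(parent["key"]).replace("\\", "\\\\").replace('"', '\\"')
        response = analyze(body(child_target, f'{parent_target} == "{value}"'))
        # Step 4: merge the results, keyed by the parent value.
        merged[parent["key"]] = response["analysis"]["results"]
    return merged
```

Passing the HTTP call in as a function keeps the sketch independent of any particular client library and makes it easy to try out with canned responses. Bear in mind that each follow-up call is subject to redaction on its own.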

In summary, nested queries allow you to perform powerful analysis in a single API call. The example here used just one level of nesting, but PYLON allows two levels. You could extend the example call to perform gender breakdown, age breakdown, and then sentiment analysis in a single API call. Use our PYLON targets page to check which targets are available as child targets for nesting. And remember that if a target isn't available as a child target, you can still use it in a nested analysis query as long as you keep it at the top level of your query.
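
If you build queries in code, the nesting can be generated from a simple list of levels. This Python sketch is a hypothetical helper, not part of any DataSift client library; it assumes that a second level of nesting is expressed by placing another child key inside the first child, following the pattern of the one-level call in this post.

```python
import json


def build_nested_query(index_hash, levels, start=None):
    """Build a /pylon/analyze body from (target, threshold) pairs.

    `levels` is ordered outermost first.
    """
    if not levels:
        raise ValueError("at least one level is required")
    if len(levels) > 3:
        # One top-level query plus two nested levels, as stated in this post.
        raise ValueError("PYLON allows at most two levels of nesting")

    node = None
    # Work from the innermost level outwards so each level wraps its child.
    for target, threshold in reversed(levels):
        current = {
            "analysis_type": "freqDist",
            "parameters": {"threshold": threshold, "target": target},
        }
        if node is not None:
            current["child"] = node
        node = current

    body = {"hash": index_hash, "parameters": node}
    if start is not None:
        # Sent as a string, as in the calls in this post.
        body["start"] = str(start)
    return body


# Rebuilds the gender-then-age query used in this post.
query = build_nested_query(
    "521631745ae3368ed63f6764f5a11f8e",
    [("fb.author.gender", 2), ("fb.author.age", 9)],
    start=1447675740,
)
print(json.dumps(query, indent=4))
```

Appending a third pair to the list would add the sentiment level; look up the target name and confirm that it is allowed as a child on the PYLON targets page first.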

To find out more, take a look at our Analysis Using Nested Queries page.

