Filter Swapping (part 2)

Ed Stenson | 15th March 2016

Use Case: Hashtags

In part 1 of this blog post I introduced a new feature of PYLON called filter swapping. In this part I'll walk through a use case that reflects a real-world scenario.

Suppose you want to monitor a sector and find the top hashtags associated with it. Then you want to take those top hashtags and see how they're used in a wider context, across the whole of Facebook, not just in connection with the sector you chose. However, you know that the list of hashtags is likely to change over time, with some hashtags dropping away and new ones appearing. Effectively, the argument in your filter is a dynamic list.

A simple way to perform this analysis in PYLON uses two recordings; let's call them A and B.

Recording A

The goal of Recording A is to gather hashtags associated with the sector you're interested in. To illustrate the case I'm going to use our automotive index.

It's easy to write an interaction filter in CSDL to record data to an index. The CSDL is a simple text string that you can send to the /pylon/compile endpoint to compile it and receive a hash in return, and then to the /pylon/start endpoint to kick off a recording. It can be as simple as this:

fb.topics.category in "Cars, Automotive" or  fb.parent.topics.category in "Cars, Automotive"

Hashtags can often have local significance, so let's add an extra condition using the fb.author.country target to restrict our filtering to the US:

(fb.topics.category in "Cars, Automotive" or fb.parent.topics.category in "Cars, Automotive") and fb.author.country == "United States"

This additional restriction also reduces the likelihood that you'll use up your daily indexing allowance sooner than you want to.

I'm using simple cURL requests in this blog. When you're ready to begin writing production code, take a look at our client library documentation for your favorite language.

To submit this code to /pylon/compile:

curl -X POST https://api.datasift.com/v1.3/pylon/compile \
    -d '{"csdl": "(fb.topics.category in \"Cars, Automotive\" or fb.parent.topics.category in \"Cars, Automotive\") and fb.author.country == \"United States\""}' \
    -H 'Authorization: :' \
    -H "Content-type: application/json"

The endpoint returns the hash along with other useful information:

{
    "hash": "7263112790b6148ad3c91e7e17959bd4",
    "created_at": "1457542595",
    "operator_grouping": {
            "tag": {
                "medium": 0,
                "complex": 0,
                "keywords": 0
            },
            "return": {
                "medium": 0,
                "complex": 0,
                "keywords": 1
            }
    },
    "tokenizer_info": {
            "language": {
                "default": 1
            },
            "punctuation": []
    }
}

Then you'll need to pass the hash to the /pylon/start endpoint to start the recording:

curl -X POST https://api.datasift.com/v1.3/pylon/start \
    -d '{"hash": "7263112790b6148ad3c91e7e17959bd4"}' \
    -H 'Authorization: :' \
    -H "Content-type: application/json"

The endpoint returns an id for the recording. You'll need that when you want to go ahead with your filter swap. The object containing the id looks like this:

HTTP/1.1 200 OK

{
    "id": "d1b7d73b47c639ea3cc290595bca888ca4388afe"
}

Recording A is now running and your next step is to make analysis queries against it using the /pylon/analyze endpoint. One really important question is how long you should wait before you do that. PYLON has a built-in privacy feature called redaction. If there isn't a large enough pool of data to hide the identity of individual authors, redaction kicks in and your analysis queries won't return data. So Recording A needs to run for long enough to write a sufficient number of interactions, from a sufficient number of individual authors, to the index. However, the longer you run it, the greater the chance that it will become out of date before you even begin to use it.

A little experimentation is in order here. There's no single right answer because it depends on what you're filtering for but an initial guess might be to wait for 24 hours before you make a call to /pylon/analyze for the first time, and then hit it daily.

You can leave Recording A running perpetually and adjust your start and end parameters to select the interval you want to analyze. Remember that the recording contains more than just hashtags; it contains the things people are saying about the sector you're studying, so it has a lot of value quite apart from the role it plays in this use case. For example, you could run an analysis query against the fb.sentiment and fb.parent.sentiment targets to watch the ebb and flow of sentiment.
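The start and end values in analysis queries are Unix timestamps (you can see a pair in the example below). As a quick sketch in Python (the helper name is my own), here's one way to compute a rolling 24-hour window for daily queries:

```python
import datetime as dt

def last_24h_window(now=None):
    """Return (start, end) Unix timestamps covering the previous
    24 hours, for use as PYLON analysis query parameters."""
    end = now or dt.datetime.now(dt.timezone.utc)
    start = end - dt.timedelta(hours=24)
    return int(start.timestamp()), int(end.timestamp())

start, end = last_24h_window()  # the pair always spans exactly 86,400 seconds
```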

You need to present your analysis queries as JSON objects. Here's an analysis query that extracts hashtags from the index that Recording A populated:

curl -X POST https://api.datasift.com/v1.3/pylon/analyze \
    -d '{ "id": "521631745ae3368ed63f6764f5a11f8e", "parameters": { "analysis_type": "freqDist", "parameters": { "threshold": 10, "target": "fb.hashtags" } }, "start": "1457949600", "end": "1458036000" }' \
    -H 'Authorization: :' \
    -H "Content-type: application/json"

Notice that the threshold parameter is set to 10. This opens the door just wide enough to extract the top 10 hashtags. You can adjust this value, of course.

A successful call to /pylon/analyze returns a 200 OK response code and a JSON object containing the results of your analysis query. Here's an example of the JSON object:

{
    "interactions": 1604000,
    "unique_authors": 1291500,
    "analysis": {
            "analysis_type": "freqDist",
            "parameters": {
                "target": "fb.hashtags",
                "threshold": 10
            },
            "results": [{
                "key": "bmw",
                "interactions": 4100,
                "unique_authors": 3400
            }, {
                "key": "next100",
                "interactions": 1600,
                "unique_authors": 1100
            }, {
                "key": "cars",
                "interactions": 1200,
                "unique_authors": 700
            }, {
                "key": "car",
                "interactions": 700,
                "unique_authors": 300
            }, {
                "key": "mercedes",
                "interactions": 400,
                "unique_authors": 300
            }, {
                "key": "audi",
                "interactions": 400,
                "unique_authors": 300
            }, {
                "key": "lexus",
                "interactions": 300,
                "unique_authors": 100
            }, {
                "key": "infiniti",
                "interactions": 300,
                "unique_authors": 100
            }, {
                "key": "electriccar",
                "interactions": 300,
                "unique_authors": 200
            }, {
                "key": "volkswagen",
                "interactions": 300,
                "unique_authors": 200
            }],
            "redacted": false
    }
}

The hashtags are in the "key" field: bmw, next100, cars, car, mercedes, audi, lexus, infiniti, electriccar, volkswagen. For each of them you can see an interaction count and a unique author count, rounded according to PYLON's quantization rules. The bmw hashtag is one that you might naturally expect to appear on the list. The next100 hashtag is more surprising. It appears because BMW recently celebrated its centenary.

You might feel that the "cars" and "car" hashtags are too general and too similar to be helpful. If so, you could ignore them, set the threshold to 12, and run the query again to make sure you have 10 useful hashtags to investigate. The counts show the share of voice for each brand, just one illustration that this recording, designed to harvest hashtags, surfaces other interesting insights too.
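Once you've parsed the JSON response, pulling the hashtags out in code is straightforward. A minimal Python sketch (the function name is my own):

```python
def top_hashtags(analyze_response):
    """Extract the hashtag strings from a /pylon/analyze frequency
    distribution response, in the order returned (highest interaction
    count first)."""
    return [r["key"] for r in analyze_response["analysis"]["results"]]

# A trimmed-down version of the response shown above:
response = {
    "analysis": {
        "results": [
            {"key": "bmw", "interactions": 4100, "unique_authors": 3400},
            {"key": "next100", "interactions": 1600, "unique_authors": 1100},
        ]
    }
}
top_hashtags(response)  # → ["bmw", "next100"]
```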

Recording B

Let's take a step back to review what's been happening so far. First you wrote an interaction filter and set it running to record data to an index. Then you ran an analysis query against the index to discover the most popular hashtags.

Now you can write a new interaction filter for Recording B. For example:

fb.all.content any "bmw, next100, cars, car, mercedes, audi, lexus, infiniti, electriccar, volkswagen"

Each recording you start records to its own index so, when you run Recording B, PYLON creates a new index. Unlike the index that Recording A wrote to, which was important for 24 hours but became potentially disposable after that, Recording B needs to run perpetually to build up a body of information about stories that include the top hashtags plus the engagements on those stories. Now this recording contains information in a broader context than just the automotive sector. It gathers comments and engagements relating to the #electriccar hashtag, for example.
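Because this list of hashtags comes out of Recording A and will change over time, it's worth generating Recording B's interaction filter programmatically rather than editing it by hand. A minimal Python sketch (the helper name is my own):

```python
def build_hashtag_filter(hashtags):
    """Render a CSDL interaction filter that matches content containing
    any of the given hashtags, using the fb.all.content target."""
    return 'fb.all.content any "{}"'.format(", ".join(hashtags))

tags = ["bmw", "next100", "cars"]
build_hashtag_filter(tags)
# → 'fb.all.content any "bmw, next100, cars"'
```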

The index created by Recording B will persist. However, keep in mind that DataSift will delete any data that has been stored for more than 32 days. The recording itself will still remain and will still contain any data that is 'younger' than 32 days.

Once you begin to populate your index for Recording B you can then run further analysis queries. Take a look at the Facebook for Topic Data targets page to see which analysis targets you can use in your analysis queries. Remember that you can combine your analysis queries with query filters to narrow down the data before you analyze it. A query filter performs secondary filtering; that is, it filters the data from the index before you perform any analysis. For example, if you wanted to look at stories from women you would use the filter parameter to add a query filter:

curl -X POST https://api.datasift.com/v1.3/pylon/analyze \
    -d '{ "filter": "fb.author.gender == \"female\"", "id": "49ea352831788ae3368ed63f8764f5a", "parameters": { "analysis_type": "freqDist", "parameters": { "threshold": 10, "target": "fb.hashtags" } }, "start": "1457949600", "end": "1458036000" }' \
    -H 'Authorization: :' \
    -H "Content-type: application/json"

Now let's swap the filter in Recording B

Suppose that 24 hours have passed and you run another analysis query against Recording A to retrieve the latest hashtags. If the top 10 hashtags are unchanged you can simply leave Recording B running. There's no need to make any changes. But if the top 10 hashtags in your new Recording A are different you'll need to:

  1. Create a new interaction filter for Recording B, containing the latest set of hashtags.
  2. Hit /pylon/compile to compile the new CSDL code you want to use in your interaction filter.
  3. Hit /pylon/update to switch over to the new CSDL code.
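The steps above can be sketched as a small function. Here compile_csdl and update_recording stand in for your own wrappers around the /pylon/compile and /pylon/update endpoints (they're hypothetical helpers, not part of any official client library):

```python
def swap_if_changed(old_tags, new_tags, compile_csdl, update_recording,
                    recording_id):
    """If the top-hashtag list has changed, build a new interaction
    filter, compile it, and point the recording at the new hash.
    Returns the new hash, or None if no swap was needed."""
    # Compare as sets: a mere reordering of the top 10 doesn't change
    # what the filter matches, so there's no reason to recompile.
    if set(old_tags) == set(new_tags):
        return None  # Recording B keeps running as-is

    csdl = 'fb.all.content any "{}"'.format(", ".join(new_tags))
    new_hash = compile_csdl(csdl)              # step 2: /pylon/compile
    update_recording(recording_id, new_hash)   # step 3: /pylon/update
    return new_hash
```

Passing the endpoint wrappers in as arguments keeps the decision logic easy to exercise without touching the network.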

The CSDL code for your interaction filter for Recording B was:

fb.all.content any "bmw, next100, cars, car, mercedes, audi, lexus, infiniti, electriccar, volkswagen"

Suppose volkswagen falls out of the top 10 and alfaromeo takes its place. Your new interaction filter becomes:

fb.all.content any "bmw, next100, cars, car, mercedes, audi, lexus, infiniti, electriccar, alfaromeo"

The change is minor but it is significant because the new CSDL and the old CSDL have different hashes. When you pass the new CSDL to /pylon/compile it passes back the new hash. To swap the new code for the old, you just need to pass that new hash to the /pylon/update endpoint.

curl -X POST https://api.datasift.com/v1.3/pylon/update \
    -d '{"hash": "85572631490b6888ad3cc1e7e178e8e3", "id": "49ea352831788ae3368ed63f8764f5a"}' \
    -H 'Authorization: :' \
    -H "Content-type: application/json"

It takes just a few seconds for the new CSDL interaction filter to become active and start storing its results in Recording B's index.

Summing it all up

For the first time it's possible to update an interaction filter in PYLON while it's running.

For API versions up to v1.2 the /pylon/analyze endpoint required the hash of the CSDL interaction filter you wanted to analyze. From v1.3 /pylon/analyze takes the id of the recording instead. Changing the CSDL code of a filter changes the hash of the filter but the id of the recording remains the same. That means you don't need to change your calls to the /pylon/analyze endpoint each time you change your CSDL.


Previous post: Filter Swapping (part 1)

Next post: New Tokenized Targets for PYLON Query Filters