Filter Swapping

In this design pattern I'm going to look at the filter swap feature, which allows you to change the code of an interaction filter in PYLON without stopping and restarting the filter. It follows on from our recent blog post on filter swapping.

This design pattern references our example automotive recording which is available in the analysis sandbox. This recording captures audiences discussing three car brands and uses CSDL similar to the interaction filter shown in step 1 here. Contact your sales representative or account manager for access to the 'analysis sandbox'.

Why swap filters?

An interaction filter in PYLON records data to an index. There are many business situations when it would be useful to be able to change the definition of an interaction filter without stopping it, ensuring data flows into the index without interruption. For example:

  • You're monitoring the current top 20 movies throughout a year and you want to adjust your filter as new titles are released.
  • You're investigating top news stories from different countries around the world. As new stories break you want to focus on the countries where the most significant stories are coming from.
  • You're sampling data and you want to adjust your sample rate on the fly.
  • You're monitoring a brand and its competitors. You notice a competitor that you had not anticipated and you'd like to include it in your filter.

Let's look at each of these examples. For the movies, you might plan to perform the update regularly, perhaps weekly. For the news stories, you might make updates on an as-needed basis, each one triggered by an event somewhere in the world, often something that could not be foreseen. In the remaining two examples, you might only ever need to make one or two updates - the filter swap feature is a nice-to-have in these cases but not essential to your design.

Solution

Now that we've established the concepts let's take a look at a worked example. We'll create a new interaction filter and begin to record data to an index. Then we'll create a new version of the filter and swap that in to replace our original filter. It takes just four API calls to compile a filter and set it running and then swap the recording to a new filter. This table shows the API calls together with the input and output for each call.

Step 1. Compile the initial CSDL for the filter.
  Endpoint: /pylon/compile
  Input: A string containing the CSDL definition of the filter.
  Output: A unique hash that identifies the compiled filter.

Step 2. Start the filter. PYLON creates a new index for you and stores the results in that index.
  Endpoint: /pylon/start
  Input: The hash returned in step 1 and a name for the recording.
  Output: The id of the recording.

Step 3. Compile the CSDL for the new filter.
  Endpoint: /pylon/compile
  Input: A string containing the CSDL definition of the new filter.
  Output: A unique hash that identifies the new compiled filter.

Step 4. Swap the old CSDL definition out and the new CSDL definition in.
  Endpoint: /pylon/update
  Input: The id of the recording and the hash returned in step 3.
  Output: 204 No Content.

Step 1: Compile the initial CSDL for the filter

I'm using a simplified version of the interaction filter from our automotive example. I've left the CSDL filtering code unchanged but I've stripped out all the tagging logic:

( fb.parent.content contains_any "Ford, BMW, Honda" OR 
  fb.content contains_any "Ford, BMW, Honda" ) 
AND 
( fb.topics.category in "Cars, Automotive" OR 
  fb.parent.topics.category in "Cars, Automotive" ) 
AND 
( fb.author.country_code in "US" )

This CSDL specifies that we want to record stories and engagements from US-based users that mention any of three well-known brands and that fall into the Cars or Automotive topic categories.

Using the Python client library to compile the filter:

from datasift import Client

client = Client("DataSift username", "Identity API key")

csdl_1 = '(fb.parent.content contains_any "Ford, BMW, Honda" OR fb.content contains_any "Ford, BMW, Honda") \
          AND (fb.topics.category in "Cars, Automotive" OR fb.parent.topics.category in "Cars, Automotive") \
          AND (fb.author.country_code in "US")'

# Compile the initial filter
filter_1 = client.pylon.compile(csdl_1)

Step 2: Start the initial filter

The next step is to start the filter and record the results to an index.

# Start recording the initial filter
recording = client.pylon.start(filter_1['hash'], "Hotswap example 1")

We can hit the /pylon/get endpoint to verify that the recording is using the initial hash.

client.pylon.get(recording['id'])

The output looks like this:
{
  u'end': None,
  u'hash': u'01143c05e523b25225f647a3f4919720',
  u'id': u'635a68dd029892b89280792dafc8a61696dd4e59',
  u'identity_id': u'03e121a6455ea6b266f19ac23d769a0a',
  u'name': u'Hotswap example 1',
  u'reached_capacity': False,
  u'remaining_account_capacity': 754000,
  u'remaining_index_capacity': 1000000,
  u'start': datetime.datetime(2016, 4, 1, 16, 40, 5),
  u'status': u'running',
  u'volume': 0
}

Step 3: Compile the CSDL for the new filter

Now suppose we decide to shift the focus of our study from BMW to Fiat, leaving Ford and Honda unchanged. The new CSDL for our interaction filter becomes:

( fb.parent.content contains_any "Ford, Fiat, Honda" OR 
  fb.content contains_any "Ford, Fiat, Honda" ) 
AND 
( fb.topics.category in "Cars, Automotive" OR 
  fb.parent.topics.category in "Cars, Automotive" ) 
AND 
( fb.author.country_code in "US" )

We can compile this filter right now and swap it into place whenever we want. Compiling it simply generates a hash. It does not change our recording:

csdl_2 = '(fb.parent.content contains_any "Ford, Fiat, Honda" OR fb.content contains_any "Ford, Fiat, Honda") \
          AND (fb.topics.category in "Cars, Automotive" OR fb.parent.topics.category in "Cars, Automotive") \
          AND (fb.author.country_code in "US")'

# Compile the new filter
filter_2 = client.pylon.compile(csdl_2)

Step 4: Swap the new CSDL filter in

When we're ready, we hit the /pylon/update endpoint to swap the new code in:

# Update to the new filter
client.pylon.update(recording['id'], filter_2['hash'], "Hotswap example 2")

If we call /pylon/get again, as in step 2, we see that the name and hash both reflect the update. The id is a unique identifier for a recording and never changes:
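Steps 3 and 4 can be wrapped into a single helper. The following is only a minimal sketch: swap_filter is a hypothetical name, not part of the client library, and it assumes the client's pylon.compile, pylon.update and pylon.get calls take the same arguments as in the steps above.

```python
def swap_filter(client, recording_id, new_csdl, new_name):
    """Compile new_csdl and swap it into an existing recording.

    Hypothetical helper. It assumes `client` exposes pylon.compile,
    pylon.update and pylon.get as used in steps 1 to 4.
    """
    # Compiling only produces a hash; the recording is untouched so far.
    new_hash = client.pylon.compile(new_csdl)['hash']

    # Swap the new definition in. The recording id stays the same.
    client.pylon.update(recording_id, new_hash, new_name)

    # Read the recording back so the caller can confirm the new hash and name.
    return client.pylon.get(recording_id)
```

With the variables from the worked example, swap_filter(client, recording['id'], csdl_2, "Hotswap example 2") would perform steps 3 and 4 in one call.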

{
  u'end': None,
  u'hash': u'a3a5612ceacaa53e10a241ebd22a0442',
  u'id': u'635a68dd029892b89280792dafc8a61696dd4e59',
  u'identity_id': u'03e121a6455ea6b266f19ac23d769a0a',
  u'name': u'Hotswap example 2',
  u'reached_capacity': False,
  u'remaining_account_capacity': 753100,
  u'remaining_index_capacity': 1000000,
  u'start': datetime.datetime(2016, 4, 1, 16, 40, 5),
  u'status': u'running',
  u'volume': 0
}

Considerations

Here are a few things to keep in mind when you use filter swapping:

  • When you hit the /pylon/update endpoint there is a short delay while DataSift loads the new filter. Hit the /pylon/get endpoint if you want to check that a new filter is active.
  • Keep a note of the time you make any update. For example, suppose you run a filter to record stories about Coke for 24 hours and then update the filter to include engagements about Coke too. Any analysis you perform after the update will be influenced by the start and end parameters you choose. If you choose a window that falls in the first 24 hours, your analysis will not find any engagements because your index does not contain any for the time interval.
  • Additionally we recommend that you always specify start and end times for your analysis queries. This is particularly important when you use filter swapping because when you swap in a new CSDL filter the nature of your recording might change entirely.
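The first two considerations can be sketched in code. This is an illustration only: wait_for_hash and window_within_version are hypothetical helpers, the timeout and polling interval are arbitrary values, and the sketch assumes pylon.get returns a dictionary with a 'hash' key as in the output shown earlier.

```python
import time
from datetime import datetime, timezone


def wait_for_hash(client, recording_id, expected_hash, timeout=60.0, interval=2.0):
    """Poll /pylon/get until the recording reports expected_hash.

    Returns the UTC time at which the new hash was first seen, so it can
    be noted down for later analysis. Raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if client.pylon.get(recording_id)['hash'] == expected_hash:
            return datetime.now(timezone.utc)
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "recording %s did not switch to %s" % (recording_id, expected_hash))
        time.sleep(interval)


def window_within_version(swap_times, version, end):
    """Return a (start, end) window that covers a single filter version.

    swap_times[0] is when the recording started and swap_times[i] is when
    the i-th swap became active; `end` caps the window for the latest version.
    """
    start = swap_times[version]
    if version + 1 < len(swap_times):
        # Stop at the next swap so the window never spans two filters.
        end = min(end, swap_times[version + 1])
    return start, end
```

Keeping the returned timestamps in a list gives you the swap_times needed to pick start and end values that stay within one version of the filter.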

When PYLON originally launched, a recording was identified by the hash of the CSDL filter. If you wanted to change the CSDL you had to send it to the /pylon/compile endpoint, which returned a new hash, and then start a new recording with that hash. The latest release of PYLON, which I've described in this design pattern, brings a new API version, v1.3, and a slightly different way of working. In v1.3, every CSDL filter still has a unique hash but we now give each recording a unique id which never changes. You supply this id to the /pylon/update endpoint whenever you want to change the CSDL code for a recording.

Additional examples

Two other situations where filter swapping is useful are changing sampling rates and changing tagging logic.

Changing sampling rates

We can reduce the sampling rate when the volume is high and we want to avoid hitting the daily limit of one million interactions in our index.

( fb.parent.content contains_any "Ford, BMW, Honda" OR 
  fb.content contains_any "Ford, BMW, Honda" ) 
AND 
interaction.sample < 0.5

Here the sampling rate is just 0.5%. This is useful while traffic is consistently high. However, many events show a long tail following an initial burst of activity: think of the Super Bowl, a major election, or a sudden meteorological event. As volume falls away, a sampling rate that was initially effective might end up limiting the volume too much and causing redaction. Filter swapping is useful in these cases as a way to 'tune' volume over time.
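As a rough illustration of tuning, the sketch below derives a sampling percentage from an expected daily volume and appends it to a filter. Both function names are hypothetical, the volume figure in the usage note is invented for the example, and with_sampling only suits a plain filter without tag rules or a return block.

```python
def sample_rate_for(expected_daily_volume, daily_limit=1000000):
    """Percentage (0 to 100) that keeps the expected volume within daily_limit."""
    if expected_daily_volume <= 0:
        raise ValueError("expected_daily_volume must be positive")
    return min(100.0, 100.0 * daily_limit / expected_daily_volume)


def with_sampling(base_csdl, percent):
    """Wrap base_csdl in parentheses and AND it with an interaction.sample clause."""
    # Four decimal places are written out, so reject anything that would round to zero.
    if not 0.0001 <= percent <= 100:
        raise ValueError("percent must be between 0.0001 and 100")
    return "( %s ) AND interaction.sample < %.4f" % (base_csdl, percent)
```

For example, if we expected 200 million matching interactions per day, sample_rate_for(200000000) gives 0.5, matching the filter above. As the long tail sets in we would recompute the rate, build the new CSDL with with_sampling, compile it and swap it in.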

Changing tagging logic

Another common use case for filter swapping is when you want to add or update the tagging logic for a filter. For example, here is the updated CSDL, now with some tagging logic.

tag.automotive.brand "Ford"  { fb.parent.content any "Ford"  OR fb.content any "Ford"  } 
tag.automotive.brand "Fiat"  { fb.parent.content any "Fiat"  OR fb.content any "Fiat"  } 
tag.automotive.brand "Honda" { fb.parent.content any "Honda" OR fb.content any "Honda" } 

return { 
   ( fb.parent.content any "Ford, Fiat, Honda" OR 
     fb.content any "Ford, Fiat, Honda" ) 
AND 
   ( fb.topics.category in "Cars, Automotive" OR 
     fb.parent.topics.category in "Cars, Automotive" ) 
AND 
   ( fb.author.country_code in "US" ) 
}

To swap this in we simply compile it and then hit the update endpoint once again.
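If the list of brands changes often, the tagged CSDL can be generated and not edited by hand. This is a sketch under assumptions: build_tagged_csdl is a hypothetical helper that reproduces the structure of the CSDL above, and it expects brand names that contain no commas or double quotes.

```python
def build_tagged_csdl(brands, country="US"):
    """Build CSDL with one tag.automotive.brand rule per brand plus a return block."""
    if not brands:
        raise ValueError("at least one brand is required")

    # One tag rule per brand, in the same form as the example above.
    tag_lines = [
        'tag.automotive.brand "%s" { fb.parent.content any "%s" OR fb.content any "%s" }'
        % (b, b, b)
        for b in brands
    ]

    # The return block filters on the full comma-separated brand list.
    brand_list = ", ".join(brands)
    return_block = (
        'return {\n'
        '  ( fb.parent.content any "%s" OR fb.content any "%s" )\n'
        '  AND ( fb.topics.category in "Cars, Automotive" OR '
        'fb.parent.topics.category in "Cars, Automotive" )\n'
        '  AND ( fb.author.country_code in "%s" )\n'
        '}' % (brand_list, brand_list, country)
    )
    return "\n".join(tag_lines) + "\n\n" + return_block
```

The string returned by build_tagged_csdl(["Ford", "Fiat", "Honda"]) can be passed to the compile call and the resulting hash to the update call, exactly as in steps 3 and 4.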

Resources

For further reading, take a look at the following resources: