Recording Data

Before you can create analysis results you need to record data into an index. In this guide you'll learn how to filter and record data ready for analysis.

Recording data for analysis

When you work with PYLON you start by recording data you'd like to analyze into an index. Interaction filters define what data you would like to record from data sources. Running an interaction filter records any matching data from the source into an index of data.

When you define your interaction filters keep these goals in mind:

  • Coverage - You need to capture all of the data you require for the analysis you have planned. You'll often find that once you receive your first results you'll want to improve your filter to include topics you hadn't considered.
  • Minimum noise - When you analyze your data set you want it to be as clean as possible.
  • Within limits - Although you want to capture a broad set of data, you also want to stay within your limits. If you hit limits, no more data will be recorded for the day and you'll have a gap in your data set.
  • Maintenance - Over time your requirements will change and terms used in discussions will evolve.

Defining an interaction filter

You use CSDL to define the conditions of your interaction filter. CSDL works by inspecting the values of data fields (targets) of interactions that arrive from a source.

This simple filter selects Facebook stories that mention well-known car brands:

fb.content contains_any "BMW, Honda, Ford"

There are many targets available and many operators you can use in your conditions. You can combine up to 1,000 conditions using logical operators in one interaction filter definition. Our CSDL reference explains more about CSDL syntax.
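
If you keep your keywords in a list, you can generate conditions like the one above rather than typing them by hand. The following is a minimal sketch only: the helper names `contains_any` and `any_of` are hypothetical, and it does not escape quotes or commas inside keywords.

```python
def contains_any(target, keywords):
    """Build a CSDL contains_any condition from a list of keywords."""
    # CSDL takes the keyword list as a single comma-separated string.
    cleaned = [k.strip() for k in keywords if k.strip()]
    if not cleaned:
        raise ValueError("at least one keyword is required")
    return f'{target} contains_any "{", ".join(cleaned)}"'


def any_of(*conditions):
    """Combine conditions with the OR logical operator, in parentheses."""
    return "(" + " OR ".join(conditions) + ")"


brands = ["BMW", "Honda", "Ford"]
condition = contains_any("fb.content", brands)
print(condition)
```

Generating conditions this way makes it easier to maintain long keyword lists as your requirements change.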

We can extend the example by first requiring that stories are created by authors in the US, and then matching stories that either contain the keywords or are identified as being about the topic 'Cars':

fb.author.country == "United States"
AND ( fb.content contains_any "BMW, Honda, Ford"
    OR fb.topics.category in "Cars" )

When you record and analyze Facebook topic data for most use cases you'll want to record both stories and engagements. You can read more about the data available in our introducing Facebook topic data guide.

To extend our example filter to record both stories and engagements we can add conditions based on targets in the fb.parent.* namespace:

fb.author.country == "United States"
AND (
    (fb.content contains_any "BMW, Honda, Ford" OR fb.topics.category in "Cars")
    OR
    (fb.parent.content contains_any "BMW, Honda, Ford" OR fb.parent.topics.category in "Cars")
)

Take a look at our example interaction filters to see how to work with different targets.
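
Because the engagement conditions mirror the story conditions, you may prefer to generate both from one set of keywords and topics so they cannot drift apart. This is a sketch under assumptions: the function name is hypothetical, and the target names `fb.author.country`, `fb.topics.category` and `fb.parent.topics.category` are assumed here, so check them against the targets reference before use.

```python
def story_and_engagement_filter(keywords, topics, country):
    """Build CSDL matching stories and engagements on matching stories."""
    kw = ", ".join(keywords)
    tp = ",".join(topics)

    def clause(prefix):
        # The same conditions are applied to the story (fb.*) and to the
        # parent story of an engagement (fb.parent.*).
        return (
            f'({prefix}.content contains_any "{kw}" '
            f'OR {prefix}.topics.category in "{tp}")'
        )

    return (
        f'fb.author.country == "{country}"\n'
        "AND (\n"
        f"    {clause('fb')}\n"
        "    OR\n"
        f"    {clause('fb.parent')}\n"
        ")"
    )


csdl = story_and_engagement_filter(
    ["BMW", "Honda", "Ford"], ["Cars"], "United States"
)
print(csdl)
```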

Recording data to an index

To record data you need to compile an interaction filter and then start a recording. The dashboard is a good place to start exploring, but for all production work you should be using the API.

When you compile an interaction filter you will be given a hash by the platform. This hash is unique for the filter definition. When you start a recording the platform gives you a new id for the recording which you use to reference the recording (or index) in future.

A new index will be created for each recording. You can run many recordings in parallel and create many indexes of data as long as you don't exceed your account limits.

Starting recordings using the dashboard

  1. Log in to the dashboard.

  2. Click the PYLON tab, and then click My PYLON Filters.

  3. Click Create a Filter.

  4. In Name, type a name for the interaction filter.

  5. Make sure CSDL Code Editor is selected, and click Start Editing.

  6. In the code editor paste in the example CSDL from above.

  7. Click Save & Close. This will validate and compile the filter for you.

  8. Click Record Filter to start recording data to an index.

  9. You can stop your recording at any time. Click the PYLON tab, and then click My Recordings. Here you will find a list of your recordings which you can Pause and Stop.

Starting recordings using the API

We recommend you use one of our client libraries to work with the API. This example uses the Python library:

  1. Import the client library and create a client object to call the API.

    from datasift import Client
    datasift = Client("your username", "valid identity API key")

  2. Call the /pylon/compile endpoint with the CSDL for your filter.

    csdl = '''
    fb.author.country == "United States"
    AND (
      // For stories
      (fb.content contains_any "BMW, Honda, Ford" OR fb.topics.category in "Cars")
      OR
      // For engagements
      (fb.parent.content contains_any "BMW, Honda, Ford" OR fb.parent.topics.category in "Cars")
    )
    '''
    compiled = datasift.pylon.compile(csdl)

  3. Call the /pylon/start endpoint with a name for your recording. Note that the response from the compile endpoint contains a hash which we use to start the recording.

    recording = datasift.pylon.start(compiled['hash'], 'Automotive example')
    print('Recording started!')

  4. To stop your recording you can call the /pylon/stop endpoint.

    datasift.pylon.stop(recording['id'])
    print('Recording stopped.')

  5. To check the status of your recording you can call the /pylon/get endpoint.
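
The compile and start steps above can be wrapped in one small helper. This is a sketch rather than part of the client library: `start_recording` is a hypothetical name, and it assumes the responses expose the `hash` and `id` fields used in the steps above.

```python
def start_recording(client, csdl, name):
    """Compile an interaction filter and start a recording with it.

    `client` is a datasift Client created as in step 1.
    Returns the filter hash and the id of the new recording.
    """
    compiled = client.pylon.compile(csdl)
    recording = client.pylon.start(compiled["hash"], name)
    return compiled["hash"], recording["id"]


# Usage, assuming `datasift` is the client object from step 1:
#   filter_hash, recording_id = start_recording(
#       datasift, csdl, "Automotive example")
```

Keeping both values is useful: you need the recording id to stop or update the recording, and the hash tells you which filter definition it is running.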


What have I recorded?

When you start recording data, your first question will be 'What have I managed to record?'. Take a look at our Managing Recordings guide to learn how to inspect what's inside your index.

Limits on recordings

When you record interactions you need to stay within platform and account limits:

  • Your account package specifies the maximum number of recordings you can run at any time.
  • Your account package specifies the maximum number of interactions you can record in a month, which is enforced as a daily limit.
  • You can record a maximum of 1 million interactions into an index each day.

Platform and account limits are detailed on the Platform Allowances page.
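
To see how the limits interact, here is a rough sketch of the arithmetic for a single recording. It assumes that one recording uses your whole account allowance, and `account_daily_limit` is a hypothetical value you would take from your own package, not something the platform returns under that name.

```python
INDEX_DAILY_CAP = 1_000_000  # per-index daily maximum stated above


def remaining_today(recorded_today, account_daily_limit):
    """Interactions a single index can still record today.

    The effective cap is the lower of the per-index cap and the
    account's daily limit (assumption: only one recording is running).
    """
    cap = min(INDEX_DAILY_CAP, account_daily_limit)
    return max(0, cap - recorded_today)


print(remaining_today(250_000, 600_000))
```

If the result reaches zero, no more data is recorded for the day, so a filter that regularly hits it needs narrowing.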

Improving your interaction filter

Once you've started to record data you'll want to improve your filter to capture more relevant data and remove as much noise as possible.

Broadening your filter

When you explore the data in your index you'll identify further topics and terms that will help you broaden your filter to capture more data.

Keeping to our automotive example, you might find that related topics arise such as 'Driving'. You could add these to your filter:

(fb.content contains_any "BMW, Honda, Ford" OR fb.topics.category in "Cars,Driving")

It's important that when you broaden your filter in this way you keep an eye on how much the volume of your recording changes. If the volume increases suddenly then you may have introduced a term that captures a lot of noise that you do not want.
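
One simple way to watch for this is to compare the daily volume before and after the change. The sketch below is illustrative only: the function names are hypothetical, the 50% threshold is an arbitrary assumption, and you would supply the volumes yourself from your recording's statistics.

```python
def volume_change(before, after):
    """Relative change in daily volume; 0.5 means a 50% increase."""
    if before <= 0:
        raise ValueError("before must be a positive volume")
    return (after - before) / before


def looks_noisy(before, after, max_increase=0.5):
    """Flag a filter change whose volume increase exceeds the threshold."""
    return volume_change(before, after) > max_increase


# Daily volumes before and after adding a new term (made-up numbers).
print(looks_noisy(1000, 2500))
```

A flagged change is not necessarily bad, but it is a prompt to inspect what the new term is matching.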

Removing noise

When you explore your data you will find noise (spam) that you want to exclude to improve the quality of your data set.

You can use analysis queries to test potential improvements to your interaction filter. For example, if you find links shared in your data that suggest spam, you could test excluding them with the following analysis query:

datasift.pylon.analyze([id for your recording], {
    "threshold": 4,
    "filter": "NOT links.domain IN \"\""
})

If the analysis filter improves your results then you can add the condition to your interaction filter.
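
If you collect suspicious domains as you explore, you could build the exclusion condition from that list. This is a minimal sketch: `exclude_domains` is a hypothetical helper and the domains shown are placeholders, not real spam sources.

```python
def exclude_domains(domains):
    """Build a CSDL condition that excludes links to the given domains."""
    # Normalise, de-duplicate and sort so the same list always
    # produces the same condition.
    cleaned = sorted({d.strip().lower() for d in domains if d.strip()})
    if not cleaned:
        raise ValueError("at least one domain is required")
    return f'NOT links.domain IN "{",".join(cleaned)}"'


# Placeholder domains for illustration only.
condition = exclude_domains(["spam.example", "Ads.example", "spam.example"])
print(condition)
```

You can try the resulting string as an analysis filter first, then add it to your interaction filter once you are happy with the results.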

You can also check the accuracy of your filter by accessing super public data samples. Read our design pattern to learn more.

Updating the filter used for a recording

When you start exploring a new project you'll want to test changes to your interaction filter. You'll also need to update interaction filters for live projects as requirements change over time or the audience you're studying changes the topics and keywords they are discussing.

When you update the CSDL for an interaction filter and compile it, the hash for the filter changes because the CSDL definition has changed. You use the new hash to update the filter definition used by a recording.

Updating a filter using the API

  1. Call the /pylon/compile endpoint with your new CSDL.

    csdl = '''
    fb.author.country == "United States"
    AND (
      // For stories
      (fb.content contains_any "BMW, Honda, Ford" OR fb.topics.category in "Cars, Driving")
      OR
      // For engagements
      (fb.parent.content contains_any "BMW, Honda, Ford" OR fb.parent.topics.category in "Cars, Driving")
    )
    '''
    compiled = datasift.pylon.compile(csdl)

  2. Call the /pylon/update endpoint.

    datasift.pylon.update([id for your recording], compiled['hash'], 'Automotive example')
    print('Recording updated!')

  3. To check the status of your recording you can call the /pylon/get endpoint.

    datasift.pylon.get([id for your recording])

When you update a filter for a recording the change will be applied in seconds. The platform in fact runs the old filter and new filter briefly in parallel so that there is no gap in the recording and handles deduplication for you.
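
Since the hash only changes when the CSDL definition changes, you can skip the update call when nothing is different. The helper below is a sketch, not part of the client library: `update_filter` is a hypothetical name, and it assumes you kept the hash returned when the recording was started.

```python
def update_filter(client, recording_id, current_hash, new_csdl, name):
    """Compile new CSDL and update the recording only if the hash changed.

    `client` is a datasift Client. Returns True if an update was sent.
    """
    compiled = client.pylon.compile(new_csdl)
    if compiled["hash"] == current_hash:
        # Same definition as the running filter, so there is nothing to do.
        return False
    client.pylon.update(recording_id, compiled["hash"], name)
    return True


# Usage, assuming `datasift`, `recording` and `compiled` from earlier steps:
#   update_filter(datasift, recording["id"], compiled["hash"],
#                 new_csdl, "Automotive example")
```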

Reusing interaction filter definitions

With CSDL you can save a definition then import it into another definition, using the stream keyword.

For example, you could create a reusable noise filter. Here's example CSDL that selects stories (and engagements on stories) that are fewer than 1,000 characters long and contain no run of 25 consecutive capital letters or whitespace characters:

not ( 
    (fb.content regex_exact ".{1000,}" or fb.parent.content regex_exact ".{1000,}")
    OR (fb.content regex_partial "[A-Z\\s]{25}" or fb.parent.content regex_partial "[A-Z\\s]{25}")
)
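
To get a feel for what these two patterns catch, you can approximate the rule locally on sample text. This is only a sketch using Python's `re` module, which may not behave identically to the CSDL regex operators; treating `regex_exact` as a full match and `regex_partial` as a search, and letting `.` match newlines, are assumptions.

```python
import re

# Content of 1,000 characters or more (whole-text match).
TOO_LONG = re.compile(r".{1000,}", re.DOTALL)
# A run of 25 consecutive capital letters or whitespace characters.
SHOUTING = re.compile(r"[A-Z\s]{25}")


def is_noise(text):
    """Return True if the text would be excluded by the noise rule."""
    return bool(TOO_LONG.fullmatch(text) or SHOUTING.search(text))


print(is_noise("I love my new Ford"))
print(is_noise("BUY NOW BUY NOW BUY NOW BUY NOW"))
```

Note that the character class includes whitespace, so a long run of spaces or line breaks counts as noise in the same way as a run of capitals.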

You can compile this interaction filter using the /pylon/compile endpoint or the dashboard as described above. This will give you a hash for the filter that you can use in another filter. For example, we could create a filter looking for stories about cars that excludes noise:

fb.topics.category == "Cars" AND stream [hash]
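
When you build filters in code, you can append the saved definition in the same way. This is a sketch only: `with_saved_filter` is a hypothetical helper, and writing the hash as a quoted string after the stream keyword is an assumption you should confirm against the CSDL reference.

```python
def with_saved_filter(condition, saved_hash):
    """Combine a condition with a saved filter via the stream keyword."""
    if not saved_hash:
        raise ValueError("a filter hash is required")
    # Assumption: the hash is written as a quoted string after `stream`.
    return f'{condition} AND stream "{saved_hash}"'


# "[hash]" stands in for the hash returned when you compiled the filter.
print(with_saved_filter('fb.topics.category == "Cars"', "[hash]"))
```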

Next steps...

Now that you know more about recording data we recommend you take a look at the following resources:

  • Managing recordings - learn more about how to manage and monitor your recordings
  • Introducing Facebook topic data - our guide to the data model for Facebook topic data
  • Targets - the targets you can use when creating interaction filters
  • CSDL - learn more about the syntax of CSDL and operators you can use in your filters
  • Example filters - take a look at some example filters to get more ideas for your project