Optimizing Migrated CSDL Filters

The guide to translating PowerTrack rules to CSDL explains how you can migrate your PowerTrack rules to the DataSift platform. This guide explains how you can optimize the migrated filters to reduce your ongoing platform costs.

Charges for using DataSift STREAM

When using DataSift STREAM you are charged for:

  • the data you consume, known as data licensing.
  • the processing effort required to execute your filters.

In this guide we'll focus on the processing costs and assume you are consuming the data you need from the sources you have selected.

The processing effort depends on the complexity of your CSDL filters. The complexity of a filter depends on a number of factors, including:

  • the number of conditions
  • the number of targets referenced
  • the operators used in your conditions
  • the number of arguments specified for each condition
  • the number of tagging rules used

When writing CSDL it is possible to express the same filter in a number of ways using different operators and combinations of conditions. Depending on how you express your filter the cost will vary.

When you save, validate, or compile a filter the platform returns a rating for the complexity of the filter expressed in data processing units (DPUs). DPUs express the hourly processing effort required to run a filter. If your filter is rated at 1 DPU, this means it costs 1 DPU for you to run the filter for one hour.

Your account package will include a number of DPUs that you can consume within a month. By optimizing your filters you can run more filters and get more from the platform within your monthly allowance.
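As a rough illustration of the arithmetic, the sketch below multiplies a filter's hourly DPU rating by the number of hours it runs. The function name and the 30-day month are assumptions made for this example, not platform definitions.

```python
def dpu_hours(dpu_rating: float, hours: float) -> float:
    """DPUs consumed by one filter: hourly rating multiplied by hours run."""
    return dpu_rating * hours


# A filter rated at 0.9 DPU, running continuously for an assumed 30-day month.
HOURS_IN_MONTH = 24 * 30
monthly = dpu_hours(0.9, HOURS_IN_MONTH)
print(f"{monthly:.1f} DPUs")  # prints "648.0 DPUs"
```

Comparing this figure against your monthly allowance shows how many filters of similar complexity you could run side by side.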

You can read more about charges on our billing page.

Viewing the cost of a filter

You can see the DPU cost of a filter when you save it using the dashboard:

[Image: filter-saved]

You can see the cost of running this filter is 0.9 DPUs.

If you scroll down the same page you can look at the Filter Breakdown which shows you how the cost of the filter breaks down by operator and target:

[Image: filter-breakdown]

When using the API you can use the /validate endpoint to calculate the DPU cost for a filter before you compile the CSDL. If you have already compiled the CSDL you can use the /dpu endpoint, specifying the hash of the filter you'd like to see costs for.

The /dpu endpoint gives you the most detail, showing you the cost broken down by each operator in your filter:

{
  "dpu": 0.9,
  "detail": {
    "in": {
      "count": 37,
      "dpu": 0.7,
      "targets": {
        "interaction.content": {
          "count": 4,
          "dpu": 0.2
        },
        "links.domain": {
          "count": 3,
          "dpu": 0.1
        },
        "tumblr.reblogged.root.url": {
          "count": 30,
          "dpu": 0.4
        }
      }
    }...
  }
}
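Once you have a response like the one above, a few lines of code can show which targets contribute most to the cost. This is a minimal sketch that only processes an already-parsed response; the function name rank_targets is hypothetical, and the sample dictionary contains just the fields shown above (the real response continues where the example is truncated).

```python
def rank_targets(response: dict) -> list[tuple[str, str, int, float]]:
    """Return (operator, target, count, dpu) rows, most expensive first."""
    rows = []
    for operator, info in response.get("detail", {}).items():
        for target, stats in info.get("targets", {}).items():
            rows.append((operator, target, stats["count"], stats["dpu"]))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Only the part of the response shown above.
sample = {
    "dpu": 0.9,
    "detail": {
        "in": {
            "count": 37,
            "dpu": 0.7,
            "targets": {
                "interaction.content": {"count": 4, "dpu": 0.2},
                "links.domain": {"count": 3, "dpu": 0.1},
                "tumblr.reblogged.root.url": {"count": 30, "dpu": 0.4},
            },
        }
    },
}

for operator, target, count, dpu in rank_targets(sample):
    print(f"{operator:<4}{target:<28}{count:>4}{dpu:>6.1f}")
```

In this sample the tumblr.reblogged.root.url target comes first, so its list of 30 arguments is the natural place to start looking for savings.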

Optimizing individual CSDL filters

You can start optimizing your CSDL by looking at filters individually.

If you've made use of our PowerTrack to CSDL translation tool you will be given CSDL filters that work but need to be optimized to make best use of your DPU allowance.

For example a simple PowerTrack rule:

(audi OR bmw OR honda OR url_contains:"audi.com" OR url_contains:"bmw.com" OR url_contains:"honda.com")

Will be translated by the tool to the following CSDL:

(interaction.content CONTAINS "audi" OR 
(interaction.content CONTAINS "bmw" OR 
(interaction.content CONTAINS "honda" OR 
(links.url SUBSTR "audi.com" OR 
(links.url SUBSTR "bmw.com" OR 
links.url SUBSTR "honda.com")))))

This CSDL can be optimized (whilst selecting exactly the same interactions) to:

interaction.content contains_any "audi, bmw, honda" 
OR links.domain in "audi.com, bmw.com, honda.com"

The tool translates the PowerTrack rule as is, one condition at a time. However, CSDL allows you to be more expressive. For example, the contains_any operator allows you to look for any of a number of words and phrases in the content. Specifying one condition with three arguments costs fewer DPUs than specifying three separate conditions that each search for one word using the contains operator.

The cost of operators is listed on the billing page.

Reducing the number of conditions

The more conditions in your CSDL filter, the more expensive your filter will be to run.

Take this example filter:

(interaction.content contains "audi" AND links.domain == "audi.com") 
OR 
(interaction.content contains "audi" AND links.url contains_any "/reviews,/car-reviews")

This filter selects content that mentions Audi and links to the Audi website, plus content that mentions Audi and contains a link to a review. The filter contains four conditions.

You could rewrite this filter to use only three conditions:

interaction.content contains "audi" AND 
( 
    links.domain == "audi.com" OR links.url contains_any "/reviews,/car-reviews" 
)

For a simple filter like this the cost difference is negligible, but the principle matters: when your rules contain a large number of conditions, this kind of refactoring can make a large difference to the cost of a filter.
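The rewrite relies on the identity (A AND B) OR (A AND C) = A AND (B OR C). The sketch below checks that identity over every combination of truth values; a, b and c are stand-ins for the three distinct conditions in the example, not real CSDL evaluation.

```python
from itertools import product


def original(a: bool, b: bool, c: bool) -> bool:
    # (content contains "audi" AND domain == "audi.com")
    # OR (content contains "audi" AND url contains_any "/reviews,/car-reviews")
    return (a and b) or (a and c)


def refactored(a: bool, b: bool, c: bool) -> bool:
    # content contains "audi" AND (domain == "audi.com" OR url contains_any ...)
    return a and (b or c)


# Compare both forms for all eight combinations of the three conditions.
equivalent = all(
    original(a, b, c) == refactored(a, b, c)
    for a, b, c in product([False, True], repeat=3)
)
print(equivalent)  # prints True
```

Because the two forms agree for every input, the refactored filter selects the same interactions while evaluating one condition fewer.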

Using operators that accept multiple arguments

Often you will want to match lists of keywords, phrases, and links in your filters. Where possible you should look to use the contains_any and in operators to scale your filters and avoid large DPU costs.

If your original PowerTrack rule matches a list of keywords and phrases in the content of posts:

(audi OR bmw OR "mercedes-benz" OR "alfa romeo")

This will be translated by the online tool to:

(interaction.content CONTAINS "audi" OR 
(interaction.content CONTAINS "bmw" OR 
(interaction.content CONTAINS "mercedes-benz" OR 
interaction.content CONTAINS "alfa romeo")))

This filter contains four conditions, yet you could rewrite it to use one condition:

interaction.content contains_any "audi,bmw,mercedes-benz,alfa romeo"

Again, for a small number of terms the saving is small, but when you have many terms in a production application this optimization makes a big difference.
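If you keep your keywords in a list, you can generate the single condition rather than writing it by hand. The helper below is a hypothetical sketch, not part of any DataSift library. It assumes terms are joined with commas as in the example above, and it refuses terms that contain a comma or a double quote rather than guessing at escaping rules.

```python
def contains_any(target: str, terms: list[str]) -> str:
    """Build one contains_any condition from a list of keywords or phrases."""
    cleaned = []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # ignore empty entries
        if "," in term or '"' in term:
            # Escaping is out of scope for this sketch, so refuse instead.
            raise ValueError(f"term needs escaping: {term!r}")
        if term not in cleaned:
            cleaned.append(term)  # drop duplicates, keep the original order
    if not cleaned:
        raise ValueError("at least one term is required")
    joined = ",".join(cleaned)
    return f'{target} contains_any "{joined}"'


brands = ["audi", "bmw", "mercedes-benz", "alfa romeo"]
print(contains_any("interaction.content", brands))
# prints: interaction.content contains_any "audi,bmw,mercedes-benz,alfa romeo"
```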

The contains_any operator allows you to match keywords and phrases in a string. The in operator allows you to match exact values.

This PowerTrack rule which matches links:

url_contains:"blog1.tumblr.com" OR url_contains:"blog2.tumblr.com" OR url_contains:"blog3.tumblr.com"

Could be written as follows in CSDL:

(links.domain == "blog1.tumblr.com" OR 
(links.domain == "blog2.tumblr.com" OR 
links.domain == "blog3.tumblr.com"))

The optimized CSDL would instead use the in operator:

links.domain in "blog1.tumblr.com, blog2.tumblr.com, blog3.tumblr.com"

Maintaining logical operators

Any optimization must preserve the logic of the original filter.

This PowerTrack rule:

(audi OR bmw OR "mercedes-benz") -"alfa romeo"

Can be translated to CSDL as:

((interaction.content CONTAINS "audi" OR 
(interaction.content CONTAINS "bmw" OR 
interaction.content CONTAINS "mercedes-benz")) AND 
NOT(interaction.content CONTAINS "alfa romeo"))

Then optimized as:

interaction.content CONTAINS_ANY "audi,bmw,mercedes-benz" 
AND NOT interaction.content CONTAINS "alfa romeo"

We cannot optimize this filter further as we need to maintain the exclusion of content containing the phrase "alfa romeo".

Optimizing across multiple CSDL filters

You can also make significant optimizations by refactoring and merging individual rules.

Due to the different models supported by the PowerTrack and DataSift STREAM products, your translated rules are likely to contain common conditions that you can refactor out.

Refactoring common filter conditions

For example, if you have the following four rules in PowerTrack:

audi a1 url_contains:"audi.com"
audi a3 url_contains:"audi.com"
bmw "1 series" url_contains:"bmw.com"
bmw "3 series" url_contains:"bmw.com"

Translating each rule in turn would give four CSDL filters:

interaction.content contains_all "audi,a1" AND links.domain == "audi.com"

interaction.content contains_all "audi,a3" AND links.domain == "audi.com"

interaction.content contains_all "bmw,1 series" AND links.domain == "bmw.com"

interaction.content contains_all "bmw,3 series" AND links.domain == "bmw.com"

Notice these filters are already optimized to use the contains_all operator instead of two individual conditions using contains.

Refactoring out the common conditions would result in two filters:

links.domain == "audi.com" AND 
( 
    interaction.content contains_all "audi,a1" 
    OR interaction.content contains_all "audi,a3" 
)

links.domain == "bmw.com" AND 
( 
    interaction.content contains_all "bmw,1 series" 
    OR interaction.content contains_all "bmw,3 series" 
)

These two filters could be combined into one filter:

( 
     links.domain == "audi.com" AND 
     ( 
        interaction.content contains_all "audi,a1" 
        OR interaction.content contains_all "audi,a3" 
    ) 
) 
OR 
( 
    links.domain == "bmw.com" AND 
    ( 
        interaction.content contains_all "bmw,1 series" 
        OR interaction.content contains_all "bmw,3 series" 
    ) 
)

You may also consider adding tags to identify which original rule selected any interactions you receive:

tag.car "Audi A1" { interaction.content contains_all "audi,a1" } 
tag.car "Audi A3" { interaction.content contains_all "audi,a3" } 
tag.car "BMW 1 series" { interaction.content contains_all "bmw,1 series" } 
tag.car "BMW 3 series" { interaction.content contains_all "bmw,3 series" } 
return { 
    ( 
        links.domain == "audi.com" AND 
        ( 
            interaction.content contains_all "audi,a1" 
            OR interaction.content contains_all "audi,a3" 
        ) 
    ) 
    OR 
    ( 
        links.domain == "bmw.com" AND 
        ( 
            interaction.content contains_all "bmw,1 series" 
            OR interaction.content contains_all "bmw,3 series" 
        ) 
    ) 
}

This filter will be less expensive to run because we have reduced the number of conditions and switched to use the contains_all operator. In addition this filter is more readable and more easily maintained going forwards.
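When you have many translated rules, the grouping step can be automated. The sketch below is illustrative only: merge_by_domain is a hypothetical helper, and it assumes every rule has the shape used in this section (one links.domain condition plus one contains_all condition on interaction.content).

```python
from collections import defaultdict


def merge_by_domain(rules: list[tuple[str, str]]) -> str:
    """Merge (domain, contains_all terms) pairs into one CSDL filter."""
    grouped = defaultdict(list)
    for domain, terms in rules:
        grouped[domain].append(f'interaction.content contains_all "{terms}"')

    branches = []
    for domain, conditions in grouped.items():
        content = " OR ".join(conditions)
        # The shared domain condition appears once per group.
        branches.append(f'(links.domain == "{domain}" AND ({content}))')
    return "\nOR\n".join(branches)


rules = [
    ("audi.com", "audi,a1"),
    ("audi.com", "audi,a3"),
    ("bmw.com", "bmw,1 series"),
    ("bmw.com", "bmw,3 series"),
]
print(merge_by_domain(rules))
```

The output has one branch per domain, joined by OR, which matches the structure of the combined filter in this section. Validate any generated CSDL before relying on it.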

Refactoring sampling conditions

The same is true for sampling rates. Often you will have common rates of sampling across many of your rules.

For example, if you have the following four rules in PowerTrack:

audi sample:60
bmw sample:30
"mercedes-benz" sample:30
"alfa romeo" sample:60

These rules will be individually translated into four separate CSDL filters:

interaction.content contains "audi" AND interaction.sample > 60

interaction.content contains "bmw" AND interaction.sample > 30

interaction.content contains "mercedes-benz" AND interaction.sample > 30

interaction.content contains "alfa romeo" AND interaction.sample > 60

As the rules share common sampling rates you can refactor the code into two filters:

interaction.content contains_any "audi, alfa romeo" AND interaction.sample > 60

interaction.content contains_any "bmw, mercedes-benz" AND interaction.sample > 30

In fact, you could rewrite this code in one filter:

(interaction.content contains_any "audi, alfa romeo" AND interaction.sample > 60) 
OR 
(interaction.content contains_any "bmw, mercedes-benz" AND interaction.sample > 30)

This has reduced the number of conditions from 8 to 4 and resulted in a more concise, maintainable filter.

Again you could add tags to identify which rules matched the interactions:

tag.brand "Audi" { interaction.content contains "audi" } 
tag.brand "BMW" { interaction.content contains "bmw" } 
tag.brand "Mercedes" { interaction.content contains "mercedes-benz" } 
tag.brand "Alfa Romeo" { interaction.content contains "alfa romeo" } 
return { 
    (interaction.content contains_any "audi, alfa romeo" AND interaction.sample > 60) 
    OR 
    (interaction.content contains_any "bmw, mercedes-benz" AND interaction.sample > 30) 
}