CSDL Optimization Techniques

Jason | 10th May 2012

There are plenty of ways to optimize your CSDL. Remember, the more efficient your CSDL is, the less it costs you to run.

Overusing Operators

Many people heavily overuse the or operator in the following way:

interaction.content contains "word1" or  
interaction.content contains "word2" or  
interaction.content contains "wordN"

Repeating this for 100 different keywords (a fairly conservative search!) gives a stream cost of 2.6 DPUs, or a running cost of $0.52 an hour.

The "contains ... or" syntax works, but it is computationally heavy. DataSift much prefers that you use the contains_any operator, and rewards you accordingly:

interaction.content contains_any "word1, word2, ...., wordN"

This syntax results in a stream cost of only 1 DPU for tracking 100 keywords, or a running cost of just $0.20 an hour. This is less than half the cost of the "contains ... or" syntax.
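If your keywords already live in a list, the rewrite above can be generated rather than typed by hand. The following is a minimal Python sketch; the helper name `build_contains_any` is hypothetical, and it rejects keywords containing commas or double quotes rather than guessing at how CSDL expects those to be escaped.

```python
def build_contains_any(target, keywords):
    """Return a single contains_any clause for the given target and keywords.

    Hypothetical helper for illustration; not part of any DataSift library.
    """
    cleaned = []
    for keyword in keywords:
        keyword = keyword.strip()
        if not keyword:
            continue  # skip empty entries
        if "," in keyword or '"' in keyword:
            # Escaping rules are not covered here, so refuse instead of guessing.
            raise ValueError(f"unsupported character in keyword: {keyword!r}")
        if keyword not in cleaned:
            cleaned.append(keyword)  # drop duplicates, keep the original order
    if not cleaned:
        raise ValueError("at least one keyword is required")
    joined = ", ".join(cleaned)
    return f'{target} contains_any "{joined}"'


if __name__ == "__main__":
    print(build_contains_any("interaction.content", ["word1", "word2", "wordN"]))
```

Generating the clause from a list also makes it harder to slip back into a long chain of or conditions as the keyword list grows.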


Many of our operators are charged at different rates because they differ in computational complexity. contains and contains_any are both variable-cost operators, each charged at its own rate. Used in this way, the cost of the contains operator increases by 0.1 DPU for every four extra values you add. The cost of the contains_any operator only increases after you add ten extra values. The cost of contains_any is also calculated on a sliding scale, so after you add your hundredth value to the list of keywords you track, you start to see even greater savings. Full details of how we charge for each operator can be found on our Understanding Billing page.
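To put the two figures side by side, the arithmetic can be sketched in Python. The price of $0.20 per DPU per hour is inferred from the examples in this post (1 DPU costing $0.20 an hour), not taken from a price list, so treat it as an assumption and check the Understanding Billing page for current rates.

```python
from decimal import Decimal

# Assumption: price inferred from this post's examples (1 DPU = $0.20 per hour).
PRICE_PER_DPU_HOUR = Decimal("0.20")


def hourly_cost(dpus):
    """Hourly running cost in dollars for a stream with the given DPU cost."""
    return Decimal(str(dpus)) * PRICE_PER_DPU_HOUR


or_chain = hourly_cost("2.6")      # 100 keywords joined with "contains ... or"
contains_any = hourly_cost("1.0")  # the same 100 keywords with contains_any
saving = 1 - contains_any / or_chain

print(f"or chain:     ${or_chain:.2f} per hour")
print(f"contains_any: ${contains_any:.2f} per hour")
print(f"saving:       {saving:.1%}")
```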

DRY - Don’t Repeat Yourself

“Don’t Repeat Yourself” is a concept in software engineering that is aimed at reducing the unnecessary repetition of information. The same concept can be applied to your CSDL, and it can help reduce your stream costs.

Some of you may be familiar with the stream keyword, which allows you to include one stream inside another as a sub-stream, or merge two or more streams.

Let’s look at a simple use case: we want to follow any mentions of two companies, on Twitter, in English, from users with a Klout score of 50 or more:

tag "Company1" { stream "1234567890" }  
tag "Company2" { stream "0987654321" }  
return {  
   stream "1234567890" or  
   stream "0987654321"  
}

Now, let’s imagine both of those sub-streams use the following template:

( interaction.content contains_any "Product1, Product2, Product3, CompanyName, CompanyNickname" or  
twitter.mentions == "CompanyTwitter" or  
links.title contains "CompanyName" ) and  
interaction.type == "twitter" and  
language.tag == "en" and  
klout.score >= 50

Note that each of these sub-streams costs 0.6 DPUs, so the “master stream” merging the two costs 1.2 DPUs.

With this kind of template, we want to keep any variable arguments, such as the companies’ names or Twitter handles, in the individual sub-streams. Any static arguments that are the same across all sub-streams, such as the Klout score or the language we are looking for, can be moved out of the sub-streams and up into the master stream to avoid repetition. This gives us a master stream like the following:

tag "Company1" { stream "1234567890" }  
tag "Company2" { stream "0987654321" }  
return {  
   ( stream "1234567890" or  
   stream "0987654321" ) and  
   interaction.type == "twitter" and  
   language.tag == "en" and  
   klout.score >= 50  
}

And sub-streams like so:

( interaction.content contains_any "Product1, Product2, Product3, CompanyName, CompanyNickname" or  
twitter.mentions == "CompanyTwitter" or  
links.title contains "CompanyName" )

Removing the CSDL repetition like this drops the cost of each sub-stream from 0.6 DPUs to 0.3 DPUs, because the shared filters are now evaluated once in the master stream instead of once per sub-stream. This lowers the cost of the master stream from 1.2 DPUs to 0.9 DPUs.

We are only reducing the DPU cost by 25% in this case, but if you were to remove this degree of repetition from a hundred sub-streams, this could save you a huge amount in DPU costs.
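The scaling claim can be checked with a little arithmetic. This Python sketch assumes that a master stream's cost is simply the sum of its sub-streams plus any filters applied in the master stream itself, which is what the 1.2 and 0.9 DPU figures above imply. Actual billing may not scale linearly, so treat the numbers as indicative only.

```python
from decimal import Decimal

# Figures taken from, or inferred from, the example above.
FULL_SUBSTREAM = Decimal("0.6")     # sub-stream that repeats the shared filters
TRIMMED_SUBSTREAM = Decimal("0.3")  # sub-stream with the shared filters removed
SHARED_FILTERS = Decimal("0.3")     # inferred: 0.9 total minus two 0.3 sub-streams


def master_cost(substreams, deduplicated):
    """Estimated master stream cost in DPUs, assuming costs simply add up."""
    if deduplicated:
        return substreams * TRIMMED_SUBSTREAM + SHARED_FILTERS
    return substreams * FULL_SUBSTREAM


for count in (2, 100):
    before = master_cost(count, deduplicated=False)
    after = master_cost(count, deduplicated=True)
    saved = 1 - after / before
    print(f"{count} sub-streams: {before} -> {after} DPUs ({saved:.1%} saved)")
```

Under that additive assumption, the saving starts at 25% for two sub-streams and approaches 50% as more sub-streams share the same filters.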


Repeating your arguments in CSDL is not necessary. When you put an argument in a “master stream”, it trickles down and is applied to every sub-stream that the master stream includes. This also makes your CSDL more maintainable, because you include the argument once instead of once per sub-stream.
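If you maintain many sub-streams, the master stream can be generated from a mapping of tags to stream hashes plus a list of shared filters. The Python sketch below only assembles CSDL text in the shape used in the example above; the function name `build_master_stream` is hypothetical and nothing here calls a DataSift API.

```python
def build_master_stream(tagged_streams, shared_filters):
    """Assemble master stream CSDL from {tag: stream_hash} and shared filters.

    Hypothetical helper for illustration; it only builds a string.
    """
    if not tagged_streams:
        raise ValueError("at least one sub-stream is required")

    # One tag line per sub-stream.
    lines = [
        f'tag "{tag}" {{ stream "{stream_hash}" }}'
        for tag, stream_hash in tagged_streams.items()
    ]

    # Sub-streams are OR-ed together; shared filters are AND-ed on once.
    substreams = " or\n   ".join(
        f'stream "{stream_hash}"' for stream_hash in tagged_streams.values()
    )
    conditions = [f"( {substreams} )"] + list(shared_filters)

    lines.append("return {")
    lines.append("   " + " and\n   ".join(conditions))
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_master_stream(
        {"Company1": "1234567890", "Company2": "0987654321"},
        ['interaction.type == "twitter"', 'language.tag == "en"', "klout.score >= 50"],
    ))
```

Keeping the shared filters in one list means a change such as a different Klout threshold is made in a single place.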

Choose your language carefully

We offer the language.tag target, allowing you to specify which languages you would like your interactions to be written in.

As an example, if you work for the French branch of the worldwide organization ‘AcmeCo’, you are unlikely to be interested in any Tweets about AcmeCo written in any languages other than French. You can use your CSDL to filter out languages you are not interested in like so:

interaction.content contains "AcmeCo" and  
language.tag == "fr"


When searching for a keyword such as a name, place or company that is the same in any language, you will often receive a number of interactions that you can’t use because they are not written in your language. Adding the language augmentation to your CSDL can help stop you from receiving interactions in languages that are not relevant to your needs. This will make a minor difference in cost by reducing your license fees. If you are searching for a keyword such as “dictionary”, which only appears in one language, adding the language augmentation to your CSDL is unlikely to make much of a difference.

Do you really need all those Retweets?

A CSDL query like the following will return all Tweets containing the keyword “coffee”, and all Retweets containing that keyword.

interaction.content contains "coffee"

If someone sends a hilarious Tweet about coffee, and it gets Retweeted 50 times, do you really need your filter to collect all of those Retweets, or are you just interested in the original Tweet? Conversely, are you really interested in that initial Tweet, or do you just want to see the more popular Tweets that have been Retweeted 50 times or more?


You are charged for every Tweet you receive. Filtering out these unnecessary Tweets can save you a huge amount in license costs. If you don’t need to receive a large number of Retweets, you can simply filter them out. Consider using our twitter.text or twitter.retweet.text targets instead of interaction.content.
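As a sketch of that choice, the Python helper below picks a target according to which Tweets you want. The mapping is an assumption based on the target names mentioned above (`twitter.text` for original Tweets, `twitter.retweet.text` for Retweets); confirm the exact behaviour of each target in the documentation before relying on it. The helper name `keyword_filter` is hypothetical.

```python
# Assumed mapping from the kind of Tweet you want to the target to filter on.
TARGETS = {
    "all": "interaction.content",        # Tweets and Retweets, as in the query above
    "originals": "twitter.text",         # assumption: original Tweets only
    "retweets": "twitter.retweet.text",  # assumption: Retweets only
}


def keyword_filter(keyword, scope="all"):
    """Return a one-line CSDL filter for the keyword, limited to the given scope."""
    if scope not in TARGETS:
        raise ValueError(f"scope must be one of {sorted(TARGETS)}")
    return f'{TARGETS[scope]} contains "{keyword}"'


if __name__ == "__main__":
    for scope in TARGETS:
        print(keyword_filter("coffee", scope))
```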

Ensure your interactions are relevant to your requirements

If, for example, you set up a stream looking for positive and negative sentiment about a certain political candidate, you will want to ensure the Tweets you receive are about the subject in question. One way to do this is to use our Salience Topics augmentation:

interaction.content contains_any "Candidate One, Candidate Two" and  
salience.content.topics contains "Politics"

Doing this helps ensure the Tweet has been written on the subject you are concerned with. For example, a filter like the one above would help ensure that the "Joe Bloggs" mentioned in a Tweet is the Joe Bloggs running for the local government election, not some other Joe Bloggs.

Consider another example: if you are running an advertising campaign to promote your latest feminine beauty products, are you just interested in what women are saying about your new product, or do you want to hear what men have to say about it too? You could add our demographic.gender target to your filter to only include the gender you are interested in hearing from.


Adding Salience Topics to your CSDL filter is a powerful tool which can help ensure you only collect relevant interactions from your filter. This technique can be used for a variety of use cases where you are only interested in collecting interactions from a certain demographic, or you are not interested in collecting interactions unless they contain a certain augmentation or value. Excluding interactions from your filter in this way is yet another method of reducing your license costs.


We have discussed five different methods of reducing either the cost of your CSDL filter or your license fees.

To make cheap and effective CSDL filters, you just need to spend a little time thinking about exactly what you really need to receive, and what you can do without. Run your new stream for a few minutes to see what kind of content your filter returns. You can use this to judge what needs to be filtered out or if any important keywords are missing from your filter.

Look at the interactions you are receiving. How do you intend to use them? Are there any interactions you can’t use to their full potential? Filter them out!
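One way to review a short sample is to count how it breaks down, for instance by language. The Python sketch below assumes each interaction has been parsed from JSON into a dictionary with a nested `language.tag` field mirroring the CSDL target name. That payload shape and the helper name `share_by_language` are assumptions, so adjust the lookup to match the data you actually receive.

```python
from collections import Counter


def share_by_language(interactions):
    """Return {language tag: share of the sample}, using "unknown" when absent.

    Assumes each interaction is a dict shaped like {"language": {"tag": "en"}}.
    """
    tags = Counter(
        (item.get("language") or {}).get("tag", "unknown")
        for item in interactions
    )
    total = sum(tags.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in tags.items()}


if __name__ == "__main__":
    # Made-up sample purely for illustration.
    sample = [
        {"language": {"tag": "fr"}},
        {"language": {"tag": "en"}},
        {"language": {"tag": "en"}},
        {},  # no language information present
    ]
    for tag, share in sorted(share_by_language(sample).items()):
        print(f"{tag}: {share:.0%}")
```

If most of the sample is in languages you cannot use, adding a language.tag condition is likely to be worthwhile. The same counting approach works for any other field you are considering filtering on.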

Full details of the DataSift billing model are available on the Understanding Billing documentation page.
