Blog posts in Case studies

Migrating from the Twitter Streaming API

When comparing the DataSift Streaming API to the Twitter Streaming API, some people have noticed differences in the number of Tweets being returned. The main differences occur when using the Twitter Streaming API "track" method to search for keywords.

If we were to search the Twitter stream using the following request:

    curl -d "track=techcrunch" "" -uMyScreenName

we may expect to receive the same results by running the following CSDL:

However, the Twitter track method actually filters on more than just the Tweet text. It also filters on the Twitter screen name, the URL originally entered into the Tweet, and the same fields in Retweets.
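To make the difference concrete, here is a toy Python model of the two matching behaviours. It is only a sketch of the idea described above: the dictionary keys (`text`, `screen_name`, `urls`, `retweeted`) are hypothetical and are not the real Twitter or DataSift schema.

```python
# Toy model only. The field names are assumptions made for illustration.

def matches_text_only(tweet, keyword):
    """Match the keyword against the Tweet text and nothing else."""
    return keyword.lower() in tweet.get("text", "").lower()


def matches_track_like(tweet, keyword):
    """Match against text, screen name and originally entered URLs,
    then repeat the same check on the retweeted Tweet, if any."""
    kw = keyword.lower()
    fields = [tweet.get("text", ""), tweet.get("screen_name", "")]
    fields.extend(tweet.get("urls", []))
    if any(kw in field.lower() for field in fields):
        return True
    retweeted = tweet.get("retweeted")
    return retweeted is not None and matches_track_like(retweeted, keyword)


# A Tweet whose text never mentions the keyword, posted by a matching account.
example = {"text": "Great write-up on startups", "screen_name": "TechCrunch", "urls": []}
print(matches_text_only(example, "techcrunch"))   # False
print(matches_track_like(example, "techcrunch"))  # True
```

A filter that only inspects the text would miss this Tweet, which is one way the two streams can end up with different counts.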

A more accurate CSDL rule to capture the same data as the Twitter track stream would be the following:

Unfortunately, it is not currently possible to make the DataSift stream exactly match the Twitter track stream. The reason is the way Twitter and DataSift handle link resolution. Any Tweeted link is shortened by Twitter. On delivery, Twitter resolves the link to the original URL that was entered into the Tweet (regardless of whether this was a fully resolved link or a link shortened by another service). DataSift, however, fully resolves the link to its final URL on delivery.

As an example, if you were to search for links from a third-party shortening service, a Tweeted link would pass through the following stages, and would be picked up by Twitter track but not by DataSift:

  1. The Twitter-shortened link.
  2. The original Tweeted link. Twitter only unravels links this far.
  3. DataSift's fully resolved link.

Using this same example, if we were to search the URL for "Zynga", it would be picked up by DataSift but not by Twitter track.


One of the major benefits DataSift offers over the Twitter track method is that we can look further into the data and pick up many Tweets that the Twitter Streaming API may not be able to pick up. Below is a much deeper CSDL rule that filters for even more posts, from Twitter and our other data sources, that contain references to TechCrunch:

So, the above CSDL rule looks for all interactions (from all data sources: Twitter, Facebook and Wikipedia, amongst others) containing the word TechCrunch, any Tweets that mention the @TechCrunch user, any Retweets mentioning TechCrunch, and any links (again, from all data sources) that contain TechCrunch in either the URL or the page title.

Monitoring Eurozone sentiment for just 20 cents an hour


I wrote a stream today to monitor social media commentary on the meeting between German Chancellor Angela Merkel and French President Nicolas Sarkozy in Paris. It’s “the start of a crucial week for the Eurozone,” one report read, and it almost sounded like understatement. Definitely, sentiment is going to be worth watching today.

Let’s ask some questions about this stream:

  1. What would it cost to run for 24 hours?
  2. How is the cost calculated?
  3. What are all those tags for?
  4. How much output does the stream deliver?



What does it cost?

The filter has 14 lines of code, but the good news is that only one of them is chargeable. Here it is:

    interaction.content contains_any "Merkel, Sarkozy, #Merkel, #Sarkozy, #euro, #Osbourne"

The other lines cost you nothing at all.


Our Understanding Billing page shows the way DataSift calculates costs for each operator. Here we're using the contains_any operator with 6 arguments:

    "Merkel, Sarkozy, #Merkel, #Sarkozy, #euro, #Osbourne"

The billing documentation indicates that the cost would be 0.2 DPU per hour.

In fact, you can include up to 10 arguments and still only pay 0.2 DPU.


What's a DPU?

The simplest way to think about it is:

  1. DPUs are a measure of cost per hour to run a stream.
  2. A DPU is currently equivalent to 20 US cents.

Now, DataSift's minimum charge is 1 DPU per hour. Because our filter's 0.2 DPU falls below that minimum, it is billed at 1 DPU, so the overall cost to run our Eurozone stream is 20 cents per hour, or $4.80 for an entire 24 hours' worth of focused data.
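The arithmetic can be sketched in a few lines of Python. The constants come from the figures quoted in this post (20 US cents per DPU, a minimum of 1 DPU per hour); the function names are made up for illustration and are not part of any DataSift library.

```python
# Figures quoted in this post; they may change over time.
DPU_PRICE_USD = 0.20      # price of one DPU per hour
MIN_DPU_PER_HOUR = 1.0    # minimum hourly charge, in DPUs


def hourly_cost_usd(stream_dpu):
    """Hourly cost of a stream, applying the minimum charge."""
    billable_dpu = max(stream_dpu, MIN_DPU_PER_HOUR)
    return billable_dpu * DPU_PRICE_USD


def cost_for_hours(stream_dpu, hours):
    """Total cost of running a stream for a number of hours, rounded to cents."""
    return round(hourly_cost_usd(stream_dpu) * hours, 2)


print(cost_for_hours(0.2, 1))   # 0.2
print(cost_for_hours(0.2, 24))  # 4.8
```

Because of the minimum charge, any filter costing up to 1 DPU comes out at the same hourly price in this sketch.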


What are those tags for?

They're a feature of CSDL that allows you to add metadata on a conditional basis. For example:

    tag "Klout 20+" { klout.score >= 20 AND klout.score < 30 }

This command adds a tag "Klout 20+" to every object that comes from a user who has a Klout score between 20 and 29.

Most real-world applications built on DataSift use our API and one of our client libraries. After DataSift has passed objects to your client application, your own code can examine the metadata and perform any analysis you choose. For instance, it would be very easy to generate a bar chart with frequency on the vertical (y) axis and Klout range on the horizontal (x) axis.
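As a rough sketch of that kind of client-side analysis, the Python below counts how often each tag appears and prints a simple text chart. It assumes each delivered object has already been decoded into a dictionary with a `tags` list. That shape is an assumption for illustration, not the documented DataSift output format, and the tag names other than "Klout 20+" are invented.

```python
from collections import Counter


def tag_frequencies(objects):
    """Count how many objects carry each tag."""
    counts = Counter()
    for obj in objects:
        counts.update(obj.get("tags", []))  # objects without tags are skipped
    return counts


def text_bar_chart(counts):
    """Render one line per tag, with one '#' per matching object."""
    return [f"{tag:<12}{'#' * counts[tag]}" for tag in sorted(counts)]


# Hypothetical sample of decoded objects.
sample = [
    {"tags": ["Klout 20+"]},
    {"tags": ["Klout 30+"]},
    {"tags": ["Klout 20+"]},
    {},  # an object that matched no tag rule
]

for line in text_bar_chart(tag_frequencies(sample)):
    print(line)
```

The same counts could be handed to any plotting library to draw the bar chart described above.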


How much data does this stream produce?

We wrote a few simple lines of PHP to sample the stream for 30 minutes and received 2,727 objects.

Standard and Poor's Downgrades US Banks

Here's a filter that collects comments and sentiment on Standard and Poor's downgrade of US banks. With DataSift, it's easy to filter out the insignificant content and focus on the things that are being retweeted, the thoughts from key players, and the comments that have strong sentiment.



  1. It delivers posts that include "credit rating(s)" and mention "bank(s)" or six specific major US banks.
  2. It restricts the output to posts written in English.
  3. It restricts the output to posts that include some non-trivial sentiment, or posts that have been retweeted more than five times, or posts written by authors with a significant Klout score.
  4. And, finally, it tags each object for positive or negative sentiment.

The cost is 1.1 DPU, which means that this stream is far from expensive to run. If you ran it for an hour and it delivered 1,000 Tweets, the price would be:
     1.1 * 0.20 + 1000 * 0.0001 = $0.32
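The same calculation as a small Python sketch, in case you want to try other volumes. The $0.20 per DPU and $0.0001 per Tweet figures are simply read off the formula above; the function name is hypothetical.

```python
# Rates taken from the formula above; treat them as illustrative.
DPU_PRICE_USD = 0.20
PER_TWEET_USD = 0.0001


def stream_price_usd(dpu, hours, tweets):
    """Processing cost (DPU x price x hours) plus the per-Tweet charge."""
    processing = dpu * DPU_PRICE_USD * hours
    delivery = tweets * PER_TWEET_USD
    return round(processing + delivery, 2)


print(stream_price_usd(1.1, 1, 1000))  # 0.32
```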

What you can do next

You can also use DataSift's powerful API to consume this data and perform your own processing on it, such as post-processing the sentiment tags that we added.