Migrating from the Twitter Streaming API

Jason | 9th January 2012

When comparing the DataSift streaming API to the Twitter Streaming API, some people have noticed differences in the number of Tweets returned. The main differences occur when using the Twitter Streaming API "track" method to search for keywords.

If we were to search the Twitter stream using the following request:

curl -d "track=techcrunch" "https://stream.twitter.com/1/statuses/filter.json" -uMyScreenName

we might expect to receive the same results by running the following CSDL:

twitter.text contains "TechCrunch"

However, the Twitter track method actually filters on more than just the Tweet text. It also filters on the Twitter screen name, on the URL originally entered into the Tweet, and on the same data within Retweets.

A more accurate CSDL rule to capture the same data as the Twitter track stream would be the following:

twitter.text contains "TechCrunch" or
twitter.domains contains "TechCrunch" or
twitter.mentions contains "TechCrunch" or
twitter.retweet.text contains "TechCrunch" or
twitter.retweet.domains contains "TechCrunch"
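To illustrate why the text-only rule returns fewer Tweets than the broader one, here is a minimal Python sketch. It assumes simplified Tweet records held as dictionaries; the keys (`text`, `domains`, `mentions`, `retweet_text`, `retweet_domains`) and the helper names are hypothetical stand-ins for the CSDL targets above, not DataSift's actual output format, and matching is assumed to be a case-insensitive substring test.

```python
# Hypothetical, simplified Tweet records. The keys loosely mirror the CSDL
# targets used above; they are not DataSift's actual output format.
TWEETS = [
    {"text": "Reading TechCrunch this morning", "domains": []},
    {"text": "Worth a read http://t.co/TU7L2pHo", "domains": ["techcrunch.com"]},
    {"text": "Lunch was great today", "domains": ["example.com"]},
]

TEXT_ONLY = ["text"]
TRACK_LIKE = ["text", "domains", "mentions", "retweet_text", "retweet_domains"]


def contains(value, keyword):
    """Case-insensitive substring test on a string or a list of strings."""
    values = [value] if isinstance(value, str) else value
    return any(keyword.lower() in v.lower() for v in values)


def matches(tweet, keyword, fields):
    """True if the keyword appears in any of the named fields of the Tweet."""
    return any(contains(tweet.get(field, []), keyword) for field in fields)


text_only = [t for t in TWEETS if matches(t, "TechCrunch", TEXT_ONLY)]
track_like = [t for t in TWEETS if matches(t, "TechCrunch", TRACK_LIKE)]

print(len(text_only), "Tweet(s) matched on text alone")
print(len(track_like), "Tweet(s) matched on the wider set of fields")
```

The second record mentions TechCrunch only through its link domain, so it is missed by the text-only check and found by the wider one.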

Unfortunately, it is not currently possible to make the DataSift stream match the Twitter track stream exactly, because Twitter and DataSift handle link resolution differently. Twitter shortens every Tweeted link to a t.co link. On delivery, Twitter resolves that link back to the original URL that was entered into the Tweet, regardless of whether that was a fully resolved link or a link shortened by another service such as bit.ly or goo.gl. DataSift, however, fully resolves the link to its final URL on delivery.

As an example, if you were to search for links from feedproxy.google.com, the following link, shown at each stage of resolution, would be picked up by Twitter track but not by DataSift:

http://t.co/TU7L2pHo -- Twitter shortened link

http://feedproxy.google.com/~r/Techcrunch/~3/dskZ2POuTqE/ -- The original Tweeted link. Twitter only unravels links this far

http://techcrunch.com/2012/01/04/zynga-brings-social-gameplay-to-concealed-object-puzzles-with-newest-facebook-title-hidden-chronicles/ -- DataSift's fully resolved link

Using this same example, if we were to search the URL for "Zynga", the link would be picked up by DataSift but not by Twitter track, because "zynga" appears only in the fully resolved URL.
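The difference can be sketched in a few lines of Python. This is only an illustration: the redirect chain is hard-coded from the example above (no network requests are made), and the function names `twitter_view`, `datasift_view` and `url_contains` are hypothetical, chosen here to stand for "the URL as originally Tweeted" and "the fully resolved URL".

```python
# Redirect chain taken from the example above; hard-coded, no network access.
CHAIN = [
    "http://t.co/TU7L2pHo",
    "http://feedproxy.google.com/~r/Techcrunch/~3/dskZ2POuTqE/",
    "http://techcrunch.com/2012/01/04/zynga-brings-social-gameplay-to-"
    "concealed-object-puzzles-with-newest-facebook-title-hidden-chronicles/",
]


def twitter_view(chain):
    """The URL as originally entered into the Tweet (one hop past t.co)."""
    return chain[1]


def datasift_view(chain):
    """The final URL once every redirect has been followed."""
    return chain[-1]


def url_contains(url, keyword):
    """Case-insensitive substring test against a URL."""
    return keyword.lower() in url.lower()


for keyword in ("feedproxy.google.com", "zynga"):
    print(
        keyword,
        "| original URL:", url_contains(twitter_view(CHAIN), keyword),
        "| resolved URL:", url_contains(datasift_view(CHAIN), keyword),
    )
```

A search for feedproxy.google.com matches only the original URL, while a search for Zynga matches only the resolved one.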

One of the major benefits DataSift offers over the Twitter track method is that we can look further into the data and pick up many Tweets that the Twitter Streaming API may miss. Below is a much broader CSDL rule that filters for even more posts, from Twitter and our other data sources, that contain references to TechCrunch:

interaction.content contains "TechCrunch" or
twitter.mentions contains "TechCrunch" or
twitter.retweet.text contains "TechCrunch" or
links.url contains "TechCrunch" or
links.title contains "TechCrunch"

The above CSDL rule looks for all interactions (from all data sources, including Twitter, Facebook and Wikipedia) containing the word TechCrunch, any Tweets that mention the @TechCrunch user, any Retweets mentioning TechCrunch, and any links (again, from all data sources) that contain TechCrunch in either the URL or the page title.
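If you need the same rule for several keywords, it can be generated rather than typed by hand. The following Python sketch is one way to do that; `build_csdl` and `DEEP_TARGETS` are hypothetical names introduced for this example and are not part of any DataSift client library. It rejects keywords containing double quotes rather than guessing at escaping rules.

```python
# Targets taken from the rule above. The helper itself is an illustration,
# not part of any DataSift client library.
DEEP_TARGETS = [
    "interaction.content",
    "twitter.mentions",
    "twitter.retweet.text",
    "links.url",
    "links.title",
]


def build_csdl(keyword, targets=DEEP_TARGETS):
    """Join one 'contains' clause per target with 'or', one clause per line."""
    if '"' in keyword:
        # String escaping is not handled in this sketch.
        raise ValueError("keyword must not contain double quotes")
    clauses = [f'{target} contains "{keyword}"' for target in targets]
    return " or\n".join(clauses)


print(build_csdl("TechCrunch"))
```

Called with "TechCrunch", this prints the five-line rule shown above.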
