Newcomer's Guide to DataSift's Streaming API

Ed Stenson | 18th June 2012

A couple of days ago, a data scientist friend asked me how to stream data through DataSift's APIs. He'd been recording his streams and then analyzing the JSON or CSV output, so he's already familiar with creating streams and running them. Now he wants to go to the next level.

Here are the steps I recommended:

  1. First, write the CSDL code for the stream. For example:

    twitter.text contains "Data Cola"

    You can do this in the UI or via an API call to the /compile endpoint in DataSift's REST API. I find it convenient to write in the UI because I tend to create complex filters that require long CSDL files.

  2. To run a stream in the API you need to supply three things:

    • the hash for your CSDL
    • your DataSift username
    • your API key

    If you compile your CSDL via the REST API, the /compile endpoint returns a JSON object that contains the hash. That response is the only place the API gives you the hash, so keep it. Alternatively, if you create your CSDL in the UI, just hit the green "Use stream" button and DataSift displays the hash along with your username and API key.
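
As a minimal sketch of pulling the hash out of the /compile response in Python: the code below assumes the hash sits under a top-level "hash" key, which the text above implies but does not spell out, and the sample response body is made up for illustration.

```python
import json


def extract_hash(compile_response: str) -> str:
    """Return the stream hash from the JSON text returned by /compile.

    Assumption: the hash is stored under a top-level "hash" key. Adjust
    the key if the real response is shaped differently.
    """
    payload = json.loads(compile_response)
    if "hash" not in payload:
        raise KeyError("no 'hash' field in compile response")
    return payload["hash"]


# Hypothetical response body, for illustration only.
sample = '{"hash": "0123456789abcdef0123456789abcdef"}'
print(extract_hash(sample))
```

Raising an error when the key is missing makes a failed compile obvious straight away, rather than letting an empty hash reach the streaming call.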

  3. If you're using the HTTP protocol, the streaming endpoint is Let's try it out. In the address bar of your browser, paste:

    You'll receive a JSON object with a warning that you didn't supply a username or API key:

    {"status":"failure","message":"A username and API key are both required"}

    A valid call looks like this:<hash>?username=<yourusername>&api_key=<yourapikey>

    If you're not used to making API calls, take particular note of the position of the ? and & symbols.

    Here's a real example:

    My username here is DataSift (it IS case sensitive) but this call won't run because the API key that I've shown for illustration is out of date. However, if you substitute your own hash, username, and API key and run this call in a browser, you'll see your data streaming in real time.
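
If you'd rather not place the ? and & by hand, the call can be assembled in code. This is only a sketch: the base URL is a placeholder (stream.example.invalid), not the real endpoint, and the function names are my own. The parameter names username and api_key and the failure object come from the examples above.

```python
import json
from urllib.parse import urlencode


def build_stream_url(base_url: str, stream_hash: str, username: str, api_key: str) -> str:
    """Join the endpoint, the hash, and the two credentials into one URL.

    The hash follows the endpoint directly; "?" starts the query string
    and "&" separates its parameters.
    """
    query = urlencode({"username": username, "api_key": api_key})
    return f"{base_url}{stream_hash}?{query}"


def failure_message(body: str):
    """Return the error message if the body is a failure object, else None."""
    try:
        payload = json.loads(body)
    except ValueError:
        return None  # not JSON at all
    if isinstance(payload, dict) and payload.get("status") == "failure":
        return payload.get("message")
    return None


# Placeholder endpoint and credentials, for illustration only.
print(build_stream_url("http://stream.example.invalid/", "myhash", "DataSift", "mykey"))
```

urlencode also escapes any characters in the username or key that aren't safe in a URL, which hand-built strings tend to miss.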

  4. Now that we've tested the call in a browser, let's try running it for real. I'll use a terminal in Linux for this example. If you're a Mac user, you can do the same. On Windows machines, I find it easiest to boot into Ubuntu from a USB stick. At the Linux command line, you cannot call an API endpoint directly. Instead, you need an HTTP client, and I use cURL. It's very simple: just add 'curl' before the call, like this:


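    I usually use the -s flag, which puts cURL into silent mode and suppresses its progress meter and error messages. I'm told it's a good idea to place the HTTP call in single quotes. Without them, the shell treats the & as an instruction to run the command in the background, and everything after it, including your API key, never reaches cURL. With quotes, the call looks like this: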

    curl -s ''

    Those are the main things you need to understand when you want to run DataSift's streaming API.
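
The same call can be made from Python without cURL. The sketch below assumes the stream delivers one JSON object per line, with the occasional blank line in between. That assumption is mine, so check it against what you see in the terminal. The function names are hypothetical, and only the standard library is used.

```python
import json
from urllib.request import urlopen


def parse_stream(lines):
    """Yield one dict per non-empty line of newline-delimited JSON.

    `lines` is any iterable of bytes or str, such as an open HTTP
    response. Blank lines are skipped.
    """
    for raw in lines:
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        text = text.strip()
        if text:
            yield json.loads(text)


def stream(url: str):
    """Open the streaming URL and yield interactions as they arrive."""
    with urlopen(url) as response:
        yield from parse_stream(response)
```

Keeping the parsing separate from the network call means you can try parse_stream on a recorded file before pointing stream at a live URL.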

  5. There are a few last things to say. I often write scripts in PHP or Python to analyze the output of my streams. It's very easy to use the pipe mechanism to feed the output of a stream into a script. Here's an example, which simply counts how many interactions I've received:

    curl -s '' | python

    Or, I might want to find all the blog domains that DataSift receives, and store them in a text file. In that case, I use the pipe, just as before, and then redirect the output of my script to a file:

    curl -s '' | php blogs.php > blogdomains.txt
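
For a sense of what sits on the right-hand side of those pipes, here is a rough Python sketch of both jobs: counting interactions and collecting domains. It is not the script I actually run. It assumes one JSON object per line on standard input and, for the domains, that each interaction carries its URL at interaction.link, which is a hypothetical field path you may need to change.

```python
import json
import sys
from urllib.parse import urlparse


def count_interactions(lines) -> int:
    """Count the lines that parse as JSON objects."""
    total = 0
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip blank lines and anything that is not JSON
        if isinstance(record, dict):
            total += 1
    return total


def link_domains(lines):
    """Return the sorted, de-duplicated domains of each interaction's link.

    Assumption: the URL sits at record["interaction"]["link"].
    """
    domains = set()
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue
        if not isinstance(record, dict):
            continue
        interaction = record.get("interaction")
        link = interaction.get("link") if isinstance(interaction, dict) else None
        if isinstance(link, str):
            host = urlparse(link).netloc
            if host:
                domains.add(host)
    return sorted(domains)


def main(stream=sys.stdin) -> None:
    """Print the interaction count once the piped input ends."""
    print(count_interactions(stream))
```

Saved as a file that ends with a call to main(), this can take the place of the script after the pipe. Bear in mind that the count is only printed when the stream closes, so for a long-running stream you would print a running total instead.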


Streaming via DataSift's API is very easy. There's a lot more to learn but the techniques that we've covered here are sufficient for use in a production environment.

To learn more, visit our documentation suite, where you can learn how to run multiple streams, study all the REST API endpoints, and discover the Historics API endpoints that allow you to run playback queries on our archive. If you prefer the WebSockets protocol to HTTP, we have ws:// endpoints for WebSockets streaming, and there are endpoints that help you to track your usage and DPU consumption, too.

Have fun!
