Streams within streams

Ed Stenson | 19th December 2011

The Curated Stream Definition Language (CSDL) in DataSift allows one stream to call another stream. The technique adds some interesting new possibilities. This blog post is a tutorial for beginners.

First, let's set up three example streams. We'll keep them very simple because we're only illustrating the concept.

Here's a stream to filter for posts that mention football.

interaction.content contains "football"

Here's one to filter for posts that mention golf.

interaction.content contains "golf"

And here's one to filter for posts that mention tennis.

interaction.content contains "tennis"

You can run these streams individually, of course, but you can also run them all at the same time to filter for posts about any of the three sports. First you'll need to find the hash for each stream. Go to the preview screen for the stream and click the green "Use stream" button.

  • The hash for our football stream is 34be61ef6e901da4fdab86dc5dfab972
  • The hash for our golf stream is 4c80fa58628239f6579496f0634dcecc
  • The hash for our tennis stream is 92fa0a3fad2de9c2ee0aa8112ac9341f

Here's the CSDL to reuse all three, combining them into one single stream. Notice that the code uses OR as the logical operator to pull in data from all the separate streams:

return {
    stream "34be61ef6e901da4fdab86dc5dfab972" or
    stream "4c80fa58628239f6579496f0634dcecc" or
    stream "92fa0a3fad2de9c2ee0aa8112ac9341f"
}
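If you have more than a handful of hashes, you might prefer to generate the combined CSDL rather than type it by hand. Here is a minimal Python sketch of that idea. The helper name combine_streams is hypothetical and is not part of any DataSift client library; it only builds the CSDL text, which you would then paste in or submit yourself.

```python
def combine_streams(hashes):
    """Build a CSDL return block that ORs together the given stream hashes."""
    if not hashes:
        raise ValueError("at least one stream hash is required")
    # One 'stream "<hash>"' clause per hash, indented to match the example above.
    clauses = [f'    stream "{h}"' for h in hashes]
    return "return {\n" + " or\n".join(clauses) + "\n}"


# The three hashes from the example streams above.
SPORT_HASHES = [
    "34be61ef6e901da4fdab86dc5dfab972",  # football
    "4c80fa58628239f6579496f0634dcecc",  # golf
    "92fa0a3fad2de9c2ee0aa8112ac9341f",  # tennis
]

csdl = combine_streams(SPORT_HASHES)
print(csdl)
```

The printed text should match the hand-written CSDL shown above.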

Categorizing the objects

An object delivered by this stream might relate to football but it might not, so it's useful to add a tag, a small piece of metadata that indicates which stream it came from. Here's how you do it.

tag "football" {stream "34be61ef6e901da4fdab86dc5dfab972"}
tag "golf"     {stream "4c80fa58628239f6579496f0634dcecc"}
tag "tennis"   {stream "92fa0a3fad2de9c2ee0aa8112ac9341f"}

return {
    stream "34be61ef6e901da4fdab86dc5dfab972" or
    stream "4c80fa58628239f6579496f0634dcecc" or
    stream "92fa0a3fad2de9c2ee0aa8112ac9341f"
}

The metadata tag appears in the "interaction" section of the output object. Here's a segment, taken from a sample JSON object:

{
    "interaction": {
        "source": "TweetDeck",
        "author": {
            "username": "stewarttownsend",
            "name": "Stewart Townsend",
            "id": 14065694,
            "avatar": "http://a2.twimg.com/profile_images/1302306721/twitterpic_normal.jpg",
            "link": "http://twitter.com/stewarttownsend"
        },
        "type": "twitter",
        "link": "http://twitter.com/stewarttownsend/statuses/136447843652214784",
        "created_at": "Tue, 15 Nov 2011 14:17:55 +0000",
        "content": "Morning San Francisco - 36 hours and counting.. #datasift",
        "id": "1e10f949c51aab80e074df944f5e8e46",
        "tags": [
            "football"
        ]
    },
    "twitter": {
        "user": {
            "name": "Stewart Townsend",
            "url": "http://www.stewarttownsend.com",

and so on....

These tags make your post-processing client code easy to write. For example, to derive data for a frequency chart, your client software would simply read the tags on each object and count how often each one occurs.
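As a rough illustration of that counting step, here is a short Python sketch. It assumes the objects arrive as JSON strings shaped like the sample above, with the tags listed under interaction.tags. The function name count_tags and the sample data are made up for this example, and the sketch also assumes that an object may carry more than one tag, or no "tags" key at all.

```python
import json
from collections import Counter


def count_tags(json_objects):
    """Count how often each tag appears across JSON-encoded output objects."""
    counts = Counter()
    for raw in json_objects:
        obj = json.loads(raw)
        # Assumption: an object that matched no tag rule may lack "tags" entirely.
        counts.update(obj.get("interaction", {}).get("tags", []))
    return counts


# Hypothetical, trimmed-down objects; real ones carry many more fields.
sample = [
    '{"interaction": {"content": "Great football match", "tags": ["football"]}}',
    '{"interaction": {"content": "Golf, then football", "tags": ["football", "golf"]}}',
    '{"interaction": {"content": "Tennis final today", "tags": ["tennis"]}}',
]

tag_counts = count_tags(sample)
for tag, n in tag_counts.most_common():
    print(tag, n)
```

The resulting counts are exactly what a frequency chart needs: one number per tag.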

The client software communicates via one of DataSift's APIs. You can write it in a variety of languages; we have client libraries in C#, Java, Node.js, PHP, and Ruby, so you can perform your own analysis.

A practical example

To round off, here's a real-world example of tagging in use. The first block of CSDL code filters for posts relating to top-selling artists in the UK this holiday season:

interaction.content near "Amy,Winehouse,Lioness,Hidden,Treasures:20" or
interaction.content contains "Amy Winehouse" or
interaction.content near "Michael,Buble,Christmas:20" or
interaction.content contains "Michael Buble" or
interaction.content near "Coldplay,Mylo,Xyloto:20" or
interaction.content contains "Coldplay" or
interaction.content near "Rihanna,Talk,That,Talk:20" or
interaction.content contains "Rihanna" or
interaction.content near "Olly,Murs,In,case,you,didn\'t,know:20" or
interaction.content contains "Olly Murs" or
interaction.content near "Adele,21:20" or
interaction.content contains "Adele" or
interaction.content near "Ed,Sheeran,+,(Plus):20" or
interaction.content contains "Ed Sheeran" or
interaction.content near "Little,Mix,Cannonball:20" or
interaction.content contains "Little Mix" or
interaction.content near "One,Direction,Up,All,Night:20" or
interaction.content contains "One Direction" or
interaction.content near "Rebecca,Ferguson,Heaven:20" or
interaction.content contains "Rebecca Ferguson"

The second block adds tags to that first stream, classifying the output objects as News, Audio, Video, or Image:

tag "News" { links.domain any "wordpress.com, blogspot.com, tumblr.com, typepad.com, livejournal.com, blogger.com" }

tag "Audio" { links.domain any "last.fm, open.spotify.com, twt.fm, hypem.com, soundcloud.com" or 
              links.url regex_partial "^.*\\\\.(mp3|MP3|wav|WAV|ogg|OGG|m4a|M4A|aac|AAC|m4p|M4P)$" }

tag "Video" { links.domain any "youtube.com, video.google.com, revver.com, vimeo.com, ustream.tv, qik.com" or
              links.url regex_partial "^.*\\\\.(avi|AVI|swf|SWF|mp4|MP4|mov|MOV|m4v|M4V)$" or
              links.url substr "application/x-shockwave-flash" }

tag "Image" { links.domain any "instagr.am, flickr.com, zooomr.com, twitpic.com, tinypic.com, twitgoo.com, mobypicture.com, tweetphoto.com, yfrog.com" or
              links.url regex_partial "^.*\\\\.(jpe?g|JPE?G|gif|GIF|png|PNG|tiff|TIFF)$" }

return {
    stream "0c0d42f8081db406e84032e8026bf99f" and
    links.url exists
}

How do I calculate the cost?

When you embed one stream inside another, the overall cost is simply the cost of the first stream plus the cost of the second stream. The stream and tag keywords don't add anything to the cost - they're free.
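To make the arithmetic concrete, here is a tiny Python sketch. The cost figures are invented purely for illustration and are not real DataSift prices; the function name combined_cost is hypothetical too.

```python
def combined_cost(embedded_costs):
    """Add up the costs of the embedded streams.

    The stream and tag keywords contribute nothing, so they do not appear here.
    """
    return sum(embedded_costs)


# Hypothetical costs for two streams, in whatever units your account is billed in.
first_stream_cost = 3
second_stream_cost = 5

total = combined_cost([first_stream_cost, second_stream_cost])
print(total)
```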

Take a look at Understanding Billing if you'd like to know more.

Want to learn more?

Visit our stream and tag pages.

