Introduction to Tagging

Ed Stenson | 11th November 2011

One of the advanced features of DataSift's Curated Stream Definition Language (CSDL) is tagging. The tag keyword lets you attach extra data to the objects coming from a stream, data which you can then use in subsequent analysis. A Tweet, for example, already carries a wide range of data. There is the 140-character message, of course, but there is a lot more inside. Every Tweet object contains the screen name of the author, how many people they follow, how many people follow them, and how many times the Tweet has been Retweeted. If the author sent the Tweet from a geo-enabled device with the geo functionality turned on, the object even includes the author's latitude and longitude. With the tag keyword, you can use CSDL to add your own labels on top of all this.

How are tags used?

Suppose a research bureau wants to write a filter that examines Tweets about the automotive sector. Just for illustration, let's look at pickup trucks, muscle cars, and economy cars, starting from a filter like this:

twitter.text contains_any "Ford, Chevy, Chevrolet, Dodge, Toyota"
and twitter.text contains_any "Silverado, Ram, F-150, F150, Camaro,
                               Mustang, Charger, Prius, Yaris"

It's simple to use tags to classify these vehicles. First, wrap the filter in a return statement like this:

return {
twitter.text contains_any "Ford, Chevy, Chevrolet, Dodge, Toyota"
and twitter.text contains_any "Silverado, Ram, F-150, F150, Camaro,
                               Mustang, Charger, Prius, Yaris"
}

Then add the tags above the return statement, one rule per vehicle class, like this:

tag "pickup" {twitter.text contains_any "Silverado, Ram, F-150, F150"}
tag "muscle" {twitter.text contains_any "Camaro, Mustang, Charger"}
tag "economy" {twitter.text contains_any "Prius, Yaris"}

return {
twitter.text contains_any "Ford, Chevy, Chevrolet, Dodge, Toyota"
and twitter.text contains_any "Silverado, Ram, F-150, F150, Camaro,
                               Mustang, Charger, Prius, Yaris"
}

What makes tags so useful?

Now, when these objects flow out of DataSift's API into the analyst's client software, each one carries the tags whose rules it matched, so it's easy to pick out what they need. Want to check comments on trucks today? Just select the objects tagged "pickup".
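As a rough sketch of what that client-side step could look like, the Python below picks out the objects carrying a given tag. It assumes each object has already been decoded from JSON into a dictionary and that the tags arrive as a list of strings under interaction.tags; check the objects your own stream delivers before relying on that layout. The helper names (get_tags, with_tag) and the sample objects are invented for illustration and are not part of any DataSift client library.

```python
def get_tags(obj):
    """Return the tags on one stream object.

    Assumption: tags are a list of strings under interaction.tags.
    """
    return obj.get("interaction", {}).get("tags", [])


def with_tag(objects, tag):
    """Yield only the objects that carry the given tag."""
    for obj in objects:
        if tag in get_tags(obj):
            yield obj


# Invented sample objects, for illustration only.
sample = [
    {"interaction": {"tags": ["pickup"]},
     "twitter": {"text": "Test drove the new Ford F-150 today"}},
    {"interaction": {"tags": ["muscle"]},
     "twitter": {"text": "That Dodge Charger sounds amazing"}},
    {"interaction": {"tags": ["pickup", "muscle"]},
     "twitter": {"text": "Chevy Silverado or Camaro?"}},
]

trucks = list(with_tag(sample, "pickup"))
for obj in trucks:
    print(obj["twitter"]["text"])
```

Because a Tweet can match more than one tag rule, the same object may show up under several tags, as the third sample object does here.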

There are plenty of other ways in which tagging can add value. The most immediate is when you watch a stream coming in through our user interface at datasift.com. The tags are appended to the stream objects right away, and you can see them when you run the stream via the UI.

Another great way to use tags is to combine two streams into one. Want to monitor comments on Starbucks and Nike at the same time? Write a filter like this that does both jobs at once. The results arrive jumbled together, but the tags make it easy for you to separate them again.

tag "starbucks" {twitter.text contains "starbucks"}
tag "nike" {twitter.text contains "nike"}

return {
twitter.text contains_any "Starbucks, Nike"
}
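Tidying up on the client side might look something like the following Python sketch, which splits one combined stream into a separate list per tag. It assumes the objects have been decoded from JSON into dictionaries and that tags sit in a list under interaction.tags. The split_by_tag function and the sample data are hypothetical, made up for this example.

```python
from collections import defaultdict


def split_by_tag(objects):
    """Group stream objects into one list per tag."""
    buckets = defaultdict(list)
    for obj in objects:
        # Assumption: tags arrive as a list of strings under interaction.tags.
        for tag in obj.get("interaction", {}).get("tags", []):
            buckets[tag].append(obj)
    return dict(buckets)


# Invented sample objects, for illustration only.
stream = [
    {"interaction": {"tags": ["starbucks"]},
     "twitter": {"text": "Morning Starbucks run"}},
    {"interaction": {"tags": ["nike"]},
     "twitter": {"text": "New Nike trainers arrived"}},
    {"interaction": {"tags": ["starbucks", "nike"]},
     "twitter": {"text": "Starbucks in hand, Nike on feet"}},
]

buckets = split_by_tag(stream)
for tag, objs in sorted(buckets.items()):
    print(tag, len(objs))
```

A Tweet that mentions both brands lands in both lists, which is usually what you want when each brand is analyzed separately.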

Summing it all up

Where stream tagging really shines is in the post-processing stage of your data analysis. Just as augmentations (the link analyzer, for instance) enrich objects based on information supplied by third-party sites, tags enrich objects according to rules you define in your CSDL code. You can use these tags to extend filtering capabilities, split up your final results, or help compile statistics about your stream.
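To give a flavor of the statistics side, here is a minimal Python sketch that counts how often each tag appears in a batch of objects. As before, the interaction.tags location is an assumption about the delivered objects, and tag_counts and the sample batch are hypothetical names and data invented for this example.

```python
from collections import Counter


def tag_counts(objects):
    """Count how many objects carry each tag."""
    counts = Counter()
    for obj in objects:
        # Assumption: tags arrive as a list of strings under interaction.tags.
        counts.update(obj.get("interaction", {}).get("tags", []))
    return counts


# Invented sample batch, for illustration only.
batch = [
    {"interaction": {"tags": ["pickup"]}},
    {"interaction": {"tags": ["muscle"]}},
    {"interaction": {"tags": ["pickup", "muscle"]}},
    {"interaction": {"tags": ["economy"]}},
]

counts = tag_counts(batch)
for tag, n in counts.most_common():
    print(tag, n)
```

Note that the counts can add up to more than the number of objects, since one object may carry several tags.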

