Introducing Tags - Categorise Data To Fit Your Model

Richard Caudle | 16th December 2013

The launch of DataSift VEDO introduced new features that allow you to add structure to social data. They let you attach custom metadata to social interactions, saving you post-processing work in your application.

In this post I’ll introduce you to tagging, and explain why this will make working with social data a whole load easier. Tagging allows you to categorise data to match your business model. Keep watching this space as over the next few posts I’ll cover all of the new features in detail.

What Are Tags?

Tags are a simple but powerful way to add custom metadata to social interactions before they are delivered to your application. Once the platform has filtered your sources of social data using your filter, you can use the same language (CSDL) to add tags and classify interactions, saving you post-processing effort.

A Quick Example - Categorising User Devices

Device identification is incredibly useful when you’re trying to analyse audiences and how they interact. Using CSDL I can identify the device used to create the content and tag that interaction appropriately.

Let’s look at two interactions, the first from Twitter and the second from Bit.ly.

{
  "interaction": {
    "content": "Check out my holiday pics, Venice was beautiful!",
    "source": "Twitter for iPhone"
  }
}

{
  "interaction": {
    "link": "http://www.italia.it/en/discover-italy/veneto/venice.html"
  },
  "bitly": {
    "user": {
      "agent": "BlackBerry9700/5.0.0.862 Profile/MIDP-2.1 Configuration/CLDC-1.1 VendorID/144"
    }
  }
}

For a Twitter interaction, the interaction.source target tells us which application was used to post the content, whereas for Bit.ly interactions the bitly.user.agent target (the user-agent string) gives us a detailed profile of the browser or device used to post the link.

Because different sources provide context information in a variety of formats and structures, writing application code to process this data is time-consuming. By using tags we can simplify this task hugely and use the full power of CSDL to carry out the work.
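To give a feel for the work that tags take off your hands, here is a minimal Python sketch of the per-source handling an application might otherwise need. The function name detect_devices and the label strings are illustrative choices of mine, not part of any DataSift library; the field names are taken from the two sample interactions above.

```python
def detect_devices(interaction):
    """Guess device labels from whichever field the source happens to provide."""
    # Twitter-style interactions carry a readable client name.
    source = interaction.get("interaction", {}).get("source", "")
    # Bit.ly-style interactions carry a raw user-agent string instead.
    agent = interaction.get("bitly", {}).get("user", {}).get("agent", "")

    # Combine both fields and ignore case so one set of checks covers each source.
    haystack = (source + " " + agent).lower()

    labels = []
    if "iphone" in haystack:
        labels.append("iPhone")
    if "blackberry" in haystack:
        labels.append("Blackberry")
    if "iphone" in haystack or "ipad" in haystack:
        labels.append("iOS")
    return labels


tweet = {
    "interaction": {
        "content": "Check out my holiday pics, Venice was beautiful!",
        "source": "Twitter for iPhone",
    }
}
click = {
    "interaction": {
        "link": "http://www.italia.it/en/discover-italy/veneto/venice.html"
    },
    "bitly": {
        "user": {
            "agent": "BlackBerry9700/5.0.0.862 Profile/MIDP-2.1 Configuration/CLDC-1.1 VendorID/144"
        }
    },
}

print(detect_devices(tweet))  # ['iPhone', 'iOS']
print(detect_devices(click))  # ['Blackberry']
```

Every new source or device means another branch like these in your own code, which is exactly the kind of logic that can live in the stream definition instead.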

I can use the tag keyword to add tags to my data above. Any interaction that matches the CSDL inside the curly brackets will be given the declared tag. The syntax for declaring a tag is:

tag "[tag name]" { // CSDL to match interactions }

Carrying on with my example, I'll create three tags to apply to my data:

tag "iPhone" { bitly.user.agent substr "iPhone" OR interaction.source contains "iPhone" }
tag "Blackberry" { bitly.user.agent substr "Blackberry" OR interaction.source contains "Blackberry" }
tag "iOS" { bitly.user.agent substr "iPhone" OR bitly.user.agent substr "iPad" OR interaction.source contains_any "iPhone,iPad" }

// Return any posts that mention Venice or link to a page about Venice
return { 
  interaction.content contains "Venice" OR links.title contains "Venice"
}

In this definition I’m tagging interactions based on the user-agent and source properties. (Including both iOS and iPhone might seem strange, but this demonstrates that you can add multiple tags to an interaction!)

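You’ll notice I’ve used the substr operator to inspect the user-agent field. These strings are often stripped of white space, so the device name runs straight into the characters around it. The substr operator matches a sequence of characters anywhere in the string, whereas contains looks for whole words.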

bitly.user.agent substr "Blackberry"

Will match the following:

BlackBerry9700/5.0.0.862 Profile/MIDP-2.1 Configuration/CLDC-1.1 VendorID/144  
BlackBerry8520/4.6.1.272 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/121

For the source property, on the other hand, contains works perfectly because these values have a cleaner format.

interaction.source contains "iPhone"

Will match:

Twitter for iPhone  
UberSocial for iPhone
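The difference between the two operators can be approximated in a few lines of Python. This is only a sketch: it assumes substr matches anywhere in the string, contains matches whole words, and both ignore case by default. The functions substr and contains below are hypothetical stand-ins of my own, not the platform's implementation.

```python
import re


def substr(value, needle):
    """Case-insensitive match anywhere in the string, even inside a longer token."""
    return needle.lower() in value.lower()


def contains(value, needle):
    """Case-insensitive match on whole words only (a rough stand-in for CSDL tokenisation)."""
    pattern = r"\b" + re.escape(needle) + r"\b"
    return re.search(pattern, value, flags=re.IGNORECASE) is not None


agent = "BlackBerry9700/5.0.0.862 Profile/MIDP-2.1 Configuration/CLDC-1.1 VendorID/144"
source = "Twitter for iPhone"

print(substr(agent, "Blackberry"))    # True: found inside the token "BlackBerry9700"
print(contains(agent, "Blackberry"))  # False: "BlackBerry9700" is not the whole word "Blackberry"
print(contains(source, "iPhone"))     # True: "iPhone" stands alone as a word
```

In this approximation the model number fused onto "BlackBerry" is what defeats a whole-word match, which is why the user-agent rules lean on substr.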

When the sample interactions pass through my definition the result will be:

{
   "interaction": {
      "content": "Check out my holiday pics, Venice was beautiful!",
      "source": "Twitter for iPhone",
      "tags": ["iPhone","iOS"]
   }
}

{
   "interaction": {
      "link": "http://www.italia.it/en/discover-italy/veneto/venice.html",
      "tags": ["Blackberry"]
   },
   "bitly": {
      "user": {
         "agent": "BlackBerry9700/5.0.0.862 Profile/MIDP-2.1 Configuration/CLDC-1.1 VendorID/144"
      }
   }
}

The first interaction has been given two tags because its source includes ‘iPhone’, which satisfies the rules for both the iPhone and iOS tags, whereas the second interaction only matches the Blackberry tag.

When this data arrives at my application it is decorated with clean metadata. I can inspect the tags array and easily apply business rules rather than having to perform text processing.
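As a rough illustration of what that looks like on the receiving side, the Python sketch below counts interactions per tag. It assumes each interaction arrives as a JSON string shaped like the output above, with the tags array under interaction.tags; the count_tags function and the delivered list are hypothetical names used only for this example.

```python
import json
from collections import Counter

# Two delivered interactions, shaped like the tagged output shown above.
delivered = [
    '{"interaction": {"content": "Check out my holiday pics, Venice was beautiful!",'
    ' "source": "Twitter for iPhone", "tags": ["iPhone", "iOS"]}}',
    '{"interaction": {"link": "http://www.italia.it/en/discover-italy/veneto/venice.html",'
    ' "tags": ["Blackberry"]},'
    ' "bitly": {"user": {"agent": "BlackBerry9700/5.0.0.862 Profile/MIDP-2.1'
    ' Configuration/CLDC-1.1 VendorID/144"}}}',
]


def count_tags(raw_interactions):
    """Count how many interactions carry each tag."""
    counts = Counter()
    for raw in raw_interactions:
        interaction = json.loads(raw)
        # Untagged interactions simply have no tags array, so default to an empty list.
        tags = interaction.get("interaction", {}).get("tags", [])
        counts.update(tags)
    return counts


tag_counts = count_tags(delivered)
for tag, count in sorted(tag_counts.items()):
    print(tag, count)
# Blackberry 1
# iOS 1
# iPhone 1
```

The application never looks at the source or user-agent strings here; it only reads the tags that were applied before delivery.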

Of course, I could extend my definition to cover many more data sources and devices, but regardless of the complexity, CSDL gives us the power to classify interactions and deliver structured data to applications.

And That’s Just The Start!

My example covered just one scenario where tagging can be extremely effective and efficient. Our latest release takes tagging to the next level, allowing you to tag and numerically score interactions, and to build reusable tag taxonomies fit for complex use cases. I’ll be explaining these new features in detail in my next few posts, so watch this space.

For inspiration check out our library of tag definitions. You can import these tags into your streams right now. And for full details of the features see our technical documentation.

