Richard Caudle's picture

Build Better Social Solutions Faster with the DataSift Library

The launch of DataSift VEDO introduced new features to allow you to add structure to social data. Alongside VEDO we also introduced the DataSift library - a great new resource to inspire you and make your life easier. Benefit from our experience and build better solutions faster.

What? Why? Where?

We've introduced the DataSift library to help you to benefit from our experience working with our customers. CSDL (our filtering and tagging language) is extremely powerful, but it might not be clear exactly what can be achieved. Using the library we'll share with you definitions we've written for real-world solutions so you can learn quicker and get the most from our platform.

Currently the library contains tagging and scoring definitions that demonstrate the power of VEDO. There are out-of-the-box components you can use straight away, and example rules you can take and run with.

You'll find the Library as a new tab on the Streams page.



Supported Out-Of-The-Box Components

Items marked as 'supported' in the library are definitions you can count on us to maintain. You can confidently use these as part of your solution immediately.

You can also use these definitions as a base to start from. You can copy the definitions into your account and modify the rules to fit your use case. After all, 'spam' for one use case can be gold for another!

Supported items include:

  • Competitions & Marketing: Scores content to say how likely it is to be noise from competitions and marketing campaigns.
  • Twitter Application Used: Identifies and categorises the source application used to create a tweet - great for picking out content from services, real users and bots.
  • Professions & Roles: Where possible, identifies a user's profession and seniority based on their profile description.


Real-World Example Solutions

Items marked as 'example' in the library are definitions which we've built that will help you learn from real-world samples. You can run these examples directly, but we envisage you using these definitions as starting points and modifying or extending them to fit your solution.

Example items include:


Using a Library Item

It's easy to make use of a library item. You can either import an item into one of your streams, or copy an item into your account and modify it to your heart's content.

Note that all of the library items are currently tagging and scoring rules. You'll need to use them with a return statement. For more details please see our technical documentation.

Importing a definition

At the bottom of the page for each library item you'll find a tab labelled Usage Examples. This tab shows you example code which you can copy and paste into a new stream and run a live preview.


The key here is the tags keyword and hash for the stream. You can copy and paste this line into any of your streams to import the tagging rules.

Copying a definition

On each library item page there is a snippet of code that shows you all or part of the definition. You can click the Copy to new stream button to copy the entire definition to your account. You can then inspect and modify the code as you see fit.



More To Follow

We'll work hard on adding more and more items to the library so it becomes an extremely valuable resource. Keep an eye out for new items and please watch this blog for further news.

At this point let me encourage you to sign up if you're not already a DataSift user, and then jump straight into the library and see how it can make your life easier!


Build Reusable Tagging And Scoring Rules To Use Across Your Projects

The launch of DataSift VEDO introduced new features to allow you to add structure to social data. In my last few posts I introduced you to tags, tag namespaces and scoring and explained how you can use these features to classify data before it reaches your application.

In this post I will show you how once you’ve spent time building these rules, you can reuse them across many projects, getting maximum value for your hard work.

Creating A Reusable Tag Definition

On our platform a ‘tag definition’ is a stream you define which contains only tag rules, and no return statement.

To be clear, until now you may have used tags and a filter (as a return statement) together in one stream for example:

You can break out your tags into a reusable tag definition by saving just the tags in a new stream without the return statement:

When you save the stream (or compile using the API) you will be given a hash which represents this definition.

Using The Definition In A Stream

Now that we have a hash for the tag definition, we can make use of it in another stream using the tags keyword.

This will import the tag rules just as if they were written in place in the same stream. The tags will be applied as appropriate to interactions that match the filter in the return statement.

Using this simple but powerful feature you can create a library of valuable tag & scoring rules for your business model and reuse the same definitions across any number of streams.

Applying A Namespace

This is already a great feature, but things get even better when we throw in a namespace. Imagine you have a great set of tag rules you like to reuse often in your streams. You might want to organise your namespaces differently depending on the exact stream.

When you import a set of tags you can wrap them within a namespace:

Now the tags will be applied as before, but they will sit within a top-level namespace of user.

In fact, in practice I’ve found that this helps keep tag rules much more concise. You can declare tag rules with a shallow namespace, but when you import them you can wrap them in namespaces to build a very rich taxonomy.

Something To Remember

It’s important to note that the hash for your tag definition will change if you update the tag definition itself. If you’ve used the stream keyword before you’ll be familiar with this concept.

This makes sense when you consume streams via the API, as it allows you to make changes to definitions on the fly and switch to new definitions when suitable for your application.

Just remember that if you change a reusable tag definition, you must take the new hash and update every stream that imports the definition.
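If you keep your stream definitions as text in your own tooling, swapping the hash can be automated. The following is a minimal Python sketch of that idea, not a DataSift API call: the helper name, the stream names, the stream text and both hashes are made-up placeholders.

```python
# Illustrative only: the hashes and stream text below are made-up placeholders.
def update_tag_hash(stream_csdl, old_hash, new_hash):
    """Return the stream definition with the old definition hash swapped for the new one."""
    return stream_csdl.replace(old_hash, new_hash)

OLD_HASH = "0" * 32  # placeholder for the hash before the tag definition changed
NEW_HASH = "1" * 32  # placeholder for the hash returned after saving the change

streams = {
    # This stream imports the reusable tag definition.
    "brand_monitoring": f'tags "{OLD_HASH}"\nreturn {{ interaction.content contains "coffee" }}',
    # This stream does not import it, so it must be left untouched.
    "support_queue": 'return { interaction.content contains "help" }',
}

updated = {name: update_tag_hash(csdl, OLD_HASH, NEW_HASH) for name, csdl in streams.items()}

# Any stream still carrying the old hash would keep using the old tag rules.
stale = [name for name, csdl in updated.items() if OLD_HASH in csdl]
print(stale)  # []
```

Each updated stream would then need to be saved or compiled again so that it picks up the new rules.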

Let’s Not Stop There Though…

Reusable tag definitions are super powerful because they allow you to build up a library of rules which you can use across projects and throughout your organisation.

For example, you could build the following reusable definitions:

  • A spam classifier tailored perfectly to your use case
  • A rich taxonomy to exactly fit your business model or industry
  • An influencer model to use across many projects

To give you a head start we’ve also released our own library of reusable definitions. In minutes you can benefit from our hard work!

For full details of all the features see our technical documentation.

This post concludes my series on our new tagging and scoring features. Don’t go away though as there are many more features I’ll cover in the coming weeks, and I’ll also take you through some much richer real-world examples.


Introducing Scoring - Attach Confidence, Quality And Rank To Your Social Data

The launch of DataSift VEDO introduced new features to allow you to add structure to social data. In my last posts I introduced you to tags and tag namespaces and explained how you can use these features to categorise data before it reaches your application.

Alongside tagging our latest release also introduces scoring, which I’ll cover today. Scoring builds upon tagging by allowing you to attach relative numeric values to interactions. Scoring allows you to reflect the subtleties of quality, priority and confidence, opening up a whole raft of new use cases for your social data.

What Is Scoring?

Whereas tags allow you to attach a metadata label, based on a boolean condition, scoring allows you to build up a numerical score for an interaction over a set of rules. Tags are perfect for classifying into a taxonomy, whereas scoring gives you the power to rate or qualify interactions on a relative scale. Just like tagging, you can apply scores using the full power of CSDL, and save yourself post-processing in your application.

An Example - Identifying Bots

Often our customers look to identify spam in their data to improve analysis. What is considered to be spam is extremely dependent on the use case, and even on the specific question being asked. Here I’m going to rate tweets for how likely they are to be from a bot.

Using scoring I can give each tweet a relative confidence level for suspected bot content. Rather than just tagging a tweet, scoring allows an end application to filter further on the relative score.

Let’s look at two (stripped down) tweets, the first likely to be a real person, and the second a likely bot:


There are a number of useful metadata properties I can work with to build my rules:

  • interaction.source - For Twitter this is the application that made the post. Here we can look for well-known bot services.
  • twitter.user.created_at - When the profile was created. A recently created profile is likely to be a bot. (Note that the platform calculates a new target called profile_age inferred from the created_at date, which makes our lives easier in CSDL.)
  • twitter.user.statuses_count - How many statuses the user has posted. A bot account is likely to have been created recently and sent few messages.
  • twitter.user.followers_count - The number of users that follow this user. A bot account will often follow many people but not be followed by many users itself.
  • twitter.user.friends_count - The number of people the user follows. (Again the platform gives us extra help by using followers_count and friends_count to calculate a new property called follower_ratio.)
  • twitter.user.verified - A value of 1 tells us that the account is verified by Twitter, so definitely not a bot!

Any one of these qualities hints that the content could be from a bot. The beauty of scoring though is that if we see more than one of these qualities then we can show that we’re more and more confident the tweet is from a bot.


Scoring allows us to build up a final score over multiple rules. For every rule that matches, that rule's score is added to or subtracted from the total.

You use the tag keyword to declare scoring rules, with the following syntax:

tag.[namespace].[tag name] +/-[score]  { // CSDL to match interactions }

Any interaction that matches the CSDL in the braces will have the declared score added or subtracted.

Let’s carry on the example to make this a bit clearer. Notice how I have multiple rules which contribute different scores to the same tag:

When the rules are applied to the two tweets the output is:


For the first interaction the total is 10 as it only matches this rule relating to the number of statuses posted:

tag.quality.suspected_bot +10 { twitter.user.statuses_count < 25 }

For the second interaction the total is 80 because the following rules all match:

tag.quality.suspected_bot +25 { twitter.user.profile_age < 5 }
tag.quality.suspected_bot +10 { twitter.user.statuses_count < 25 }
tag.quality.suspected_bot +25 { twitter.user.follower_ratio < 0.01 }
tag.quality.suspected_bot +20 { interaction.source contains_any ",EasyBotter,," }

So our relative scores tell us the first interaction is unlikely to be from a bot, whereas the second is highly likely to be so.
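To make the arithmetic concrete, here is a minimal Python sketch that mimics how the four rules add up. It is not how the platform evaluates CSDL. The dictionary keys and the sample values are assumptions chosen to reproduce the totals of 10 and 80, and the bot-service check is reduced to the single name that appears in the rule.

```python
# Illustrative only: mimics additive scoring in plain Python, not CSDL.
# Each rule is (score, predicate); field names and sample values are assumptions.
RULES = [
    (25, lambda u: u["profile_age"] < 5),
    (10, lambda u: u["statuses_count"] < 25),
    (25, lambda u: u["follower_ratio"] < 0.01),
    (20, lambda u: "easybotter" in u["source"].lower()),
]

def suspected_bot_score(user):
    """Sum the scores of every rule whose predicate matches."""
    return sum(score for score, matches in RULES if matches(user))

# Hypothetical stand-ins for the two tweets discussed above.
likely_person = {"profile_age": 900, "statuses_count": 12,
                 "follower_ratio": 1.2, "source": "Twitter for iPhone"}
likely_bot = {"profile_age": 2, "statuses_count": 8,
              "follower_ratio": 0.005, "source": "EasyBotter"}

print(suspected_bot_score(likely_person))  # 10
print(suspected_bot_score(likely_bot))     # 80
```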


When the data reaches my application I can choose how strict I’d like to be about spam. I might say the score must be 0 or below to be absolutely sure the content is not from a bot, or I might choose a threshold of, say, 25 to also include content that is unlikely to be from a bot. The power is given to the end application or user to make the choice using a simple scale.
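On the application side, that choice can be a single comparison. The sketch below is a hedged Python illustration: it assumes each delivered interaction is a dictionary with the score at interaction.tag_tree.quality.suspected_bot, and that an interaction with no score counts as 0. Both are assumptions about the delivered JSON, and the function names are hypothetical.

```python
def bot_score(item):
    """Read the suspected_bot score from the tag tree; a missing score counts as 0 (assumption)."""
    tag_tree = item.get("interaction", {}).get("tag_tree", {})
    return tag_tree.get("quality", {}).get("suspected_bot", 0)

def keep(item, max_score):
    """Keep an interaction only if its score is at or below the chosen threshold."""
    return bot_score(item) <= max_score

# Hypothetical delivered interactions, trimmed to the part that matters here.
received = [
    {"interaction": {"tag_tree": {"quality": {"suspected_bot": 10}}}},
    {"interaction": {"tag_tree": {"quality": {"suspected_bot": 80}}}},
    {"interaction": {}},  # no scoring rule matched
]

strict = [item for item in received if keep(item, max_score=0)]
lenient = [item for item in received if keep(item, max_score=25)]
print(len(strict), len(lenient))  # 1 2
```

Because the threshold is an argument rather than part of the CSDL, the same stream can serve a strict dashboard and a lenient one without recompiling anything.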

But Wait, There’s More...

My example covered just one scenario where scoring can be helpful, but there are many more. For example:

  • Measuring purchase intent, rating leads
  • Measuring sentiment and emotion
  • Rating content authors’ authority
  • Identifying influencers

For inspiration check out our library of tag definitions. For full details of the features see our technical documentation.

In my next post I’ll be explaining how you can reuse all the hard work you’re putting into your tagging and scoring rules by importing these definitions into any number of streams.


Keeping Tags Organised Using Namespaces

The launch of DataSift VEDO introduced new features to allow you to add structure to social data. In my last post I introduced you to tagging and explained how you can use the feature to categorise data before it reaches your application.

In this post I'll introduce you to tag namespaces, a simple and elegant way to organise tag sets. Many of our customers have built tag sets containing hundreds of tags for their projects. As you use tagging more and more you'll find that namespaces are a great way to keep your tags clean and structured.

What Is A Tag Namespace?

In my last post I showed you how to declare a tag, for example:

tag "iPhone" { bitly.user.agent substr "iPhone" OR interaction.source contains "iPhone" }

This is a great start, but as our customers have increasingly adopted tagging they’ve ended up with hundreds of tags without any structure.

Say I have some tags which identify a user’s device, and alongside them I have tags which identify companies. It would be great if I could break any matching tags into groups. Just as in a well-organised codebase, where you use namespaces to organise classes by function or business model, you can do exactly the same with tags by using namespaces.

How To Add A Namespace...

You can add a namespace to a tag using this syntax:

tag.[namespace] "[tag name]" { // CSDL filter }

For example:

tag.device "iPhone" { interaction.source contains "iPhone" }

That’s one level of namespace, but why stop there? You can nest namespaces by adding further dot-separated levels:

tag.[namespace].[namespace] "iPhone" { interaction.source contains "iPhone" }

I’m sure you get the idea. For most use cases 2 or 3 levels will no doubt do the trick.

In The Real World...

So maybe I’d like to track conversations around some companies. For a simple example, I can use their stock symbols. When I get this data in my application, though, it would be great if the companies were grouped for me by index.

This is really easy with tag namespaces:

If somebody sends a tweet from their iPhone about Nike, this is what I'll receive in my app:

When I receive this data it’s nice and easy to look into the tag tree and split my data into buckets, or run logic as necessary.
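As a rough illustration of looking into the tag tree, the Python sketch below flattens a nested tree into dotted namespace paths so each path can be used as a bucket. The shape of the sample tree (namespaces as nested objects, tag names as lists at the leaves) and the namespace names "device", "company" and "nyse" are assumptions for the example rather than the exact output of the platform.

```python
def flatten_tag_tree(tag_tree, prefix=""):
    """Yield (dotted namespace path, leaf value) pairs from a nested tag tree."""
    for key, value in tag_tree.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # A nested object is a deeper namespace level, so recurse into it.
            yield from flatten_tag_tree(value, path)
        else:
            yield path, value

# Hypothetical tag tree for a tweet sent from an iPhone about Nike.
sample_tag_tree = {
    "device": ["iPhone"],
    "company": {"nyse": ["Nike"]},
}

buckets = dict(flatten_tag_tree(sample_tag_tree))
print(buckets)  # {'device': ['iPhone'], 'company.nyse': ['Nike']}
```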

Tag_tree vs Tags

If you used tags before you’ll notice that instead of the tags being output in the ‘tags’ property of interaction, they now appear in the ‘tag_tree’ property. This allows us to keep backward compatibility for existing customers using tags without namespaces.

See our docs for a full explanation of this change.

And There’s More!

So we’ve covered a quick example, but of course you can take things much further. You can use namespaces to build rich deep taxonomies to cover your business model. For inspiration check out our library of tag definitions. You can import these tags into your streams right now.

For full details of the features see our technical documentation.

Next time I’ll be looking at scoring, a way to give relative numerical values to interactions. This is a great feature for modelling priorities and confidence scoring cases.

If you’re new to DataSift, what’s stopping you? Register now and feast on the world of social data.


Introducing Tags - Categorise Data To Fit Your Model

The launch of DataSift VEDO introduced new features to allow you to add structure to social data. These new features allow you to add custom metadata to social interactions, saving you post-processing work in your application.

In this post I’ll introduce you to tagging, and explain why this will make working with social data a whole load easier. Tagging allows you to categorise data to match your business model. Keep watching this space as over the next few posts I’ll cover all of the new features in detail.

What Are Tags?

Tags are a simple but powerful way to add custom metadata to social interactions before they are delivered to your application. Once the platform has filtered your sources of social data using your filter, you can use the same language (CSDL) to add tags and classify interactions, saving you post-processing effort.

A Quick Example - Categorising User Devices

Device identification is incredibly useful when you’re trying to analyse audiences and how they interact. Using CSDL I can identify the device used to create the content and tag that interaction appropriately.

Let’s look at two interactions, the first from Twitter and the second from Bitly:

For a Twitter interaction, the interaction.source target tells us which application was used to post the content. For Bitly interactions, the bitly.user.agent target (the user-agent string) gives us a detailed profile of the browser or device used to post the link.

As different sources provide context information in a variety of formats and in different structures, writing application code to process this data is time consuming. By using tags we can simplify this task hugely and use the full power of CSDL to carry out this work.


I can use the tag keyword to add tags to my data above. Any interaction that matches the CSDL in the braces will be given the declared tag. The syntax for declaring a tag is:

tag "[tag name]" { // CSDL to match interactions }


Carrying on my example I'll create three tags to apply to my data:

In this definition I’m tagging interactions based on the user-agent and source properties. (Including both iOS and iPhone might seem strange, but this demonstrates that you can add multiple tags to an interaction!)


You’ll notice I’ve used the substr operator to inspect the user-agent field as often these strings are stripped of white space. 

bitly.user.agent substr "Blackberry"

Will match the following:

BlackBerry9700/ Profile/MIDP-2.1 Configuration/CLDC-1.1 VendorID/144
BlackBerry8520/ Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/121


For the source property, the contains operator works perfectly because these values have a cleaner format.

interaction.source contains "iPhone"

Will match:

Twitter for iPhone
UberSocial for iPhone
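The difference between the two operators can be approximated in a few lines of Python. This is only a rough emulation under stated assumptions, not the platform's matching engine: it assumes substr is a case-insensitive substring test and contains is a case-insensitive whole-word test, and the function names simply borrow the operator names.

```python
import re

def substr(value, needle):
    """Approximate CSDL substr: case-insensitive substring match (assumption)."""
    return needle.lower() in value.lower()

def contains(value, needle):
    """Approximate CSDL contains: case-insensitive whole-word match (assumption)."""
    pattern = r"\b" + re.escape(needle) + r"\b"
    return re.search(pattern, value, flags=re.IGNORECASE) is not None

user_agent = "BlackBerry9700/ Profile/MIDP-2.1 Configuration/CLDC-1.1 VendorID/144"
source = "UberSocial for iPhone"

print(substr(user_agent, "Blackberry"))    # True: the characters appear inside "BlackBerry9700/"
print(contains(user_agent, "Blackberry"))  # False: "BlackBerry9700" is not the standalone word
print(contains(source, "iPhone"))          # True: "iPhone" is a separate word
```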


When the sample interactions pass through my definition the result will be:

The first interaction has been given two tags because ‘iPhone’ is included in both the iPhone and iOS tags, whereas the second interaction only matches the Blackberry tag.

When this data arrives at my application it is decorated with clean metadata. I can inspect the tags array and easily apply business rules rather than have to perform text processing.
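For instance, counting interactions per device becomes a simple tally over the tags array. The Python sketch below assumes each delivered interaction is a dictionary with a list of tag names at interaction.tags. That shape, the function name and the sample data are assumptions for illustration.

```python
from collections import Counter

def count_tags(interactions):
    """Tally how many interactions carry each tag; untagged interactions are skipped."""
    counts = Counter()
    for item in interactions:
        counts.update(item.get("interaction", {}).get("tags", []))
    return counts

# Hypothetical delivered interactions mirroring the two samples discussed above.
delivered = [
    {"interaction": {"tags": ["iPhone", "iOS"]}},
    {"interaction": {"tags": ["Blackberry"]}},
]

device_counts = count_tags(delivered)
print(device_counts["iOS"], device_counts["Blackberry"])  # 1 1
```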

Of course, I could extend my definition to cover many more data sources and devices, but regardless of the complexity CSDL gives us the power to classify interactions and deliver structured data to applications.

And That’s Just The Start!

My example covered just one scenario where tagging can be extremely effective and efficient. Our latest release takes tagging to the next level, allowing you to tag and numerically score interactions, and to build reusable tag taxonomies fit for complex use cases. I’ll be explaining these new features in detail in my next few posts, so watch this space.

For inspiration check out our library of tag definitions. You can import these tags into your streams right now. And for full details of the features see our technical documentation.

If you’re new to DataSift, what’s stopping you? Register now and experience the power of our platform for yourself!

