Blog

Richard Caudle
Updated on Wednesday, 18 December, 2013 - 09:20

The launch of DataSift VEDO introduced new features to allow you to add structure to social data. In my last posts I introduced you to tags and tag namespaces and explained how you can use these features to categorise data before it reaches your application.

Alongside tagging, our latest release also introduces scoring, which I’ll cover today. Scoring builds upon tagging by allowing you to attach relative numeric values to interactions. It lets you reflect the subtleties of quality, priority and confidence, opening up a whole raft of new use cases for your social data.

What Is Scoring?

Whereas tags allow you to attach a metadata label, based on a boolean condition, scoring allows you to build up a numerical score for an interaction over a set of rules. Tags are perfect for classifying into a taxonomy, whereas scoring gives you the power to rate or qualify interactions on a relative scale. Just like tagging, you can apply scores using the full power of CSDL, and save yourself post-processing in your application.

An Example - Identifying Bots

Often our customers look to identify spam in their data to improve analysis. What is considered to be spam is extremely dependent on the use case, and even on the specific question being asked. Here I’m going to rate tweets for how likely they are to be from a bot.

Using scoring I can give each tweet a relative confidence level for suspected bot content. Rather than just tagging a tweet, scoring allows an end application to filter further on the relative score.

Let’s look at two (stripped down) tweets, the first likely to be a real person, and the second a likely bot:

 

There are a number of useful metadata properties I can work with to build my rules:

  • interaction.source - For Twitter this is the application that made the post. Here we can look for well-known bot services.
  • twitter.user.created_at - When the profile was created. A recently created profile is likely to be a bot. (Note that the platform calculates a new target called profile_age inferred from the created_at date, which makes our lives easier in CSDL.)
  • twitter.user.statuses_count - How many statuses the user has posted. A bot account is likely to have been created recently and sent few messages.
  • twitter.user.followers_count - The number of users that follow this user. A bot account will often follow many people but not be followed by many users itself.
  • twitter.user.friends_count - The number of people the user follows. (Again the platform gives us extra help by using followers_count and friends_count to calculate a new property called follower_ratio.)
  • twitter.user.verified - A value of 1 tells us that the account is verified by Twitter, so definitely not a bot!
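
The two derived targets mentioned in the list can be approximated in a few lines. This is a minimal sketch, assuming profile_age is the account age in whole days and follower_ratio is followers divided by friends; the function names and the handling of accounts that follow nobody are illustrative assumptions, not the platform's implementation.

```python
from datetime import datetime, timezone


def profile_age_days(created_at: datetime, now: datetime) -> int:
    """Whole days between profile creation and `now` (assumed definition)."""
    return (now - created_at).days


def follower_ratio(followers_count: int, friends_count: int) -> float:
    """Followers divided by friends (assumed definition).

    An account that follows nobody has no meaningful ratio, so this sketch
    returns 0.0 when it also has no followers and infinity otherwise.
    """
    if friends_count == 0:
        return 0.0 if followers_count == 0 else float("inf")
    return followers_count / friends_count


# Hypothetical account: created two days ago, 3 followers, follows 1500.
now = datetime(2013, 12, 18, tzinfo=timezone.utc)
created = datetime(2013, 12, 16, tzinfo=timezone.utc)
print(profile_age_days(created, now))  # 2
print(follower_ratio(3, 1500))         # 0.002
```

An account like this one would fall under both a "profile_age < 5" and a "follower_ratio < 0.01" condition.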

Any one of these qualities hints that the content could be from a bot. The beauty of scoring though is that if we see more than one of these qualities then we can show that we’re more and more confident the tweet is from a bot.

 

Scoring allows us to build up a final score over multiple rules. For every rule that matches, the rule's score is added or subtracted.

You use the tag keyword to declare scoring rules, with the following syntax:

tag.[namespace].[tag name] +/-[score]  { // CSDL to match interactions }

Any interaction that matches the CSDL in the braces will have the declared score added or subtracted.

Let’s carry on the example to make this a bit clearer. Notice how I have multiple rules which contribute different scores to the same tag:

When the rules are applied to the two tweets the output is:

 

For the first interaction the total is 10 as it only matches this rule relating to the number of statuses posted:

tag.quality.suspected_bot +10 { twitter.user.statuses_count < 25 }

For the second interaction the total is 80 because the following rules all match:

tag.quality.suspected_bot +25 { twitter.user.profile_age < 5 }
tag.quality.suspected_bot +10 { twitter.user.statuses_count < 25 }
tag.quality.suspected_bot +25 { twitter.user.follower_ratio < 0.01 }
tag.quality.suspected_bot +20 { interaction.source contains_any "twittbot.net,EasyBotter,hellotxt.com,dlvr.it" }

So our relative scores tell us the first interaction is unlikely to be from a bot, whereas the second is highly likely to be so.
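
The arithmetic behind those totals can be checked with a small simulation. This is only a sketch of the additive model, not how the platform evaluates CSDL: the two dictionaries are hypothetical stand-ins for the sample tweets, and the source check uses a plain substring match as a simplification of contains_any.

```python
# Each rule is (score, predicate), mirroring the four CSDL rules quoted above.
BOT_SOURCES = ("twittbot.net", "easybotter", "hellotxt.com", "dlvr.it")

RULES = [
    (25, lambda i: i["profile_age"] < 5),
    (10, lambda i: i["statuses_count"] < 25),
    (25, lambda i: i["follower_ratio"] < 0.01),
    # Simplification: substring match instead of CSDL's contains_any.
    (20, lambda i: any(s in i["source"].lower() for s in BOT_SOURCES)),
]


def suspected_bot_score(interaction: dict) -> int:
    """Sum the scores of every rule the interaction matches."""
    return sum(score for score, matches in RULES if matches(interaction))


# Hypothetical values standing in for the two sample tweets.
likely_person = {"profile_age": 900, "statuses_count": 12,
                 "follower_ratio": 0.8, "source": "Twitter for iPhone"}
likely_bot = {"profile_age": 2, "statuses_count": 4,
              "follower_ratio": 0.002, "source": "twittbot.net"}

print(suspected_bot_score(likely_person))  # 10
print(suspected_bot_score(likely_bot))     # 80
```

Only the statuses rule fires for the first record (10), while all four fire for the second (25 + 10 + 25 + 20 = 80).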

 

When the data reaches my application I can choose how strict I’d like to be about spam. I might say the score must be 0 or below to be absolutely sure the content is not from a bot, or I might choose a threshold of, say, 25 to also include content that is unlikely to be from a bot. The power is given to the end application or user to make the choice using a simple scale.
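
On the application side, that choice comes down to a threshold check. The sketch below assumes the score arrives nested under interaction.tag_tree.quality.suspected_bot and that an interaction with no matching rule carries no score at all; both the shape and the treat-missing-as-zero rule are assumptions made for illustration.

```python
def bot_score(item: dict) -> int:
    """Read the score, treating a missing tag as 0 (assumption)."""
    return (item.get("interaction", {})
                .get("tag_tree", {})
                .get("quality", {})
                .get("suspected_bot", 0))


def keep(item: dict, max_score: int) -> bool:
    """Keep an interaction only if its bot score is at or below the limit."""
    return bot_score(item) <= max_score


# Hypothetical delivered interactions.
stream = [
    {"interaction": {"id": "a", "tag_tree": {"quality": {"suspected_bot": 10}}}},
    {"interaction": {"id": "b", "tag_tree": {"quality": {"suspected_bot": 80}}}},
    {"interaction": {"id": "c"}},  # no scoring rule matched
]

strict = [i["interaction"]["id"] for i in stream if keep(i, 0)]
relaxed = [i["interaction"]["id"] for i in stream if keep(i, 25)]
print(strict)   # ['c']
print(relaxed)  # ['a', 'c']
```

Because the threshold is just a parameter, the same stream can serve a strict report and a more relaxed one without changing the scoring rules.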

But Wait, There’s More...

My example covered just one scenario where scoring can be helpful, but there are many more. For example:

  • Measuring purchase intent, rating leads
  • Measuring sentiment and emotion
  • Rating content authors’ authority
  • Identifying influencers

For inspiration check out our library of tag definitions. For full details of the features see our technical documentation.

In my next post I’ll be explaining how you can reuse all the hard work you’re putting into your tagging and scoring rules by importing these definitions into any number of streams.

Richard Caudle
Updated on Tuesday, 17 December, 2013 - 09:56

The launch of DataSift VEDO introduced new features to allow you to add structure to social data. In my last post I introduced you to tagging and explained how you can use the feature to categorise data before it reaches your application.

In this post I'll introduce you to tag namespaces, a simple and elegant way to organise tag sets. Many of our customers have built tag sets containing hundreds of tags for their project. As you use tagging more and more you'll find that namespaces are a great way to keep your tags clean and structured.

What Is A Tag Namespace?

In my last post I showed you how to declare a tag, for example:

tag "iPhone" { bitly.user.agent substr "iPhone" OR interaction.source contains "iPhone" }

This is a great start, but as our customers have increasingly adopted tagging they’ve ended up with hundreds of tags without any structure.

Say I have some tags which identify a user’s device, and alongside them tags which identify companies. It would be great if I could break any matching tags into groups. Just as in a well organised codebase, where you use namespaces to organise classes by function or business model, you can do exactly the same with tags by using namespaces.

How To Add A Namespace...

You can add a namespace to a tag using this syntax:

tag.[namespace] "[tag name]" { // CSDL filter }

For example:

tag.device "iPhone" { interaction.source contains "iPhone" }

That’s one level of namespace, but why stop there?

tag.device.model.apple "iPhone" { interaction.source contains "iPhone" }

I’m sure you get the idea. For most use cases 2 or 3 levels will no doubt do the trick.

In The Real World...

Say I’d like to track conversations around some companies. For a simple example I can use their stock symbols. When I get this data in my application, though, it would be great if the companies were grouped for me by index.

This is really easy with tag namespaces:

If somebody sends a tweet from their iPhone about Nike, this is what I'll receive in my app:

When I receive this data it’s nice and easy to look into the tag tree and split my data into buckets, or run logic as necessary.
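
Splitting into buckets can be as simple as walking the nested structure. The following is a sketch only: the tag_tree shape and the namespace names (device, company.nyse) are hypothetical, chosen to match the iPhone and Nike scenario described above.

```python
from collections import defaultdict


def flatten(tag_tree: dict, prefix: str = ""):
    """Yield (namespace path, tag) pairs from a nested tag tree."""
    for key, value in tag_tree.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # A nested namespace: recurse one level deeper.
            yield from flatten(value, path)
        else:
            # A leaf: either a list of tags or a single value.
            tags = value if isinstance(value, list) else [value]
            for tag in tags:
                yield path, tag


# Hypothetical interaction; real output may be shaped differently.
item = {
    "interaction": {
        "tag_tree": {
            "device": ["iPhone"],
            "company": {"nyse": ["Nike"]},
        }
    }
}

buckets = defaultdict(list)
for namespace, tag in flatten(item["interaction"]["tag_tree"]):
    buckets[namespace].append(tag)

print(dict(buckets))  # {'device': ['iPhone'], 'company.nyse': ['Nike']}
```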

Tag_tree vs Tags

If you used tags before you’ll notice that instead of the tags being output in the ‘tags’ property of interaction, they now appear in the ‘tag_tree’ property. This allows us to keep backward compatibility for existing customers using tags without namespaces.

See our docs for a full explanation of this change.

And There’s More!

So we’ve covered a quick example, but of course you can take things much further. You can use namespaces to build rich deep taxonomies to cover your business model. For inspiration check out our library of tag definitions. You can import these tags into your streams right now.

For full details of the features see our technical documentation.

Next time I’ll be looking at scoring, a way to give relative numerical values to interactions. This is a great feature for modelling priority and confidence.

If you’re new to DataSift, what’s stopping you? Register now and feast from the world of social data.

Richard Caudle
Updated on Monday, 16 December, 2013 - 12:54

The launch of DataSift VEDO introduced new features to allow you to add structure to social data. These new features allow you to add custom metadata to social interactions, saving you post-processing work in your application.

In this post I’ll introduce you to tagging, and explain why this will make working with social data a whole load easier. Tagging allows you to categorise data to match your business model. Keep watching this space as over the next few posts I’ll cover all of the new features in detail.

What Are Tags?

Tags are a simple but powerful way to add custom metadata to social interactions before they are delivered to your application. Once the platform has filtered your sources of social data using your filter, you can use the same language (CSDL) to add tags and classify interactions, saving you post-processing effort.

A Quick Example - Categorising User Devices

Device identification is incredibly useful when you’re trying to analyse audiences and how they interact. Using CSDL I can identify the device used to create the content and tag that interaction appropriately.

Let’s look at two interactions, the first from Twitter and the second from Bit.ly.

For a Twitter interaction the interaction.source target tells us which application was used to post the content, whereas for bit.ly interactions the bitly.user.agent target (the user-agent string) gives us a detailed profile of the browser or device used to post the link.

As different sources provide context information in a variety of formats and in different structures, writing application code to process this data is time consuming. By using tags we can simplify this task hugely and use the full power of CSDL to carry out this work.

 

I can use the tag keyword to add tags to my data above. Any interaction that matches the CSDL in the braces will be given the declared tag. The syntax for declaring a tag is:

tag "[tag name]" { // CSDL to match interactions }

 

Carrying on my example I'll create three tags to apply to my data:

In this definition I’m tagging interactions based on the user-agent and source properties. (Including both iOS and iPhone might seem strange, but this demonstrates that you can add multiple tags to an interaction!)

 

You’ll notice I’ve used the substr operator to inspect the user-agent field as often these strings are stripped of white space. 

bitly.user.agent substr "Blackberry"

Will match the following:

BlackBerry9700/5.0.0.862 Profile/MIDP-2.1 Configuration/CLDC-1.1 VendorID/144
BlackBerry8520/4.6.1.272 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/121

 

For the source property, on the other hand, contains works perfectly because these values have a cleaner format.

interaction.source contains "iPhone"

Will match:

Twitter for iPhone
UberSocial for iPhone

 

When the sample interactions pass through my definition the result will be:

The first interaction has been given two tags because ‘iPhone’ is included in both the iPhone and iOS tags, whereas the second interaction only matches the Blackberry tag.

When this data arrives at my application it is decorated with clean metadata. I can inspect the tags array and easily apply business rules rather than have to perform text processing.
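
As a sketch of what such a business rule might look like, the snippet below counts interactions per device tag. It assumes the tags arrive as a list under interaction.tags and that an untagged interaction simply has no such key; the sample records are hypothetical.

```python
from collections import Counter

# Hypothetical delivered interactions.
interactions = [
    {"interaction": {"tags": ["iPhone", "iOS"]}},
    {"interaction": {"tags": ["Blackberry"]}},
    {"interaction": {}},  # matched the filter but none of the tag rules
]

# Count every tag across the stream; a missing list counts as no tags.
device_counts = Counter(
    tag
    for item in interactions
    for tag in item["interaction"].get("tags", [])
)

print(device_counts["iOS"])         # 1
print(device_counts["Blackberry"])  # 1
```

No user-agent parsing is needed in the application because the classification has already happened upstream.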

Of course, I could extend my definition to cover many more data sources and devices, but regardless of the complexity CSDL gives us the power to classify interactions and deliver structured data to applications.

And That’s Just The Start!

My example covered just one scenario where tagging can be extremely effective and efficient.
Our latest release takes tagging to the next level, allowing you to tag and numerically score interactions, and to build reusable tag taxonomies fit for complex use cases. I’ll be explaining these new features in detail in my next few posts, so watch this space.

For inspiration check out our library of tag definitions. You can import these tags into your streams right now. And for full details of the features see our technical documentation.

If you’re new to DataSift, what’s stopping you? Register now and experience the power of our platform for yourself!!

Richard Caudle
Updated on Monday, 16 December, 2013 - 12:23

Today we announced the arrival of DataSift VEDO. In this post I’ll outline what this means to you as a developer or analyst.

DataSift VEDO gives you a robust solution to add structure to social data, solving one of the common challenges when working with unstructured ‘big data’. VEDO lets you define rules to classify data so that it fits your business model. The data delivered to your application needs less post-processing and is much easier to work with. The new features will save you time and give you a load more possibilities for your social data.

Data Is Meaningless Without Structure

When working with big data such as social content, one challenge you will always need to tackle is giving unstructured data meaningful structure. If you’re working with our platform currently, you will no doubt be extracting data to your server and running post-processing rules to organise the data to meet your needs.

Processing unstructured data is expensive and not much fun, but it’s where we excel. VEDO lets you offload processing onto our platform. You can now use CSDL (the same language you use for filtering) to add custom metadata labels and scores to data specifically for your use case.

Introducing Tagging And Scoring

VEDO introduces two new features which let you attach this metadata: tagging and scoring.

Tagging allows you to categorise interactions to match your business model. Any interaction that matches a tagging rule will be given the appropriate text label, serving as a boolean flag to indicate whether an interaction belongs to a category.

Scoring builds on tagging, allowing you to attach numerical values to interactions rather than just labels. You can build up a score over many rules and model subtle concepts such as priority, intention and weighting.

As you begin to use tagging and scoring more and more, you will want to be able to organise your growing set of rules. To help we have also introduced tag namespaces and reusable tag definitions. Tag namespaces allow you to define taxonomies of tags. You can group tags at any number of levels in namespaces and build deep schemas to fully reflect your model. Reusable tag definitions allow you to perfect your rules and reuse them across any number of streams and projects.

Definition Library

Tagging and scoring are powerful features, but at this point you might not have grasped exactly how they can help you. Therefore alongside the tagging features we’ve also introduced a library of definitions to get you started. Some definitions you can use immediately in your streams (and benefit from our experience), and some serve as example definitions to show you what is now possible.

For example, we have definitions that help you score content for quality (such as how likely the content is to be a job advert) and make it easier to exclude spam. On the other hand we have an example definition that shows how you can use the new features to classify conversations for customer service teams, picking out rants, raves and enquiries.

You can view the library here.

There’s More...

Although tagging is the main theme of the new release, there is an awful lot more happening here at DataSift. Alongside the release of VEDO we’re giving you more power, more connectivity and a wider range of sources to play with.

For instance we’ve just introduced delivery destinations for MySQL and PostgreSQL. These new destinations allow you to map your filtered data directly to a tabular schema and have it pushed directly into your database.

We’re also in the process of bringing many more sources onboard (you may have seen our recent announcements!), including many Asian social networks.

Look out for improvements to help you work with a wider variety of languages, updates to our developer tools and client libraries, and much, much more. I’ll cover these all soon.

Watch this space

In summary there’s far too much to cover in detail here. So watch this space, as over the coming weeks I’ll cover every feature of the new release in depth, with worked examples and sample code so you can take advantage of all these new powers for yourself.

If you can’t wait, all of these new features are fully documented in our Documentation area. Again, check out the new library for inspiration.

If you’re new to DataSift, what’s stopping you? Register now and experience the power of our platform for yourself!!

Jason
Updated on Friday, 8 November, 2013 - 17:39

On December 2nd, 2013, we plan to remove the "volume_info" field from the DataSift Historics API call response. Please ensure that your application does not expect to receive this field from Historics API calls by this date.

If you are using one of the official DataSift API client libraries, support for this has already been implemented in the following versions of the libraries:

  • Java - 2.2.1+
  • Python - 0.5.4+
  • Ruby - 2.0.3+
  • PHP - 2.1.4+
  • .NET - 0.5.0+
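
If you parse Historics responses yourself rather than through a client library, reading the field defensively is one way to prepare. This is a minimal sketch: the other keys in the response ("id", "status") and the sample values are hypothetical placeholders, not the documented response format.

```python
def summarise_historic(response: dict) -> dict:
    """Build a summary without requiring the 'volume_info' field."""
    # "id" and "status" are placeholder keys for illustration only.
    summary = {"id": response["id"], "status": response["status"]}
    # Use .get so the field's absence after the removal date is not an error.
    volume_info = response.get("volume_info")
    if volume_info is not None:
        summary["volume_info"] = volume_info
    return summary


# Hypothetical responses before and after the field is removed.
before = {"id": "abc123", "status": "succeeded", "volume_info": {"twitter": 42}}
after = {"id": "abc123", "status": "succeeded"}

print(summarise_historic(before))
print(summarise_historic(after))
```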
