Introducing Scoring - Attach Confidence, Quality And Rank To Your Social Data

Richard Caudle | 18th December 2013

The launch of DataSift VEDO introduced new features that allow you to add structure to social data. In my recent posts I introduced you to tags and tag namespaces and explained how you can use these features to categorise data before it reaches your application.

Alongside tagging, our latest release also introduces scoring, which I’ll cover today. Scoring builds upon tagging by allowing you to attach relative numeric values to interactions, so you can reflect the subtleties of quality, priority and confidence, opening up a whole raft of new use cases for your social data.

What Is Scoring?

Whereas a tag attaches a metadata label based on a boolean condition, scoring builds up a numerical score for an interaction over a set of rules. Tags are perfect for classifying into a taxonomy; scoring gives you the power to rate or qualify interactions on a relative scale. Just like tagging, you apply scores using the full power of CSDL, saving yourself post-processing in your application.
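To make the difference concrete, here is a minimal sketch; the namespaces, conditions and weights are purely illustrative. A plain tag either applies or it doesn’t, whereas scoring rules accumulate a value:

// A boolean tag: the label is attached whenever the condition matches
tag.topic "cats" { interaction.content contains "cat" }

// Scoring rules: each match adjusts a running total for the same tag
tag.quality.interesting +10 { interaction.content contains "photo" }
tag.quality.interesting -5 { interaction.content contains_any "buy now,free offer" }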

An Example - Identifying Bots

Our customers often look to identify spam in their data to improve analysis. What is considered spam depends heavily on the use case, and even on the specific question being asked. Here I’m going to rate tweets on how likely they are to be from a bot.

Using scoring I can give each tweet a relative confidence level for suspected bot content. Rather than just tagging a tweet, scoring allows an end application to filter further on the relative score.

Let’s look at two (stripped-down) tweets: the first likely to be from a real person, the second likely to be from a bot:

{
  "interaction": {
    "content": "http://videos.com/funny-cat",
    "source": "Twitter for Android"
  },
  "twitter": {
    "user": {
      "created_at": "Wed, 31 Jul 2012 13:01:22 +0000",
      "statuses_count": 23,
      "friends_count": 132,
      "follower_count": 523
    }
  }
}

{
  "interaction": {
    "content": "Look at this great product I found http://product.com/super-car",
    "source": "twittbot.net"
  },
  "twitter": {
    "user": {
      "created_at": "Tue, 16 Dec 2013 17:55:36 +0000",
      "statuses_count": 12,
      "friends_count": 1098,
      "follower_count": 31
    }
  }
}

There are a number of useful metadata properties I can work with to build my rules:

  • interaction.source - For Twitter this is the application that made the post. Here we can look for well-known bot services.
  • twitter.user.created_at - When the profile was created. A recently created profile is more likely to belong to a bot. (Note that the platform calculates a new target called profile_age, inferred from the created_at date, which makes our lives easier in CSDL.)
  • twitter.user.statuses_count - How many statuses the user has ever posted. A bot account is likely to have posted few messages.
  • twitter.user.follower_count - The number of users that follow this user. A bot account will often follow many people but have few followers itself.
  • twitter.user.friends_count - The number of people the user follows. (Again the platform gives us extra help, combining follower_count and friends_count into a new target called follower_ratio.)
  • twitter.user.verified - A value of 1 tells us that the account has been verified by Twitter, so is definitely not a bot!

Any one of these signals hints that the content could be from a bot. The beauty of scoring, though, is that the more of these signals we see, the more confident we can be that the tweet is from a bot.

Scoring allows us to build up a final score over multiple rules: for every rule that matches, that rule’s score is added to or subtracted from the interaction’s total.

You use the tag keyword to declare scoring rules, with the following syntax:

tag.[namespace].[tag name] +/-[score] {
  // CSDL to match interactions
}

Any interaction that matches the CSDL inside the braces will have the declared score added or subtracted.

Let’s carry on the example to make this a bit clearer. Notice how I have multiple rules which contribute different scores to the same tag:

// Assume not spam if account is verified by Twitter
tag.quality.suspected_bot -100 { twitter.user.verified exists and twitter.user.verified == 1 }

// Suspect spam if user profile has been recently created
tag.quality.suspected_bot +25 { twitter.user.profile_age < 5 }
tag.quality.suspected_bot +50 { twitter.user.profile_age < 1 }

// Suspect spam if user has made few tweets ever
tag.quality.suspected_bot +10 { twitter.user.statuses_count < 25 }
tag.quality.suspected_bot +15 { twitter.user.statuses_count < 10 }

// Suspect spam if user has many fewer followers than they themselves follow
tag.quality.suspected_bot +50 { twitter.user.follower_ratio < 0.001 }
tag.quality.suspected_bot +25 { twitter.user.follower_ratio < 0.01 }
tag.quality.suspected_bot -50 { twitter.user.follower_ratio > 10 }

// Suspect spam if source is well-known bot friendly application
tag.quality.suspected_bot +20 { interaction.source contains_any "twittbot.net,EasyBotter,hellotxt.com,dlvr.it" }

When the rules are applied to the two tweets, the output is:

{
  "interaction": {
    "content": "http://videos.com/funny-cat",
    "source": "Twitter for Android",
    "tag_tree": {
      "quality": {
        "suspected_bot": 10
      }
    }
  },
  "twitter": {
    "user": {
      "created_at": "Wed, 31 Jul 2012 13:01:22 +0000",
      "statuses_count": 23,
      "friends_count": 132,
      "follower_count": 523
    }
  }
}

{
  "interaction": {
    "content": "Look at this great product I found http://product.com/super-car",
    "source": "twittbot.net",
    "tag_tree": {
      "quality": {
        "suspected_bot": 80
      }
    }
  },
  "twitter": {
    "user": {
      "created_at": "Tue, 16 Dec 2013 17:55:36 +0000",
      "statuses_count": 12,
      "friends_count": 1098,
      "follower_count": 31
    }
  }
}

For the first interaction the total is 10, as it matches only this rule relating to the number of statuses posted:

tag.quality.suspected_bot +10 { twitter.user.statuses_count < 25 }

For the second interaction the total is 80 because the following rules all match:

tag.quality.suspected_bot +25 { twitter.user.profile_age < 5 }  
tag.quality.suspected_bot +10 { twitter.user.statuses_count < 25 }  
tag.quality.suspected_bot +25 { twitter.user.follower_ratio < 0.01 }  
tag.quality.suspected_bot +20 { interaction.source contains_any "twittbot.net,EasyBotter,hellotxt.com,dlvr.it" }

So our relative scores tell us the first interaction is unlikely to be from a bot, whereas the second is highly likely to be so.
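It’s worth noting that scoring rules on their own don’t select any data; in a full stream definition they sit above a return clause, which holds the CSDL filter that decides which interactions enter the stream, and everything the filter lets through is then scored. A minimal sketch, with a deliberately simple filter for illustration:

// Scoring rules, as defined above
tag.quality.suspected_bot +20 { interaction.source contains_any "twittbot.net,EasyBotter,hellotxt.com,dlvr.it" }

// The return clause defines which interactions enter the stream;
// each one that passes is then run through the scoring rules
return {
  interaction.type == "twitter"
}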

When the data reaches my application I can choose how strict I’d like to be about spam. I might require a score of 0 or below to be as confident as possible that the content is not from a bot, or I might choose a threshold of, say, 25 to also include content that is merely unlikely to be from a bot. The power to make that choice, on a simple numeric scale, is given to the end application or user.

But Wait, There’s More...

My example covered just one scenario where scoring can be helpful, but there are many more. For example (the first of these is sketched below):

  • Measuring purchase intent, rating leads
  • Measuring sentiment and emotion
  • Rating content authors’ authority
  • Identifying influencers
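As a taste of the first of these, purchase-intent scoring might be sketched as follows. The phrases and weights here are entirely illustrative:

// An illustrative purchase-intent score
tag.intent.purchase +30 { interaction.content contains_any "where can i buy,looking to buy" }
tag.intent.purchase +15 { interaction.content contains_any "any recommendations,worth buying" }
tag.intent.purchase -25 { interaction.content contains_any "for sale,selling mine" }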

For inspiration, check out our library of tag definitions. For full details of the features, see our technical documentation.

In my next post I’ll be explaining how you can reuse all the hard work you’re putting into your tagging and scoring rules by importing these definitions into any number of streams.


Previous post: Keeping Tags Organised Using Namespaces

Next post: Build Reusable Tagging And Scoring Rules To Use Across Your Projects