Introducing: Links Augmentation

Stewart Townsend | 15th November 2011

One of our favourite features of DataSift is our Links Augmentation. In short, it fully resolves any URL to its original, un-shortened form. On top of resolving the link, we also fetch the content of the page it points to and allow you to filter against that as well. Most importantly, we do both of these in real time, allowing you to filter against the results.

How is it used?

We'll start with an example. Let's say I'm tweeting about the shiny new headphones I just bought on Amazon, and I shorten the Amazon.co.uk URL using Bit.ly. Twitter then shortens my Bit.ly URL further into a t.co short URL.

So, our URL trail looks something like this:

http://t.co/rAK7GUdn -> http://amzn.to/bH2hmP -> http://www.amazon.co.uk/gp/product/B003LPTAYI/

DataSift fully resolves the t.co link, so we can see what the link points to and query the content there. We could do a number of things with this data; one of the most useful is checking that the link is relevant to the query you have written.

Imagine we had written a CSDL query to look for posts that contain links and "shiny headphones":
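To make the resolution step concrete, here is a minimal Python sketch of following a redirect trail to its end. It is not DataSift's implementation: the `resolve_chain` function and the `REDIRECTS` table are hypothetical names introduced for illustration, and the table stands in for the HTTP redirects a real resolver would follow.

```python
from typing import Callable, List, Optional


def resolve_chain(url: str,
                  next_hop: Callable[[str], Optional[str]],
                  max_hops: int = 10) -> List[str]:
    """Follow redirects from `url` and return every URL visited, in order."""
    chain = [url]
    seen = {url}
    # Stop after max_hops redirects so a misbehaving shortener cannot trap us.
    while len(chain) <= max_hops:
        target = next_hop(chain[-1])
        if target is None:   # no further redirect: this is the final page
            break
        if target in seen:   # redirect loop: give up on the last new URL
            break
        chain.append(target)
        seen.add(target)
    return chain


# Hypothetical redirect table standing in for real HTTP responses.
REDIRECTS = {
    "http://t.co/rAK7GUdn": "http://amzn.to/bH2hmP",
    "http://amzn.to/bH2hmP": "http://www.amazon.co.uk/gp/product/B003LPTAYI/",
}

chain = resolve_chain("http://t.co/rAK7GUdn", REDIRECTS.get)
final_url = chain[-1]  # the un-shortened Amazon URL
```

Passing the lookup in as a function keeps the sketch runnable offline; in practice it would be replaced by something that issues an HTTP request and reads the redirect target.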

interaction.content contains_near "shiny,headphones:10" AND links.url exists

This will bring back any posts that contain a link and have "shiny" and "headphones" within 10 words of each other. It won't, however, give us any indication of what is at the end of the link.

To increase the chances of the link returning a useful result, we could change the CSDL to:

interaction.content contains_near "shiny,headphones:10" AND links.title contains "headphones"

This looks at the title of the fully resolved link and checks whether it contains the word "headphones".
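For readers who think more easily in code, the logic of that filter can be approximated in Python. This is only a sketch of the idea, not the CSDL engine: the `post` dictionary shape, the `contains_near` and `matches` helpers, and the case-insensitive matching are all assumptions made for illustration.

```python
import re
from itertools import product


def words(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def contains_near(text, terms, distance):
    """True if all terms occur within `distance` words of each other."""
    if not terms:
        return False
    tokens = words(text)
    # Positions at which each term appears in the token list.
    positions = [[i for i, tok in enumerate(tokens) if tok == term.lower()]
                 for term in terms]
    # Try every combination of one position per term.
    return any(max(combo) - min(combo) <= distance
               for combo in product(*positions))


def matches(post):
    """Approximate: contains_near "shiny,headphones:10" AND title contains "headphones"."""
    near = contains_near(post["content"], ["shiny", "headphones"], 10)
    title_hit = any("headphones" in link.get("title", "").lower()
                    for link in post.get("links", []))
    return near and title_hit


# Hypothetical post; the title is what the resolved page might report.
post = {
    "content": "Loving the shiny new headphones I just bought http://t.co/rAK7GUdn",
    "links": [{"title": "Example headphones product page"}],
}
result = matches(post)
```

A post with the same text but a link whose resolved title never mentions headphones would fail the second condition, which is exactly the extra relevance check the title filter provides.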

What makes it so useful?

Where do we start?

DataSift makes searching shortened URLs easier and more powerful than ever, with almost endless possibilities for the information you can gather about a link without even opening it.

What if we want to find the latest news stories about the NHL on ESPN? Simple:

links.domain IN "espn.com,espn.go.com" AND links.title CONTAINS "NHL" AND links.age < 172800

This will bring back any posts containing a link that finally resolves to espn.com or espn.go.com, whose final page title contains the string "NHL", and that was (at the time of writing) created in the last two days, i.e. 172800 seconds.
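The three conditions, and where the 172800 figure comes from, can be spelled out in a short Python sketch. The `link` dictionary and its `domain`, `title` and `age` keys are hypothetical and simply mirror the CSDL targets used in the query; the case-insensitive title match is likewise an assumption.

```python
# Two days expressed in seconds: 2 days * 24 hours * 60 minutes * 60 seconds.
TWO_DAYS_SECONDS = 2 * 24 * 60 * 60

ALLOWED_DOMAINS = {"espn.com", "espn.go.com"}


def is_recent_nhl_story(link):
    """Approximate the three conditions joined by AND in the query."""
    return (link["domain"] in ALLOWED_DOMAINS
            and "nhl" in link["title"].lower()
            and link["age"] < TWO_DAYS_SECONDS)


# Hypothetical resolved link, one hour old.
example = {"domain": "espn.go.com", "title": "NHL scores", "age": 3600}
keep = is_recent_nhl_story(example)
```

Because the conditions are joined with AND, a link has to pass all three; an old ESPN story or a fresh NHL story on another site would both be dropped.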

