Blog posts in Announcements

Ed Stenson
Updated on Thursday, 15 November, 2012 - 12:47

Today, DataSift announces two new link resolution services. First, we're delighted to be partnering with bitly, the #1 link sharing platform, which powers 75 percent of the world’s largest media companies and half of the Fortune 500 companies. With over 20,000 white-labeled domains, bitly generates 200M clicks/day.

In addition, DataSift's own Links augmentation is now live, too. This is a massive update to our old TweetMeme links aggregator. Until now, we used TweetMeme to fully resolve links embedded in Tweets and other data sources but, with the increased volume of links traffic coming from Twitter, we were close to hitting TweetMeme’s maximum capacity of 40M links/day.

What's the big deal?

These two services are complementary. By definition, 100 percent of bitly interactions relate to links and clicks on links. In fact, the data volume can be so high that you might want to add an extra line of CSDL to throttle back the volume.
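
As a minimal sketch of that throttling line, the Python snippet below appends a sampling condition to an existing filter. It assumes DataSift's `interaction.sample` target, which this post does not name, and the helper `throttle` is hypothetical. Treat the resulting CSDL as an illustration rather than the exact line to use.

```python
def throttle(csdl: str, percent: float) -> str:
    """Return `csdl` limited to roughly `percent` percent of matching interactions.

    Assumption: the `interaction.sample` target gives each interaction a random
    value between 0 and 100, so `interaction.sample < 10` keeps about 10 percent.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    return f"({csdl}) and interaction.sample < {percent:g}"


# Keep about one in ten clicks on New York Times short links.
print(throttle('bitly.cname == "nyti.ms"', 10))
```

Wrapping the original filter in parentheses keeps the sampling condition from changing how any `or` clauses inside it are grouped.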

But adding the Links augmentation provides even more opportunities for filtering. For example, you can write CSDL that filters in real time for clicks made within the UK on bitly links that lead to content with Apple in the title.
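
A filter along those lines might be assembled as in the sketch below. The `bitly.country_code` target is described later in this post. The `links.title` target name, the `"GB"` country code value, and the helper `all_of` are assumptions made for illustration.

```python
def all_of(*conditions: str) -> str:
    """Join CSDL conditions so that an interaction must satisfy every one."""
    return " and ".join(f"({condition})" for condition in conditions)


# Clicks made within the UK ("GB" as the code value is an assumption).
uk_clicks = 'bitly.country_code == "GB"'
# Target pages with Apple in the title (links.title is an assumed target name).
apple_title = 'links.title contains "Apple"'

csdl = all_of(uk_clicks, apple_title)
print(csdl)
```

The first condition comes from the bitly data source and the second from the Links augmentation, which is why the two services work well together.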

On top of that, DataSift's Links augmentation adds metadata to the interactions in your filters. I discuss the significance of metadata in another blog post, Open Graph and Twitter Cards.

Targets

In DataSift, there are currently 15 bitly targets that you can filter against. For example:

  • bitly.cname allows you to filter against custom names in bitly such as es.pn (for the ESPN sports network) or nyti.ms (for the New York Times).
     
  • bitly.referring_domain allows you to filter for clicks on links from particular domains; that is, links embedded on pages at domains you specify.
     
  • bitly.country_code allows you to filter for clicks on bitly links from countries you specify.

Meanwhile, our Links augmentation offers 79 targets that you can filter against.

 

Use cases

Trend analysis

A very compelling use case for bitly is trend analysis. We can already track Likes on Facebook or Retweets on Twitter and expect to see when something goes viral. But what if a story receives relatively little of this kind of attention but large numbers of clicks? To monitor clicks activity rather than sharing activity, bitly and the Links augmentation are perfect. For an all-round perspective, you could monitor bitly, the Links augmentation, Twitter retweets, and Facebook likes simultaneously.

Platform analysis

The bitly.user.agent target is useful when you want to measure the popularity of a particular web client. For example, you can find out which of your content is most popular on mobile devices.
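
Once clicks have been collected, a tally of that kind could look like the sketch below. The record fields `url` and `user_agent`, the sample data, and the substring check for mobile clients are all hypothetical simplifications, not DataSift's output format.

```python
from collections import Counter

# Crude, assumed markers of a mobile client; real user-agent parsing is subtler.
MOBILE_HINTS = ("iphone", "ipad", "android", "mobile")


def is_mobile(user_agent: str) -> bool:
    """Return True if the user-agent string contains a known mobile marker."""
    lowered = user_agent.lower()
    return any(hint in lowered for hint in MOBILE_HINTS)


def mobile_clicks_by_url(clicks):
    """Count clicks per URL, considering only clicks from mobile clients."""
    return Counter(c["url"] for c in clicks if is_mobile(c["user_agent"]))


clicks = [
    {"url": "http://news.example/a", "user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0)"},
    {"url": "http://news.example/a", "user_agent": "Mozilla/5.0 (Linux; Android 4.1)"},
    {"url": "http://news.example/b", "user_agent": "Mozilla/5.0 (Windows NT 6.1)"},
]
print(mobile_clicks_by_url(clicks).most_common(1))
```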

Geo analysis

Find out what percentage of people publishing or sharing information about a particular subject are located in a specified area. Take it one stage further and approximate the size of the area that people are Tweeting from. Use it to find out whether people enjoyed a rock concert, or to determine how quickly a wildfire is spreading.

Timezone analysis

The timezone for a click helps you find out when content is published or shared in different time zones. Use it to compare the kind of content popular in the morning in the US and mainland Europe.

Summary

Looking at the combined stats for bitly and the Links augmentation, DataSift resolves an average of 3,500 links per second, collecting the metadata at the same time and caching the results.

To learn more about these services and the engineering that makes them work, take a look at today's blog by Lorenzo Alberton, DataSift's Chief Technical Architect.

Ed Stenson
Updated on Thursday, 15 November, 2012 - 16:28

Today we announce our new data source, the bitly input stream. With 200M clicks a day, it provides an excellent augmentation to the links embedded in Tweets and in messages from other data sources. In the past, we could see which content was being shared; we can now see which links are actually being clicked. In practical terms, DataSift can now reveal activity that, formerly, was hidden. We can show the real reach of content like never before, providing the complete picture, not just one side of it.

The new data source is unquestionably useful by itself, but here at DataSift we’re always trying to find ways to add more information to the input data, making it richer and more structured. Every time our filtering engine sees an interaction that contains a link, it resolves that link all the way back to the interaction's original target page, even if the link has been shortened several times. Then, it examines the target page, looking for metadata in Open Graph or Twitter Cards format in the page's HTML header. Any metadata that it finds, it adds to the interaction. We believe the result makes the click stream ten times more valuable to our users, so let’s explore in more depth the data that our platform can deliver.
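
The idea of resolving a link that has been shortened several times can be sketched as follows. This is not DataSift's implementation: an in-memory dictionary stands in for real HTTP redirects, and the function name, the hop limit, and the example URLs are all hypothetical.

```python
from typing import Mapping


def resolve(url: str, redirects: Mapping[str, str], max_hops: int = 10) -> str:
    """Follow a chain of redirects until reaching a URL that redirects no further."""
    seen = {url}
    hops = 0
    while url in redirects:
        if hops == max_hops:
            raise ValueError("too many redirects")
        url = redirects[url]
        if url in seen:
            raise ValueError(f"redirect loop at {url}")
        seen.add(url)
        hops += 1
    return url


# A link shortened twice before reaching the target page.
redirects = {
    "http://short.example/a": "http://other.example/b",
    "http://other.example/b": "http://news.example/story",
}
print(resolve("http://short.example/a", redirects))
```

The hop limit and the loop check matter in practice because shortened links can point at each other, and a resolver that follows them blindly would never finish.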

Data, metadata, and embedded content

A simple but immensely significant change has arrived in the world of social media, as two apparently separate elements, embedded content and metadata, have come together in a fascinating way. At DataSift, the effect is already impacting about 30 percent of the content passing through our servers, and the trend shows healthy growth.

What are Open Graph and Twitter Cards?

Let's define our terms first. Embedded content on a web page consists, for example, of videos or static images such as photographs.

Meanwhile, metadata is nothing more than data that describes other data. If that data happens to be a piece of embedded content, a press photograph, for instance, it might be accompanied by metadata such as:

  • a title
  • a description
  • a URL that points to the photograph
  • the width and height of the photograph

... plus as many other nuggets of relevant information as the image's creator chose to supply.

A key technology here is Facebook’s Open Graph protocol (now an open standard that anyone can use), and more recently Twitter Cards. Given the volume of content being shared on Facebook and Twitter, these two platforms decided to propose a set of metadata properties that content creators could use to influence the way their content is previewed (“embedded”) when shared on Facebook and Twitter.

As an example, the New York Times (one of the over 2,000 newspapers already using Open Graph and Twitter Card metadata) might specify, for each article, the title, the description, the author, the canonical URL, and what image should be used in the preview on Twitter/Facebook.
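
To make this concrete, the sketch below pulls Open Graph and Twitter Card properties out of a page header using Python's standard `html.parser` module. The sample page and the class name `MetaTagParser` are invented for illustration, and this is a simplified reading of the two formats rather than the parser DataSift runs.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect Open Graph (og:*) and Twitter Card (twitter:*) <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Open Graph tags use the `property` attribute; Twitter Cards use `name`.
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key and content is not None and key.startswith(("og:", "twitter:")):
            self.metadata[key] = content


PAGE = """
<html><head>
<meta property="og:title" content="An example headline">
<meta property="og:image" content="http://news.example/photo.jpg">
<meta name="twitter:card" content="summary">
<meta name="description" content="Not Open Graph, so it is skipped">
</head><body></body></html>
"""

parser = MetaTagParser()
parser.feed(PAGE)
print(parser.metadata)
```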

Why are Open Graph and Twitter Cards significant?

Open Graph and Twitter Cards allow Facebook and Twitter to present rich content, and these ideas are producing an extraordinary, explosive effect because they benefit so many participants in social media:

  • Creators benefit because they now have a way to determine what happens to their content after release. By defining metadata for any creation, whether it's a 3,000-word blog, a photograph, a video, an audio clip, or something brand new on the web, creators can name, annotate, and classify their work.
  • Syndicators benefit because metadata makes their lives easier. In the old days, a newspaper article about Hewlett-Packard stock might have discussed $HPQ common stock but it might have been about inventory shortages of the latest HP server. The only way to be sure was to read the article, or to use natural-language processing to analyze it. But metadata takes the problem away. To describe the article, the syndicator can simply republish the description that they find in the metadata. If it comes from a trusted creator, it will be good. The quality and amount of metadata can be impressive, and span classification, summary, domain, author information, etc.
     
  • Consumers benefit from metadata because they get a better experience on Twitter and Facebook, by having a compelling, visual preview of the target page embedded in their timeline, and not just a link, so they can immediately make up their mind whether it’s worth following the link to the full article or not.

 

Statistics

According to our statistics, more than one-third of all the links we receive point to a page with Open Graph metadata, and about 10 percent also have Twitter Cards (it’s a lower percentage because Twitter Cards is a younger protocol and less generic), so a really significant portion of the links will contain a wealth of information attached.

 

Examples
 

Twitter Card

Facebook Open Graph

 

 

DataSift

We believe that we're the only company able to filter against Open Graph and Twitter Card data, offering you an opportunity to gain unique insight. Here are a few possible use cases:

  • In real time, monitor clicks on bitly links to your site or check out bitly links going to your competitors' sites.
     
  • For stories about TV shows featured on America's top five newspaper websites, which ones are shared in links the most?
     
  • For Tweets that were heavily retweeted, filter for those that contained heavily clicked links.
     
  • For stories on Super Bowl Sunday, exclude the ones that do not have Google News keyword metadata. Stories with Google News keywords will be amongst those most widely read.

 

Stewart Townsend
Updated on Friday, 5 April, 2013 - 15:34

The technology behind the platform

In addition to the existing live and buffered streaming, DataSift can now record data to its massive storage cluster. The core of the recording platform is an HBase cluster of more than 30 nodes, with over 400TB of storage. Every piece of information is replicated three times for high availability and disaster recovery.

The same infrastructure is used to record the entire Twitter Firehose, along with the other input sources and all the augmentations (including trends, sentiment analysis, Klout authority score, and gender demographics).

Recording the raw Firehose (250 million Tweets daily) would probably require an entire hard drive every day, so the data is compressed and decompressed on the fly using a highly efficient compression library developed at Google for heavy workloads and high-speed compression.

 

Communication between the website/API and the storage cluster is accomplished using several languages (Java, Scala, PHP via Thrift), and the cluster is continuously monitored, using metrics to dynamically adapt the workload for maximum efficiency.
 
Internally, a standalone version of the filtering platform runs on each Hadoop node in parallel, effectively analyzing billions of records against several filters at once. The data is transformed, filtered, checked against each user's license and output mask, and then emitted as a new recording that can be exported (for example, to Amazon S3).
 
The platform had to be partially re-engineered to work as an embeddable library instead of a standalone service.
 
Moving data around at this volume is not an option, so everything has to be local: the analysis happens where the data is already stored. This is the opposite of what's been done so far in most data centers, where data is moved to the processing unit.
 
 
 
As this diagram shows, data is uniformly distributed across different servers on several data nodes. When a request to filter historical data is received, hundreds of parallel tasks commence, and each of them filters one data node within the selected time range and for the chosen data sources. Thanks to the nature of Hadoop and map-reduce, everything is performed in parallel, and the results of each unit are then collated into a continuous recording in a subsequent step.
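
The filter-then-collate pattern described above can be sketched in a few lines of Python. Threads stand in for Hadoop tasks here, each inner list stands in for the records held on one data node, and the record fields and function names are hypothetical. The real platform runs on Hadoop, not on this code.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_node(records, start, end, sources, predicate):
    """Filter one node's records by time range, data source, and user filter."""
    return [
        r for r in records
        if start <= r["timestamp"] < end
        and r["source"] in sources
        and predicate(r)
    ]


def run_historic_query(nodes, start, end, sources, predicate):
    """Filter every node in parallel, then collate one time-ordered recording."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(
            lambda records: filter_node(records, start, end, sources, predicate),
            nodes,
        )
        matches = [r for partial in partials for r in partial]
    return sorted(matches, key=lambda r: r["timestamp"])


nodes = [
    [{"timestamp": 3, "source": "twitter", "text": "apple pie"},
     {"timestamp": 9, "source": "twitter", "text": "apple launch"}],
    [{"timestamp": 1, "source": "digg", "text": "apple news"},
     {"timestamp": 2, "source": "twitter", "text": "banana"}],
]
recording = run_historic_query(
    nodes, start=0, end=5, sources={"twitter", "digg"},
    predicate=lambda r: "apple" in r["text"],
)
print([r["timestamp"] for r in recording])
```

Each task only ever touches the records of its own node, which mirrors the point that the analysis happens where the data is stored. Only the much smaller filtered results are brought together in the collation step.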

What you receive

The full historical data is available for post-processing so, for instance, it's possible to apply a filter on the entire Twitter Firehose, or on Digg or MySpace, for the past month. This feature is particularly valuable when the need arises to analyze trends after they have happened. There are many potential use cases, including analyzing the response to an ad campaign or looking for correlations between unanticipated events and the social media comments that followed.

The DataSift recording interface is very simple and abstracts all the internal complexity. You can select the data sources, the time range, and the filter you want to apply, and you'll be notified when the data is ready to be exported. You don't need to worry about anything else.

Stats

  • 30-node Hadoop cluster
  • 180 hard drives
  • Storing the entire Twitter Firehose of 250+ million Tweets per day
  • 500 GB of compressed data per day

Data Stores

  • MySQL (Percona server) on SSD drives
  • HBase cluster (currently ~30 Hadoop nodes, 400TB of storage)
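
For a rough sense of scale, the figures above can be combined in a back-of-envelope calculation. It rests on several assumptions that the post does not confirm: decimal units (1 GB = 10^9 bytes), that the 500 GB per day is measured before replication and covers the 250 million Tweets, and that the 400TB is raw capacity shared by all three replicas.

```python
TWEETS_PER_DAY = 250 * 10**6        # from the stats above
COMPRESSED_PER_DAY = 500 * 10**9    # 500 GB per day, decimal units assumed
CLUSTER_CAPACITY = 400 * 10**12     # 400 TB, assumed to be raw capacity
REPLICATION_FACTOR = 3              # every record is stored three times

# Average compressed footprint of one Tweet with its augmentations.
bytes_per_tweet = COMPRESSED_PER_DAY / TWEETS_PER_DAY

# Disk consumed per day once every record has been replicated.
replicated_per_day = COMPRESSED_PER_DAY * REPLICATION_FACTOR

# Days of recording that fit in the cluster under these assumptions.
days_of_capacity = CLUSTER_CAPACITY / replicated_per_day

print(f"{bytes_per_tweet:.0f} bytes per Tweet")
print(f"{replicated_per_day / 10**12:.1f} TB written per day")
print(f"about {days_of_capacity:.0f} days of capacity")
```

If any of these assumptions is wrong, for example if the 500 GB also covers other data sources, the results shift accordingly, so treat them as orders of magnitude only.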



Get on the list for early access to Twitter Historical Data.