Lorenzo Alberton | 14th November 2012
Bitly, the #1 link sharing platform, powering 75 percent of the world’s largest media companies and half of the Fortune 500, with over 20,000 white-labeled domains, generates a lot of traffic: 80M new links and 200M clicks per day. In other words, on average, over 2,000 people click on a bitly link every second.
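A quick back-of-envelope check of those headline numbers:

```python
# 200M clicks/day averaged over 86,400 seconds comes out at
# roughly 2,300 clicks per second -- "over 2,000", as stated.
clicks_per_day = 200_000_000
seconds_per_day = 24 * 60 * 60  # 86,400

avg_clicks_per_second = clicks_per_day / seconds_per_day
print(f"{avg_clicks_per_second:,.0f} clicks/second on average")
```

Peak traffic is of course much spikier than the average, which is what capacity planning has to account for.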
When we started exploring a partnership with bitly, we had to do some capacity planning to support their stream in our DataSift platform. We also took the opportunity to revise how we ingest high-volume streams and how we resolve links.
Firehose? More like raging rivers!
Luckily, we already had custom-built libraries to handle huge HTTP streams, which are split from one pipe into many more-manageable sub-streams, sliced and diced into single messages and fed into high-performance, persistent queues, ready to be processed by the rest of the platform. We learnt a lot from our experience with handling the Twitter Firehose, so it was just a matter of provisioning a few more machines and configuring the new “pipes”.
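The post doesn’t show the internals of those libraries, but the splitting step can be sketched as consistent hashing of each message onto one of N queues. Everything below (the `route` function, the sub-stream count) is illustrative, not DataSift’s actual API:

```python
import hashlib
from queue import Queue

# Hypothetical fan-out of one big pipe into N manageable sub-streams.
NUM_SUBSTREAMS = 8
substreams = [Queue() for _ in range(NUM_SUBSTREAMS)]

def route(message_id: str, payload: bytes) -> int:
    """Hash the message id to pick a sub-stream deterministically,
    so load is spread evenly and related messages stay together."""
    digest = hashlib.md5(message_id.encode()).hexdigest()
    index = int(digest, 16) % NUM_SUBSTREAMS
    substreams[index].put(payload)  # buffer for downstream workers
    return index
```

In production the queues would be persistent (as the post says), so a crashed consumer can pick up where it left off.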
Links Resolution Architecture
Given that the bitly stream consists entirely of link data, we also got very excited about how we could use it. Until now, DataSift has been using the API of its own predecessor, TweetMeme, to fully resolve links embedded in Tweets and other data sources, but with the increased volume of link traffic coming from Twitter, we were close to hitting TweetMeme’s maximum capacity of 40M links/day.
What better opportunity to rebuild our links resolution service from the ground up? Now, resolving hundreds of links every second, with predictable latency and accuracy, is no easy task: 5 years of experience with TweetMeme proved to be an invaluable starting point. Taking it to the next level was a matter of redesigning the whole pipeline, keeping all the good parts and recoding where necessary.
The resulting architecture is summarized in the following diagram:
Every new link is cleaned and normalized, looked up in our cache, and immediately returned if found. If not, we try to follow a single hop, and we repeat the procedure until we reach the target page. Every hop is cleaned, normalized and cached, so we can resolve links sharing part of the chain much faster. At every hop, we evaluate the domain and we select specialized URL resolvers, which allow us to rewrite URLs on the fly and/or improve our cache hit ratio. We have stampede-protection systems in place, and watchdogs whose goal is to guarantee the lowest possible latency. Thanks to the new architecture, tens of workers can now resolve links in parallel to spread out the workload across different processes and different servers, so we can keep scaling out as the number of inbound links increases.
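The hop-by-hop loop described above can be sketched roughly as follows. The cache, the normalisation rules, and the `follow_one_hop` helper are all assumptions for illustration; a real implementation would add the specialized per-domain resolvers, stampede protection, and timeouts the post mentions:

```python
from urllib.parse import urlsplit, urlunsplit

cache = {}      # normalised URL -> final target URL
MAX_HOPS = 10   # watchdog: never chase an endless redirect chain

def normalise(url: str) -> str:
    """Clean and canonicalise a URL so equivalent forms share a cache key."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def follow_one_hop(url: str):
    """Placeholder for one HTTP request returning the redirect
    Location, or None when the target page has been reached."""
    return None

def resolve(url: str) -> str:
    hops = []
    current = normalise(url)
    for _ in range(MAX_HOPS):
        if current in cache:          # another chain shared this hop
            current = cache[current]
            break
        hops.append(current)
        nxt = follow_one_hop(current)
        if nxt is None:               # reached the target page
            break
        current = normalise(nxt)
    for hop in hops:                  # cache every hop in the chain
        cache[hop] = current
    return current
```

Caching every intermediate hop is what makes partially-shared chains fast: a second link that redirects through an already-seen shortener resolves on the first cache hit.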
With these changes, we managed to reduce the (already low) failure rate, and vastly increase the number of successfully resolved links, instantly quadrupling the throughput on the same number of nodes. At the time of writing, we’re successfully resolving over 200M links/day.
Metadata > Data
The real power of DataSift lies in taming huge volumes of traffic and going well beyond what we receive from our data sources, augmenting each and every message with extra metadata to add context, insight, and real quality to our inputs. So it was a natural evolution to extend the new, more powerful link resolution engine with extra features. We started by extracting the title of the target page, the language, the content type (we’re now able to resolve links to images and videos!), the description, and keywords. These metadata fields already add a lot of value to the original message, which might otherwise have no context at all. However, we didn’t stop there.
We decided to extract OpenGraph data, Twitter Cards data, and Google News keywords from every page too: almost 100 new properties about the page, the publication, the author, the images and videos associated with a post, classification information, descriptions, Twitter handles, information about entities (companies, people, places) mentioned, and so on. Enough information to make any data scientist drool. When we showed a preview to our internal data science team, they reacted as if Christmas had come early.
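A minimal sketch of that extraction step, pulling the page title, standard meta tags, and OpenGraph (`property="og:*"`) / Twitter Card (`name="twitter:*"`) fields from the target page’s HTML. The field names follow the public specs; nothing here is DataSift’s actual parser:

```python
from html.parser import HTMLParser

class MetadataParser(HTMLParser):
    """Collects <title>, description/keywords meta tags, and
    og:/twitter: properties into a flat dictionary."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.fields = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            a = dict(attrs)
            # OpenGraph uses property=, Twitter Cards and classic
            # metadata use name=; treat either as the key.
            key = (a.get("property") or a.get("name") or "").lower()
            if key:
                self.fields[key] = a.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

parser = MetadataParser()
parser.feed('<html><head><title>Big News</title>'
            '<meta name="description" content="A breaking story">'
            '<meta property="og:image" content="http://example.com/pic.jpg">'
            '<meta name="twitter:card" content="summary"></head></html>')
```

After feeding a page through, `parser.title` and `parser.fields` hold the augmentation data that gets attached to the original message.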
Clicks + page metadata = Information heaven
By combining information about each click (for example, user agent, geo-location, and domain) with rich metadata about the target page, the number and quality of insights about what people visit and talk about has reached an entirely new level, and can really transform the meaning of social data mining. And with DataSift, accessing all this information has never been easier.