Stewart Townsend | 7th December 2011
The technology behind the platform
In addition to the existing live and buffered streaming, DataSift can now record data to its massive storage cluster. The core of the recording platform is an HBase cluster of more than 30 nodes, with over 400TB of storage. Every piece of information is replicated three times for high availability and disaster recovery.
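Three-way replication in an HBase deployment comes from the underlying HDFS replication factor. A minimal sketch of the relevant setting (the surrounding `hdfs-site.xml` boilerplate is omitted; three is also HDFS's default):

```xml
<!-- hdfs-site.xml: replicate every block to three data nodes -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```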
The same infrastructure is used to record the entire Twitter Firehose, along with the other input sources and all the augmentations (including trends, sentiment analysis, Klout authority score, and gender demographics).
Recording the raw Firehose (250 million Tweets daily) would probably require an entire hard drive every day, so the data is compressed and decompressed on the fly using a highly efficient compression library developed at Google for high-speed compression under heavy workloads.
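The post doesn't name the library, though Google's Snappy fits the description. Since Snappy bindings are a third-party dependency, this sketch uses the standard-library `zlib` instead, purely to illustrate the compress-on-write, decompress-on-read pattern; the record itself is a made-up miniature of a Firehose entry:

```python
import json
import zlib

# Hypothetical tweet-like record; real Firehose JSON is far larger.
record = {"id": 1, "text": "hello world", "user": {"screen_name": "example"}}

raw = json.dumps(record).encode("utf-8")
compressed = zlib.compress(raw)                      # on write: compress before storing
restored = json.loads(zlib.decompress(compressed))   # on read: decompress on the fly

assert restored == record  # the round trip is lossless
```

The trade-off Snappy makes versus `zlib` is lower compression ratios in exchange for much higher throughput, which is what matters when every record in the Firehose passes through the compressor.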
Communication between the website/API and the storage cluster is accomplished using several languages (Java, Scala, PHP via Thrift), and the cluster is continuously monitored, using metrics to dynamically adapt the workload for maximum efficiency.
Internally, a standalone version of the filtering platform runs on each Hadoop node in parallel, effectively analyzing billions of records against several filters at once. The data is transformed, filtered, checked against each user's license and output mask, and then emitted as a new recording that can be exported (for example, to Amazon S3).
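The per-record flow described above (transform, filter, license check, output mask, emit) can be sketched as follows. The field names, the set-based license check, and the mask logic are illustrative assumptions, not DataSift's actual schema:

```python
def process(record, predicate, licensed_sources, output_mask):
    """Sketch of the per-record pipeline: license check, filter, mask, emit."""
    if record["source"] not in licensed_sources:  # user's license
        return None
    if not predicate(record):                     # the user's filter
        return None
    # Output mask: emit only the fields the user is entitled to receive.
    return {k: v for k, v in record.items() if k in output_mask}

records = [
    {"source": "twitter", "text": "I love this ad", "sentiment": 0.9},
    {"source": "myspace", "text": "meh", "sentiment": -0.1},
]
out = [r for r in (process(rec, lambda r: r["sentiment"] > 0,
                           {"twitter"}, {"text", "sentiment"})
                   for rec in records) if r]
# out == [{"text": "I love this ad", "sentiment": 0.9}]
```

Records that fail the license or filter checks are dropped before masking, so the emitted recording contains only data the user is both interested in and entitled to.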
The platform had to be partially re-engineered to work as an embeddable library instead of a standalone service.
Moving data around at this volume is not an option, so everything has to be local: the analysis happens where the data is already stored. This is the opposite of the traditional approach in most data centers, where data is moved to the processing unit.
Data is uniformly distributed across different servers on several data nodes. When a request to filter historical data is received, hundreds of parallel tasks commence, and each of them filters the data on one node within the selected time range and for the chosen data sources. Thanks to the nature of Hadoop and map-reduce, everything is performed in parallel, and the results of each unit are then collated into a continuous recording in a subsequent step.
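The scatter/gather step above can be sketched with a toy map-reduce. A thread pool stands in for the Hadoop tasks, each inner list plays the role of one data node's locally stored records, and the record layout is an assumption:

```python
from multiprocessing.dummy import Pool  # threads stand in for Hadoop tasks

def filter_node(args):
    """Map step: filter one data node's local records for a time range and source."""
    records, start, end = args
    return [r for r in records if start <= r["ts"] < end and r["source"] == "twitter"]

# Two "data nodes", each holding its own slice of the recorded data.
nodes = [
    [{"ts": 1, "source": "twitter"}, {"ts": 9, "source": "digg"}],
    [{"ts": 5, "source": "twitter"}, {"ts": 20, "source": "twitter"}],
]

with Pool(len(nodes)) as pool:
    partials = pool.map(filter_node, [(n, 0, 10) for n in nodes])

# Collate step: merge per-node results into one continuous, time-ordered recording.
recording = sorted((r for part in partials for r in part), key=lambda r: r["ts"])
# recording == [{"ts": 1, "source": "twitter"}, {"ts": 5, "source": "twitter"}]
```

The key property mirrored here is that each map task only ever touches its own node's records; only the (much smaller) filtered results travel to the collation step.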
What you receive
The full historical data is available for post-processing so, for instance, it's possible to apply a filter on the entire Twitter Firehose, or on Digg or MySpace, for the past month. This feature is particularly valuable when the need arises to analyze trends after they have happened. There are many potential use cases, including analyzing the response to an ad campaign or looking for correlations between unanticipated events and the social media comments that followed.
The DataSift recording interface is very simple and abstracts away all of this internal complexity. You select the data sources, the time range, and the filter you want to apply, and you'll be notified when the data is ready to be exported. You don't need to worry about anything else.
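Driving such an interface might look roughly like the sketch below. `RecordingClient` and everything about it (method names, job model, the CSDL-style filter string) are hypothetical stand-ins, not DataSift's actual API; the point is only that the three inputs above are all the caller provides:

```python
class RecordingClient:
    """Hypothetical client for a recording service: sources + time range + filter in,
    notification when the recording is ready for export."""

    def __init__(self):
        self.jobs = {}
        self._next_id = 1

    def create_recording(self, sources, start, end, filter_text):
        job_id = self._next_id
        self._next_id += 1
        self.jobs[job_id] = {"sources": sources, "start": start, "end": end,
                             "filter": filter_text, "status": "running"}
        return job_id

    def status(self, job_id):
        return self.jobs[job_id]["status"]

client = RecordingClient()
job = client.create_recording(["twitter"], "2011-11-01", "2011-12-01",
                              'interaction.content contains "datasift"')
# A real job would eventually transition to a ready-for-export state
# and trigger a notification; here it simply stays "running".
print(client.status(job))
```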
- 30-node Hadoop cluster
- 180 hard drives
- Storing the entire Twitter Firehose of 250+ million Tweets per day
- 500 GB of compressed data per day
- MySQL (Percona server) on SSD drives
- HBase cluster (currently ~30 Hadoop nodes, 400TB of storage)