Split-Second Social Media Analysis with DataSift and Redis

Jacek Artymiak | 25th February 2013

Social media gives us a way to sample trends and sentiment in real time. Consequently, it is very important that the analysis of the data we are looking at also happens in real time. And we want to help you, because here at DataSift we want our platform to be the Swiss Army knife of the social media analysis tools. We try to be flexible and do as much of the hard work as possible so that you can focus on analyzing the data instead of having to think how to feed it into your processing pipeline.

We strive to achieve that goal with our advanced Push data delivery system and its ever-growing set of connectors that can deliver the data you filter for to a variety of destinations. These could be third-party cloud storage services, such as Amazon AWS DynamoDB or an instance of CouchDB running on your own server. If there is a way to connect to a machine via the internet, we want to be able to deliver data to it.

When time is of essence and you absolutely must be able to start analysing data as soon as you receive it, then keeping data in RAM will help you shorten the time needed to access and process it. One popular tool for managing data in memory is Redis, an Open Source key-value store. And today we are very happy to announce the immediate availability of our new Redis connector, which will deliver the data you filter for straight to your Redis instance.

Getting started with Redis

It is your responsibility to set up your own instance of Redis and make sure it can be reached via the internet. If you have never used Redis before, we have help to get you started. Then it is just a matter of setting up a subscription via our Push API. The data will then be delivered straight into your Redis server ready for processing.

At your end, you will need a way to connect to your Redis server and you can do that with one of Redis clients. Many are available and you should be able to find the one that fits your needs quite easily.

The client alone is just a part of equation. You will also need software that can unpack interactions you get from DataSift from JSON into another format and look for the answers to your questions. Just like Redis, JSON is very well supported and many programing languages include appropriate libraries by default. As for data analysis tools, you will be the best judge of their usefulness, and it always is a good idea to ask your community for suggestions when you are not sure.

Please remember that you will be more likely to get reliable results if you start your analysis with a well-defined data set. That is where a well-written set of CSDL filters can help you pick out the most relevant interactions for further processing.

Those pesky limits (and how to cheat around them)

Keeping data in RAM lets you avoid delays caused by slow disk read and write operations, but that convenience comes at a cost: RAM is volatile and usually not available in large quantities even on high-end servers. It is also expensive to buy. Fortunately, you can architect a solution that reads data from a Redis store and saves it to disk, you can also rent servers with 110GB of RAM or more on a hourly basis, which can be a very cost-effective alternative to buying them or leasing on a long-term contract. Amazon AWS EC2 High-Memory instances are one such solution.

The issue of volatility is important when you do not want to lose data. You can avoid problems by storing multiple copies of data on two or more servers either by replicating it yourself or by creating two or more Push subscriptions based on the same stream hash. You can also make backups of the data held in memory to disk.

And if you really lose data you can retrieve it again using Historics. There will be a delay in receiving data, which may render it no longer relevant, but please keep in mind that there is a way to "replay" your analysis albeit at additional cost of running a Historics query.

RAM size constraints are also fairly easy to overcome. If the data you want to analyze does not fit inside the physical memory installed in your machine, you will need to add RAM, get a machine with more RAM, or use a piece of software that can manage a farm of Redis servers, such as the Redis LightCloud manager.

Go mad!

If your social media analysis business needs to work in real-time our Redis connector is the tool that will help you get further ahead of your competition. Go mad, build something amazing, and let us know how else we could be helping you achieve your goals!

This post was written by Jacek Artymiak with valuable input from Ollie Parsley, the developer of the Redis Push Connector.

Previous post: The Query Builder

Next post: Writing CSDL in Vim