Open Data Processing for Twitter - Now Available

Richard Caudle | 7th July 2015

Following Twitter's announcement that it will terminate its partnership with DataSift, we've been working hard to build a solution that helps our customers continue to use the features they rely on in their Twitter solutions.

DataSift's open-source connector

As of August 13th 2015 companies will need to license Twitter data directly from Gnip.

To help make your transition as smooth as possible, and to help you to continue to use our platform features; today we've released an open-source connector that takes data from the Gnip API and delivers the data into our platform.

The new “DataSift Connector” is:

  • Pre-configured and easy to deploy to your own infrastructure or to common hosting platforms, requiring minimal setup
  • Highly configurable, allowing you to tailor it to your exact solution
  • Able to handle tweets, retweets and deletes

When data from Gnip is ingested, the DataSift platform transforms the schema to match the Twitter schema you've been working with until now, so minimal changes (if any) are required for your existing filters and classifiers.

Note that the connector doesn't currently support historic data integration. We are looking to add this support very soon.

How does it work?

The connector is a collection of components that you can configure, tailor, and use in whole or in part to bring Gnip data into our platform. We do not host these components for you, but we have made it as easy as possible to host them yourself or on a hosting platform such as AWS.

The key components of the connector are:

  • Gnip reader - Connects to the Gnip PowerTrack API, receives streaming data and adds it to the buffer
  • Buffer - Buffers data between the reader and writer, absorbing bursts of data and ensuring no data loss
  • DataSift writer - Connects to the DataSift API and uploads data from the buffer into our platform

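The reader-buffer-writer flow can be sketched in a few lines of Python. This is purely an illustration of the design, not the connector's actual code; the function names and payloads are ours:

```python
import queue

# A bounded buffer decouples the two ends: bursts from Gnip are
# absorbed rather than dropped, and the writer drains at its own pace.
buffer = queue.Queue(maxsize=10000)

def gnip_reader(stream):
    """Read activities from a (stand-in) Gnip stream and buffer them."""
    for activity in stream:
        buffer.put(activity)  # blocks when the buffer is full

def datasift_writer(upload):
    """Drain the buffer and upload each activity to DataSift."""
    while not buffer.empty():
        upload(buffer.get())

# Usage with stand-in data and a stand-in upload function:
received = []
gnip_reader([{"id": "1"}, {"id": "2"}])
datasift_writer(received.append)
```

In the real connector the reader and writer run independently, which is exactly why the buffer sits between them.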
Out of the box we provide configurations that combine these components to give you an immediate solution. You can, of course, take our code and modify it for your needs, using or omitting each component as you see fit.

For instance, you might want to modify the code to write to your own metrics collector. You can also replace the Gnip reader with your own reader. This is useful if you are storing Gnip data in MongoDB or similar: all you'd have to do is build a MongoDB reader and ensure it sends data to the buffer.

The connector has been designed this way because it forms part of our future roadmap for allowing you to ingest any data into our platform. We'll talk more about this very soon!

Using the connector in your solution

We've invested time in making the connector easy to test, configure and deploy to your choice of infrastructure.

For testing and deployment the solution includes:

  • Packer configurations - Deploy the solution using the Packer configurations we supply to build EC2 AMI and Docker images.
  • Vagrant support - We supply a vagrantfile so that you can clone the code repository locally and spin up a local instance for testing.

Of course you'll want to monitor the connector when you have it running in production. To help you do so the solution includes:

  • Metrics - The solution logs metrics to statsd, which you can easily integrate into your infrastructure
  • Grafana dashboard - A default dashboard configuration for monitoring key metrics
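Because statsd accepts a simple plain-text protocol over UDP, hooking the metrics into your own infrastructure is straightforward. The sketch below shows the counter format; the metric name is hypothetical, not one the connector actually emits:

```python
import socket

def statsd_counter(name, value=1):
    """Format a statsd counter packet: "<name>:<value>|c"."""
    return f"{name}:{value}|c".encode()

def send_metric(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a statsd daemon."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(packet, (host, port))
    sock.close()

# Hypothetical metric name; the connector defines its own set.
packet = statsd_counter("connector.items.read")
```

Grafana can then chart whatever backend (e.g. Graphite) your statsd daemon flushes to.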

To allow you to run a robust production solution, the DataSift ingestion endpoint deduplicates incoming data. This allows you to run multiple instances of the connector, giving you redundancy.
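One way to picture the deduplication is as keeping the first occurrence of each interaction id. This is our illustration of the idea, not DataSift's actual implementation:

```python
def deduplicate(interactions):
    """Keep only the first occurrence of each interaction id.

    When two redundant connector instances send the same tweet,
    both copies arrive with the same id; one is discarded.
    """
    seen = set()
    unique = []
    for item in interactions:
        if item["id"] not in seen:
            seen.add(item["id"])
            unique.append(item)
    return unique

# Two redundant connectors sending overlapping batches:
instance_a = [{"id": "1"}, {"id": "2"}]
instance_b = [{"id": "2"}, {"id": "3"}]
merged = deduplicate(instance_a + instance_b)
```

The upshot is that you can run a second connector instance purely for failover without worrying about double-counting.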

You might consider running an instance of the connector which dumps data to the buffer, then writing the data from the buffer into a data store for safe keeping.

Using ingested Twitter data in filters and classifiers

When you ingest data from Gnip it will be transformed to match the current Twitter schema you've been using in your filters to date.

The only difference you'll see is that the interaction.type target will be set to 'twitter_gnip'.

So if you have filters that specifically filter for Twitter data using this target, you'll need to update them to instead say:

interaction.type == "twitter_gnip"

Twitter data from Gnip does include additional fields that aren't part of the DataSift Twitter schema. These fields are preserved through the processing pipeline; you can find them under the unmapped collection at the top level of the JSON output for the interaction.
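Reading those extra fields from the output is a matter of looking up the top-level unmapped collection. The payload below is an illustrative stand-in (the field names inside unmapped are assumptions, not a documented schema):

```python
import json

# Illustrative interaction payload: "unmapped" holds Gnip fields
# that have no equivalent in the DataSift Twitter schema.
raw = """
{
  "interaction": {"type": "twitter_gnip", "content": "hello"},
  "unmapped": {"matching_rules": [{"tag": "my-rule"}]}
}
"""

data = json.loads(raw)
gnip_extras = data.get("unmapped", {})
```

Note that unmapped fields are carried through for you to consume downstream; they are not targets you can filter on in CSDL.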

Migration timeline

Hopefully the connector goes a long way toward completing your migration.

We recommend you start testing the new connector right away and get in touch with your feedback. You can test the connector and use all of our platform features. Remember, though, that the ingestion endpoint is currently rate-limited to 37MB per hour, enough to ingest 18,000 Tweets an hour (or 5 a second).
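As a quick sanity check on those numbers (the ~2KB average tweet payload is our inference from the quoted figures, not a documented value):

```python
# Back-of-envelope check on the ingestion rate limit.
limit_bytes_per_hour = 37 * 1024 * 1024  # 37MB per hour
tweets_per_hour = 18_000

bytes_per_tweet = limit_bytes_per_hour / tweets_per_hour  # ~2.1KB each
tweets_per_second = tweets_per_hour / 3600                # 5.0
```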

In early July we will raise the rate limits so you can test the solution with production data volumes.

Before the 13th August you will need to have switched to Gnip to receive Twitter data.

Getting started

You can get started with the connector immediately; the code for the connector is available on GitHub.

To enable access to the ingestion endpoint get in touch with your sales representative.
