Getting Started

DataSift is the complete real-time social media solution for enterprises looking for comprehensive data that delivers actionable insight.

DataSift is Big Data. We have 100% of the Twitter Firehose always available in real time.

Our media curation platform is cloud-based and highly scalable. It delivers a variety of metadata including enriched augmentations such as geo location, social influence, and sentiment analysis in one place, so you will never miss a thing. Delivered through our APIs or through our extensive analytics partnerships, our flexible pricing means that whether you're a large enterprise or a single developer you can get access to the data that you need.

Aggregating, Filtering, and Analyzing

DataSift filters for information as it is posted. For example, you could filter for:

  • any mention of an individual.
  • any message from a particular social media site.
  • any message sent within 25 miles of the 10 largest U.S. cities.

You can aggregate data to monitor streams of messages from more than one social media site simultaneously, or you can exclude individual sites.

We augment our real-time streams with third-party solutions such as PeerIndex, Klout, and Lexalytics. Our users can then filter on that data or have it delivered to them.

Here's a full list of the information your filters can target.

Historics

You can also filter against the DataSift Historics archive, a large body of content gathered from a variety of social media sites. Historics is useful when you want to turn the clock back and filter against data from the past.

It uses the same filtering language that you use when you're looking at live data. For Twitter, it works 100 times faster than live streaming and it offers 100% coverage.

REST API

The REST API enables developers to access DataSift's core functionality. Using DataSift's simple, powerful programming language, you can access the social media sites you want to monitor. The REST API provides easy ways to test and compile your code. You can run one or more streams at the same time and export data in real time.

Streaming API

The Streaming API offers data from all our sources in near real time. This API is for those developers building applications that do the heavy lifting, capturing continuous streams of data with no defined end and making sure that nothing is missed. If you're building an application for a major data-mining task, the Streaming API is the place to start. It also allows you to run multiple streams at the same time, filtering for two or more different sets of information simultaneously.


Things to Look at First


DataSift is very simple to use. Build filters to find the information you want in real time.

1. The filters are written in our programming language. It's called the Curated Stream Definition Language (CSDL).

2. The key feature of CSDL is its set of operators.

3. The filters take information from social media sites and augmentations. Together these are called targets.

4. You can write applications that access DataSift through our APIs.

5. You can test your filters in our user interface before you start to explore the APIs.

6. Please read our Understanding Billing guide and Billing FAQ.

7. Please read Targets vs Output Data to understand the difference between filtering data and consuming data.

 


Client Libraries: Basics

We've rewritten this page but the earlier version is still available if you need it.

 

Are you trying to set up Push or Historics? You're in the right place. Read this page and then go to Client Library: Push and Historics.

 

The best way to understand how our client libraries work is to take a high-level view of the general concepts and then dip down into your language of choice. All the client libraries follow the same basic principles that we describe here. Class and method names may vary slightly from library to library. This document is meant as a general overview of the concepts shared across the different language bindings.

If you need code examples right away:

 

Client Library Details (Beta)

We're gradually building our low-level documentation for each client library. This section is still in beta but you're very welcome to look:

 

Basic objects

User

All interaction with the API libraries begins with an instance of the User class. Most other objects you will create while using the API require you to pass a User object, to which you have given your DataSift login and an API key.

The User class also provides methods for creating most of those other objects.

Definition

A Definition object represents a stream definition. These are roughly equivalent to the streams as you see them on the DataSift web interface, except that you cannot store them in your DataSift account for later retrieval.

Historic

A Historic object represents a Historics query. These are also roughly equivalent to the queries you can create via the DataSift web interface.

StreamConsumer

This object creates and manages connections to the DataSift streaming server. The default connection type used is HTTP, but some of the libraries also support WebSockets. This object is roughly equivalent to the code that runs in the browser when you're receiving data from a stream in the DataSift API.

When you create a StreamConsumer you supply a way to receive events from the stream. Events include connection, interactions, errors, status messages and more.

Exceptions

All of the libraries throw exceptions (or their language's equivalent) when errors occur. It's important that you catch and handle all possible errors for every call you make.

 

Putting it all together

This section is in pseudocode but you can translate it almost directly into your language of choice.

We start by creating a User object.

 

We now have a choice depending on what we actually want to do. We can create a StreamConsumer directly from the User object if we have a stream hash that we want to consume. For the sake of example we're going to say that we don't have the hash, we have some CSDL that we want to use. To do that we need to create a Definition object to represent that CSDL.

 

The first thing we should do is check that the CSDL we've supplied is valid. We do this using the validate() method on the Definition object. As mentioned above, if the validate() method encounters an error (for example, if compiling the CSDL fails) it will throw an exception, so we make sure to catch that. However, we must also make sure we catch other errors that may occur. The example below uses a generic catch block to handle things like authentication or connectivity problems, but in your production code you should always catch specific exceptions.
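As a rough illustration of the steps so far, here is a minimal Python sketch. The User, Definition, and CompileError classes are local stand-ins written for this example, not the real client library; the create_definition() method name, the sample CSDL, and the offline validate() check are all assumptions. Your library's class and method names may differ.

```python
# Illustrative stand-ins only: these classes mimic the concepts described
# above and are NOT the real DataSift client library.

class CompileError(Exception):
    """Raised when a CSDL definition fails validation (hypothetical name)."""


class User:
    """Holds the credentials that every other object needs."""

    def __init__(self, username, api_key):
        self.username = username
        self.api_key = api_key

    def create_definition(self, csdl):
        # The User object acts as a factory for the other objects.
        return Definition(self, csdl)


class Definition:
    """Represents a stream definition written in CSDL."""

    def __init__(self, user, csdl):
        self.user = user
        self.csdl = csdl

    def validate(self):
        # A real library would send the CSDL to the /validate endpoint.
        # This stub only rejects empty definitions so the sketch runs offline.
        if not self.csdl.strip():
            raise CompileError("CSDL is empty")
        return True


user = User("your_username", "your_api_key")
definition = user.create_definition('interaction.content contains "music"')

try:
    definition.validate()
    print("CSDL is valid")
except CompileError as err:
    # The CSDL itself is at fault.
    print("CSDL failed to compile:", err)
except Exception as err:
    # Generic fallback for authentication or connectivity problems;
    # production code should catch specific exceptions instead.
    print("Unexpected error:", err)
```

Catching the compile error separately from the generic fallback keeps "my CSDL is wrong" distinct from "I could not reach the service", which usually call for different responses.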

 

Exception handling will be omitted from the remainder of this document, but please make sure you are correctly handling all exceptions that may be thrown by the library in your code, otherwise your program may terminate without warning while you're sleeping, and you'll miss out on some of the lovely data you want to consume.

Now that we know our CSDL compiles properly we can move on to creating a context for consuming data via a streaming connection. We start by creating our event handler.

 

For most libraries there is a reference class (or interface) which your event handler must implement. This defines the methods that must exist and the parameters they take. The most notable exception to this is the Ruby library which currently uses blocks rather than an event handler class.

The first method is onConnect() which gets called when a connection is successfully established with the DataSift streaming server.

 

The opposite of this event is getting disconnected. There's an event for that, too.

 

Let's take a moment here to look at the parameter being passed to these two events. All events will get the StreamConsumer object which is raising the event as the first parameter. This enables the handlers to make calls on the StreamConsumer.

The data coming down the streaming connection consists of a mixture of interaction objects, status messages, warnings and errors. We have events that handle each of these.

Status messages trigger the onStatus() method.

 

Status messages can contain additional information and this will be passed in the info HashMap. For example, a status of type "progress" for Historic queries will contain the percentage complete in info['progress'].

Errors and warnings trigger the following methods.

 

Interactions trigger one of two methods: onInteraction() and onDeleted(). The data passed for a deletion notification is in the same format as a normal interaction, but it contains only the data you need to identify the deleted interaction so that you can remove it from your own storage systems.

 

Note that properly handling delete notifications is required for you to remain compliant with some of the licenses you have signed.

Interactions trigger the onInteraction() method.

 

And that completes the EventHandler class.
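The event handler described above might look something like the following Python sketch. It is not taken from any of the client libraries: the method names are Python-style renderings of the events named in this section, on_warning and on_error are assumed names, and the shape of the interaction dictionaries (an "interaction" object with an "id") is an assumption made for illustration.

```python
# Hypothetical sketch of an event handler; not the real client library API.

class EventHandler:
    def __init__(self):
        self.stored = {}  # interaction id -> interaction, standing in for a database

    def on_connect(self, consumer):
        print("Connected")

    def on_disconnect(self, consumer):
        print("Disconnected")

    def on_status(self, consumer, status_type, info):
        # Extra details arrive in the info mapping.
        if status_type == "progress":
            print("Historics query is", info["progress"], "percent complete")
        else:
            print("Status:", status_type, info)

    def on_warning(self, consumer, message):
        print("Warning:", message)

    def on_error(self, consumer, message):
        print("Error:", message)

    def on_interaction(self, consumer, interaction):
        self.stored[interaction["interaction"]["id"]] = interaction

    def on_deleted(self, consumer, interaction):
        # Remove our copy so that we honour the delete notification.
        self.stored.pop(interaction["interaction"]["id"], None)


handler = EventHandler()
consumer = object()  # placeholder for the StreamConsumer that raises the events

handler.on_connect(consumer)
handler.on_interaction(consumer, {"interaction": {"id": "abc", "content": "hello"}})
handler.on_status(consumer, "progress", {"progress": 50})
handler.on_deleted(consumer, {"interaction": {"id": "abc"}})
handler.on_disconnect(consumer)
print("Stored interactions:", len(handler.stored))
```

Every method takes the consumer as its first parameter, so a handler can act on the connection from inside any event.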

 

Now that we have an event handler ready to receive events we can get a StreamConsumer from our Definition object.

 

The first parameter is the type of consumer we want. Most of the libraries only support HTTP streaming at this time, but some also support WebSockets. The second parameter is an instance of our EventHandler class.

We can now start to consume data. In most of the libraries this call will not return, so if your program needs to do other things while connected to the stream you'll need to wrap your usage of the API library in a thread.

 

The library will now compile the definition if necessary, connect to the streaming server, and start consuming data.
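To show how a blocking consume call can be wrapped in a thread, here is a small self-contained Python sketch. The StreamConsumer below is a stand-in that feeds canned data to the handler instead of opening a connection, and it is constructed directly rather than obtained from a Definition object; the class, its consume() and stop() methods, and the sample data are all assumptions for illustration.

```python
import threading

# Stand-in classes for illustration; not the real client library.

class CollectingHandler:
    def __init__(self):
        self.received = []

    def on_interaction(self, consumer, interaction):
        self.received.append(interaction)
        # The handler controls the consumer through the object it is passed.
        if len(self.received) >= 2:
            consumer.stop()


class StreamConsumer:
    """Feeds canned interactions to a handler instead of opening a connection."""

    def __init__(self, consumer_type, handler, source):
        self.consumer_type = consumer_type  # for example "http"
        self.handler = handler
        self.source = source
        self.running = False

    def consume(self):
        # Blocks until the source is exhausted or stop() is called,
        # much as a real streaming call blocks until disconnection.
        self.running = True
        for interaction in self.source:
            if not self.running:
                break
            self.handler.on_interaction(self, interaction)
        self.running = False

    def stop(self):
        self.running = False


canned = [{"interaction": {"id": str(n)}} for n in range(5)]
handler = CollectingHandler()
consumer = StreamConsumer("http", handler, canned)

# Run the blocking call in a worker thread so the main thread stays free.
worker = threading.Thread(target=consumer.consume)
worker.start()
worker.join()
print("Received", len(handler.received), "interactions")
```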

 

 

MultiStream support

The above discussion focused on consuming a single definition. Most of the libraries support consuming multiple definitions through the same stream connection. When doing that the event handler methods will get passed the hash of the stream which matched the interaction in addition to the other parameters.

Notes

  • Make sure you have a DataSift login and an API key.
  • The libraries will throw exceptions when something goes wrong, so make sure you're catching and properly handling them.
  • For most libraries, once you start the StreamConsumer it will not return until the stream gets disconnected.
  • You can control the StreamConsumer from any of the event handler methods using the StreamConsumer object they are passed.

 

Community-built libraries

We're happy to see that the DataSift development community has already started to add to the set of libraries we provided at launch.

Other code examples


Running the Java Examples

Download the Java Client Library

Download a copy of the Java library from our GitHub Java page. There are two ways to do this.

  • In GitHub, click Download and extract the archive manually
  • Use git from the command line:

    git clone https://www.github.com/datasift/datasift-java.git

Learn to Use the Java Client Library

These steps test the REST API (api.datasift.com) and then the streaming API (stream.datasift.com).

Endpoint                             Supported?
https://api.datasift.com/validate    Yes
https://api.datasift.com/compile     Yes
https://api.datasift.com/stream      Yes
https://api.datasift.com/dpu         Yes
https://api.datasift.com/usage       Yes
https://stream.datasift.com          Yes
ws://websocket.datasift.com          No *

* Note that there are two endpoints for the streaming API:

  • stream.datasift.com for HTTP access
  • websocket.datasift.com for WS access

The DataSift client libraries use HTTP streaming by default for most languages. There is no technical restriction on hitting the websockets API endpoint. The choice of protocol makes no difference to the data objects that are returned.

 


Saving to a database using PHP

Step 1.  Make sure you look at the Streaming API section of the PHP code example. It shows how to call the API.

Step 2.  Add your username and API key to the code.

Step 3.  Look at the format of an object from the source you want to work with. Identify the fields that you want to save to a database. 

Step 4.  Create a table in your database. Create columns to represent the data you will store.

Step 5.  Execute this on your MySQL server to create the sample table structure used in the example.

The PHP code will extract information from the stream and save it into the newly created table.

Here's some sample SQL code to create the table:
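As a rough illustration of Steps 3 to 5, the sketch below uses Python's built-in sqlite3 module in place of PHP and MySQL. The table name, the column names, and the sample interaction (including its field names) are hypothetical; adapt them to the fields you identified in Step 3.

```python
import sqlite3

# Hypothetical sample interaction; real objects carry many more fields.
sample = {
    "interaction": {
        "id": "1e1a2b3c",
        "type": "twitter",
        "author": {"username": "example_user"},
        "content": "Hello from the stream",
        "created_at": "2012-05-03T15:08:28Z",
    }
}

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE interactions (
        id         TEXT PRIMARY KEY,
        source     TEXT NOT NULL,
        author     TEXT,
        content    TEXT,
        created_at TEXT
    )
    """
)


def save_interaction(conn, obj):
    """Copy the chosen fields of one interaction into the table."""
    item = obj["interaction"]
    conn.execute(
        "INSERT OR REPLACE INTO interactions VALUES (?, ?, ?, ?, ?)",
        (
            item["id"],
            item["type"],
            item["author"]["username"],
            item["content"],
            item["created_at"],
        ),
    )
    conn.commit()


save_interaction(conn, sample)
rows = conn.execute("SELECT id, author FROM interactions").fetchall()
print(rows)
```

Using the interaction id as the primary key means that receiving the same object twice overwrites the existing row instead of creating a duplicate.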


Running the Node.JS Examples

Download the Node.JS Client Library

Download a copy of the library from our GitHub Node.JS page. There are two ways to do this.

  • In GitHub, click Download and extract the archive manually
  • Use git from the command line:

    git clone https://github.com/datasift/NodeJS-Consumer.git

Learn to Use the Node.JS Client Library

These steps test the streaming API (stream.datasift.com).

Endpoint                             Supported?
http://api.datasift.com/validate     No
http://api.datasift.com/compile      No
http://api.datasift.com/stream       No
http://api.datasift.com/dpu          No
http://api.datasift.com/usage        No
http://stream.datasift.com           Yes
ws://websocket.datasift.com          No *

* Note that there are two endpoints for the streaming API:

  • stream.datasift.com for HTTP access
  • websocket.datasift.com for WS access

The DataSift client libraries use HTTP streaming by default for most languages. There is no technical restriction on hitting the websockets API endpoint. The choice of protocol makes no difference to the data objects that are returned.

 


Running the PHP Examples

Download the PHP Client Library

Download a copy of the library from our GitHub PHP page. There are two ways to do this.

  • In GitHub, click Download and extract the archive manually
  • Use git from the command line:

    git clone https://www.github.com/datasift/datasift-php.git

Learn to Use the PHP Client Library

These steps test the REST API (api.datasift.com) and then the streaming API (stream.datasift.com).

Endpoint                             Supported?
http://api.datasift.com/validate     Yes
http://api.datasift.com/compile      Yes
http://api.datasift.com/stream       Yes
http://api.datasift.com/dpu          Yes
http://api.datasift.com/usage        Yes
http://stream.datasift.com           Yes
ws://websocket.datasift.com          No *

* Note that there are two endpoints for the streaming API:

  • stream.datasift.com for HTTP access
  • websocket.datasift.com for WS access

The DataSift client libraries use HTTP streaming by default for most languages. There is no technical restriction on hitting the websockets API endpoint. The choice of protocol makes no difference to the data objects that are returned.


Running the Ruby Examples

Download the Ruby Client Library

Download a copy of our Ruby client library from GitHub. There are two ways to do this.

  • In GitHub, click Download and extract the archive manually
  • Use git from the command line:

    git clone https://www.github.com/datasift/datasift-ruby.git

Learn to Use the Ruby Client Library

These steps test the REST API (api.datasift.com) and then the streaming API (stream.datasift.com).

Endpoint                             Supported?
http://api.datasift.com/validate     Yes
http://api.datasift.com/compile      Yes
http://api.datasift.com/stream       Yes
http://api.datasift.com/dpu          Yes
http://api.datasift.com/usage        Yes
http://stream.datasift.com           Yes
ws://websocket.datasift.com          No *

* Note that there are two endpoints for the streaming API:

  • stream.datasift.com for HTTP access
  • websocket.datasift.com for WS access

The DataSift client libraries use HTTP streaming by default for most languages. There is no technical restriction on hitting the websockets API endpoint. The choice of protocol makes no difference to the data objects that are returned.


Data Augmentation

As well as concentrating the raw data of multiple social media channels into a single stream, DataSift adds value to each data object by augmenting it with other useful data. For example, the standing of the author may be available, or the sentiment or language of the message. This enables you to write more incisive filters and receive more valuable results.

There are different types of augmentation:

  • the expansion of any links included in the message, plus the titles of the pages to which they point
  • the results of sentiment or language analysis of the data object's message
  • profile data about the author or subject, tracked by other providers such as Klout or Trends

DataSift uses data and analysis from the following providers to bring this extra insight:

Not all data sources receive all types of augmentation. Language, Salience and Links are available for all sources; the other augmentations are currently available only for Twitter.

 

 


API Authentication

You can authenticate with DataSift's APIs in two ways:

  • Use HTTP headers
  • Place all the parameters in a GET/POST request

 

HTTP Headers

This is the recommended approach for production environments.

You need to include a header that contains: "Auth: username:apikey"

Most client libraries already support this authorization system.
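If you are not using a client library, header-based authentication could be set up along these lines. This Python sketch uses the standard urllib module and only builds the request without sending it; the credentials are placeholders, and the choice of the /usage endpoint is just an example.

```python
import urllib.request

username = "me"   # placeholder credentials
api_key = "888"

request = urllib.request.Request("https://api.datasift.com/usage")
# The header format follows the description above: "Auth: username:apikey".
request.add_header("Auth", f"{username}:{api_key}")

# Nothing is sent until urllib.request.urlopen(request) is called.
print(request.get_header("Auth"))
```

Keeping credentials in a header rather than in the URL means they are less likely to end up in logs or browser history.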

 

GET/POST Requests

Accessing a DataSift API endpoint using GET/POST parameters is the simpler approach. It is sometimes useful for testing. Here's an example that calls the REST API to compile a CSDL filter:

    https://api.datasift.com/compile?csdl=<csdl>&username=<username>&api_key=<api_key>

Note that when you hit the Streaming API using the Websocket protocol, you must authenticate in the URI using GET parameters - there is no other option. In this case, the URI format is:

    ws://websocket.datasift.com/<hash>?username=<username>&api_key=<api_key>
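Both URI formats above can be assembled safely with a small helper that URL-encodes the parameters. The following Python sketch is an illustration only: the function names rest_url and websocket_url, the sample CSDL, and the credentials are assumptions.

```python
from urllib.parse import urlencode


def rest_url(endpoint, username, api_key, **params):
    """Build a REST API URL that authenticates with GET parameters."""
    query = urlencode({**params, "username": username, "api_key": api_key})
    return f"https://api.datasift.com/{endpoint}?{query}"


def websocket_url(stream_hash, username, api_key):
    """Build a WebSocket URI for the stream with the given hash."""
    query = urlencode({"username": username, "api_key": api_key})
    return f"ws://websocket.datasift.com/{stream_hash}?{query}"


compile_url = rest_url(
    "compile", "me", "888", csdl='interaction.content contains "music"'
)
print(compile_url)
print(websocket_url("your_stream_hash", "me", "888"))
```

Encoding matters for the csdl parameter in particular, because CSDL usually contains spaces and quotation marks that are not valid in a raw URL.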


API FAQ

 

Basics

What's an API?

The acronym "API" stands for "Application Programming Interface". An API is a defined way for a program to accomplish a task, usually by retrieving or modifying data. In DataSift's case, we provide API methods for all the core functionality. Programmers use the DataSift API to build applications that work with our platform. Programs talk to the API using HTTP or websockets.

How do I find my API key?

NOTE: your API key and DataSift username are both case sensitive.

To use the DataSift API, the first thing you need is your API key. 

1. Log in to DataSift.

2. Go to the Dashboard or to the Settings page.

3. Click the Copy to Clipboard icon under "Developer API Key".

Note that DataSift does not display the API key until you have purchased credits. If you have no credits yet, you cannot make an API call.

How do I find the hash for my stream?

1. Log in to DataSift.

2. Click on the Streams tab.

3. Select a stream.

4. Click the green "Use stream" button.

What's an interaction?

In DataSift, an interaction is one object from a source website. For example, an interaction from Twitter is a Tweet, including all the meta information such as the name of the author, the number of followers the author has, the number of people the author follows, and so on. It can also include augmentation information such as the language in which the content is written and the sentiment, positive or negative, conveyed in the message. An interaction is typically delivered to your client application as a JSON object.

What's a filter?

Filters sit at the very heart of DataSift's engine. You write them in our CSDL programming language. You can think of a filter as the logic that decides which input objects DataSift will deliver to you and which ones it will discard.

What's a stream?

A stream is the output of a filter. The terms are quite close. In fact, you might hear some developers refer to filters as streams.

What's a target?

A target is an individual field of information supplied by one of our social media partners such as Twitter, by a third-party augmentation such as Klout, or by additional processing performed by DataSift itself, such as language analysis or gender detection. For example:

  • twitter.text contains the 140 characters of a Tweet
  • klout.score contains an author's overall score on Klout
  • language.tag contains the 2-character language code that identifies the language in which the post is written

What's a post?

A post is a generic DataSift term for a message from one of our social media partners. Posts can be Tweets, blog entries, blog comments, Myspace content, and so on.

What's CSDL?

CSDL is our programming language, the Curated Stream Definition Language. Every DataSift developer learns CSDL because it is the language you use to write filters. It's a very simple, compact compiled language that runs exceptionally quickly and is easy to learn.

How many streams can I create?

There is currently no limit to the number of streams you can create through our Streaming or REST APIs. You can create a maximum of 1,000 streams through our GUI. For full details on our API usage policy, please see API Rate Limiting.

What's JSON?

JSON (JavaScript Object Notation) is the default format for the data that our APIs return. Wikipedia introduces JSON as "a lightweight text-based open standard designed for human-readable data interchange."

How do I switch from JSON to JSONP?

To use JSONP you need to understand how to:

  • Write a JavaScript callback function
  • Include that function in your API call
  • Make sure you specify JSONP in your API call

You define your JavaScript callback function on your web page and name it as your src parameter when you call a DataSift API. The callback function needs to be able to process standard DataSift objects but also be able to handle the other types of message that it might receive, such as ticks or error messages. A tick simply indicates that a connection is open but receiving no data. Ticks look like this:

    {"tick":1336057708,"status":"initialised","message":"Waiting for data"}

Your API call needs to include the name of the callback function. For example, suppose that you're using this API call with JSON and, to keep things simple, suppose that your username is "me" and your api_key is 888:

    http://stream.datasift.com/usage?username=me&api_key=888

To move from JSON to JSONP, with a callback function called xyz, you would change this API call to this:

    http://stream.datasift.com/usage.jsonp?username=me&api_key=888&callback=xyz

Notice that we changed the endpoint from usage to usage.jsonp to instruct DataSift to use JSONP.
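Whatever language your callback is written in, it needs to tell ticks and errors apart from ordinary objects. The idea can be sketched in Python as follows; the classify function is hypothetical, and the assumption that error messages carry an "error" key and interactions an "interaction" key is ours, not something stated above.

```python
import json


def classify(message):
    """Return 'tick', 'error', or 'interaction' for a decoded message."""
    if "tick" in message:
        return "tick"
    if "error" in message:  # assumed key for error messages
        return "error"
    return "interaction"


tick = json.loads(
    '{"tick":1336057708,"status":"initialised","message":"Waiting for data"}'
)
post = json.loads('{"interaction": {"content": "hello"}}')

print(classify(tick), classify(post))
```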

 

 

I need something!

How do I save DataSift objects to a database?

We've created a PHP example to illustrate how to write code that works with a database.

Where are the API Client Libraries?

They're held on GitHub, the code-sharing site. Here's what they say about their site: "GitHub has grown into an application used by over a million people to store over two million code repositories, making GitHub the largest code host in the world."

Our Client Libraries page has the up-to-date list of the libraries we offer.

How do I keep up with changes to the API?

We make all our development announcements on Twitter. Just follow @DataSiftDev. We'd love you to follow @DataSift, too.

 

 

Something isn't working!

Is your service down?

We are constantly striving to build the best platform to unlock the power and data of social media. From time to time it may be necessary to disrupt our services to perform maintenance and upgrades. You can check the status of DataSift, and any scheduled work, on the DataSift Status Dashboard. Any changes to the platform status are also announced on @DataSiftAPI.

What do I do when I hit the rate limit?

The rate limit is designed to ensure that everyone plays fair. You can use the Streaming API and the /stream endpoint of the REST API without hitting limits, but DataSift applies limits to activities such as compiling or validating CSDL code through the REST API. Each REST API endpoint has its own rate limit cost. If you hit the limit, you might have to wait for up to one hour.

Why did the hash for my stream change?

If you're working in the GUI, it's because you edited the stream. If you're compiling using the /compile endpoint of the REST API, it's because you sent new CSDL. The bottom line here is that if you change the CSDL, the hash will change.

 


Billing FAQ

 

Basics

How do I determine the total cost of running a stream?

The total cost of running a stream depends on the complexity of the stream (which we measure in DPUs), how long you run the stream for, and the number of output objects it produces.

For example, each Tweet that we deliver to you costs $0.0001. If you create a stream that costs 1 DPU and run it for 6 hours, and receive 4,000 Tweets, the cost will be:

       1 * 6 * 0.20 + 0.0001 * 4000  =  $1.60
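The same calculation can be written as a small Python function, which makes it easy to try other values. The prices are the ones quoted in this FAQ; the function and constant names are our own, and the sketch ignores the minimum hourly charge and any subscription discount.

```python
DPU_PRICE_USD = 0.20        # price of 1 DPU for one hour
TWEET_LICENSE_USD = 0.0001  # license fee per delivered Tweet


def stream_cost(dpus, hours, tweets):
    """Processing cost plus licensing cost for a single stream."""
    processing = dpus * hours * DPU_PRICE_USD
    licensing = tweets * TWEET_LICENSE_USD
    return processing + licensing


cost = stream_cost(dpus=1, hours=6, tweets=4000)
print(f"${cost:.2f}")
```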

You cannot predict the license fee in advance because it is impossible to predict how many messages users will post. News stories, by their very nature, often come as complete surprises. However, you might decide to make an estimate of the traffic a stream will generate by running a test, either via DataSift's API or GUI, for a few minutes and extrapolating those results.

How do I determine the total cost of running a Historics query?

The cost of running a Historics query depends on data processing usage plus the licensing costs. Data processing usage is calculated based on the duration of the Historics query and the sample size of the output data; it is deducted from the monthly DPU usage. Licensing costs depend on the volume of data retrieved.

For example, suppose you create a Historics query for a simple stream of Tweets that costs 0.1 DPU, over a timeframe of one month, with a 10 percent sample of the output data. The data processing usage for this Historics query is calculated to be 288 DPUs, which is deducted from your monthly DPU allowance. According to the usage statistics, the volume of data retrieved is a total of 1,212,194 augmentations and sources. Since Twitter charges $0.10 for every 1,000 Tweets that we retrieve for the query, the licensing cost of the Historics query will be approximately $121.22.
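The licensing part of this example is a simple proportion, sketched below in Python. The variable names are ours, and the sketch assumes that every retrieved object is charged at the Twitter rate quoted above.

```python
PRICE_PER_1000_TWEETS_USD = 0.10
objects_retrieved = 1_212_194

# Licensing cost scales linearly with the volume of data retrieved.
licensing_cost = objects_retrieved / 1000 * PRICE_PER_1000_TWEETS_USD
print(f"{licensing_cost:.4f}")
```

The result is 121.2194 before rounding to cents.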

What are licensing fees?

DataSift charges licensing fees on behalf of our partner sites such as Twitter. The license fee that you pay is directly proportional to the number of objects your stream produces or, in the case of Historics queries, the number of objects retrieved by your query.

If you create very highly targeted streams, you should typically expect to receive a low volume of data, so your license fees will be very low. For example, a filter for Tweets about hippopotamuses sent by authors with an unusually high Klout score within a radius of 20 miles of San Diego Zoo isn't going to generate very much output, even on days when the hippos do something exceptional.

On the other hand, a filter that looks for any mention of, say, music will probably generate substantially more output and cost more in license fees.

What's a DPU?

A DPU is a Data Processing Unit, a reflection of the computational complexity for the processing that you perform on the DataSift platform. A higher number represents a more complex stream. We measure DPUs on a per-hour basis because running a stream for five hours costs five times as much as running it for one hour.

What is the cost of running a 1 DPU stream for one hour?

It costs 20 US cents. If you purchase a subscription, you will benefit from a discount.

How long does it take to run a Historics query?

It depends on the capacity of our data archive at the time of creating a Historics query, as well as the timeframe and sample size of your Historics query. The timeframe of the query is the duration between the start date and time, and the end date and time of the query. The sample size of the output data can be either 100 percent or 10 percent of all the available data.

When you create a Historics query, it needs to access our data archive in order to retrieve output data for a selected timeframe. Our data archive could be very busy when multiple Historics queries are running at the same time. If the data archive is running over capacity, your query will be queued; that is, it will have to wait for access until other queries accessing the data archive have been executed. Although the queuing process takes a little time, keep in mind that Historics queries, once they are running, retrieve data 30 times faster than a real-time filter.

Once your Historics query has access to our data archive, it then depends on the timeframe of your query and the sample size of the output data. A Historics query with a shorter timeframe and a sample size of 10 percent is likely to execute more quickly than a query with a longer timeframe and a sample size of 100 percent.

What if I stop a Historics query midway? Will I still be charged for it?

Yes, you will be charged even if you stop a Historics query midway. You will be charged for the licensing costs of the volume of data retrieved until you stopped the query. The data processing usage until you stopped the query will also be deducted from your monthly DPU usage.

What's the difference between credits and DPUs?

You purchase credits on the DataSift platform. Credits are priced in US dollars.

A credit costs $1. One credit is equivalent to 5 DPUs so the effective price of one DPU is 20 cents.

Why do you charge using DPUs?

DataSift believes customers should only pay for what they consume. DataSift is a cloud platform, allowing you to consume only what you need and retain the flexibility to scale, either up or down, whenever necessary. The DPU amount that you pay is determined by the complexity of the rules you create.

What are the benefits of this flexible pricing?

Applications need to handle dynamic loads to survive. DataSift provides dynamic vertical scaling to handle unexpected data spikes as well as horizontal build out to support application growth over time.

How can I optimize my streams to use fewer DPUs?

Visit our Optimization page first. If you need further guidance, please contact our Support team.

How can I check my billing?

Via the REST API:

  • For streams, hit the /dpu endpoint to find details of DPUs for a stream.
  • For streams, hit the /usage endpoint to find how many objects DataSift has delivered to you.
  • For Historics queries, hit the /historics/prepare endpoint to get details of DPUs for your Historics query.

Via the DataSift GUI, visit our Billing page.

Why is my DPU cost higher than I expected?

The minimum DPU charge is currently 1 DPU per hour, no matter how simple your stream is.

If you run just one stream that costs just 0.1 DPU, the total charge is 1 DPU per hour, which equates to 20 cents. In other words, the minimum DPU cost to use the platform is 20 cents per hour.

However, if you run ten streams, and they all cost 0.1 DPU, DataSift will still charge only 1 DPU per hour for all ten.
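One way to picture the minimum charge is as a floor applied to the combined DPU cost of all your running streams. The Python sketch below encodes that reading; the function name is ours, and the assumption that costs above the minimum simply add up is an inference from the examples above, not a statement of the billing rules.

```python
MIN_DPUS_PER_HOUR = 1.0
DPU_PRICE_USD = 0.20


def hourly_processing_charge(stream_dpus):
    """Hourly charge in US dollars for a list of per-stream DPU costs."""
    # Assumption: the 1 DPU floor applies to the combined total.
    billed_dpus = max(MIN_DPUS_PER_HOUR, sum(stream_dpus))
    return billed_dpus * DPU_PRICE_USD


print(f"{hourly_processing_charge([0.1]):.2f}")       # one 0.1 DPU stream
print(f"{hourly_processing_charge([0.1] * 10):.2f}")  # ten 0.1 DPU streams
```

Under this reading, one simple stream and ten simple streams cost the same per hour, so it can pay to run several small streams together.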

Why can't I pay a fixed price?

You can pay a fixed price for the processing but not for the licensing. We offer a range of subscriptions which include prepaid DPUs. The license cost of the content is variable and depends on the number of objects your stream returns. Clients who choose to prepay for DPUs benefit from a discount.

What is the charge for a Historics Preview?

A Historics Preview is charged at a fixed cost of 20 DPUs per request. 20 DPUs are deducted from your account after a complete and successful execution of your request. 

Terms of Service

How much latency does DataSift introduce?

Applications succeed or fail based on performance. DataSift dynamically distributes workload into under-utilized CPU and memory resources to provide best-in-class service delivery with very low latency. DataSift consistently performs faster than competitors. A new Tweet, for example, is likely to be available on our platform 1 to 2 seconds after it appears on Twitter.

Do you offer an SLA?

Yes! We are able to offer the most reliable and comprehensive SLA to our customers because we host our own dedicated cloud infrastructure. It is built on a massive-scale Service-Oriented Architecture, giving our customers peace of mind that we will continue running when others fail.
For more details, please take a look at our Terms and Conditions.


Understanding Billing

Billing details are explained in full on our Terms page. The cost of a stream depends on how many operators you include, and some operators are more expensive than others. All streams have a fixed cost; some have a variable cost too because some data suppliers charge for their content. Billing for Historics works in the same way. This page explains in detail how our billing system works.

Don't forget to take a look at our Billing FAQ too.

Overview

You can compile and preview streams free of charge through the website GUI.

There is a charge to use streams through our APIs. The cost of using a stream via the API is a function of two variables:

 

Data processing effort required to execute the rule

Each rule is assigned an hourly data processing effort, measured in data processing units (DPUs), according to an analysis of its complexity. The simplest rule incurs an hourly cost of 0.1 DPU. However, note that DataSift's minimum charge rate is 1 DPU per hour. Therefore, you can run ten 0.1 DPU streams simultaneously for the same overall DPU cost as one.

Interaction throughput of the rule

The interaction throughput of a rule is the number of data objects it delivers. The cost of accepting a data object depends on the object's source and the licensing agreement we have with the provider. For example, each accepted Tweet costs $0.0001*. That means, if you accept 1,000* Tweets the cost will be $0.10. Note that in order to receive data objects, you must sign the license agreements for a number of data sources, including Twitter, on the license page of the website. 

 

*Subject to change.

Payment plans

There are two types of payment plan which differ in the approach to charging for the data processing cost, allowing you to optimize according to your usage pattern:

 

On Demand

Each DPU is charged at a fixed rate of $0.20 per hour* so, for example a rule rated at 1.5 DPU is charged at $0.30* per hour.
 

Note that DataSift's minimum charge is $0.20 per hour, so a 0.5 DPU rule would cost $0.20 per hour. If you use DataSift's multistream capability to run 10 streams simultaneously, each rated at 0.1 DPU, the total is 1 DPU and so the total cost to run all ten streams is $0.20 per hour.
 

Whenever you want, you can buy credits in increments of $10 which allow you to run streams. As your streams run we continuously compute the combined DPU and throughput cost and reduce your credit balance. If your balance drops to zero, your streams stop until you top up your balance.
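
To get a feel for how quickly an On Demand credit balance is consumed, the DPU charge and the throughput charge can be combined in a rough estimate. The sketch below is in Python and uses the prices quoted on this page (both subject to change). The function names are hypothetical, and it assumes that only Tweets are delivered and that throughput stays constant, which real streams rarely do.

```python
# Rough On Demand estimate: DPU charge plus licensing charge per hour.
# Prices are the ones quoted in the text; names are illustrative only.
DPU_RATE = 0.20       # dollars per DPU per hour
MIN_DPUS = 1.0        # minimum billable DPUs per hour
TWEET_PRICE = 0.0001  # dollars per accepted Tweet


def hourly_cost(total_dpus, tweets_per_hour):
    """Combined hourly cost in dollars for the given DPU rating and throughput."""
    dpu_cost = max(total_dpus, MIN_DPUS) * DPU_RATE
    license_cost = tweets_per_hour * TWEET_PRICE
    return round(dpu_cost + license_cost, 4)


def hours_of_credit(balance, total_dpus, tweets_per_hour):
    """Estimated hours until the balance reaches zero at constant throughput."""
    return balance / hourly_cost(total_dpus, tweets_per_hour)


# A 1.5 DPU rule delivering 1,000 Tweets per hour, paid for with $10 of credit.
print(hourly_cost(1.5, 1000))
print(hours_of_credit(10, 1.5, 1000))
```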

 

Monthly subscription

You agree to buy a fixed number of DPU hours per month for a fixed price. As your streams run they consume your fixed DPU allowance and, separately, incur a variable licensing fee. The licensing fees are calculated depending on the licensing agreement we have with the provider. Assuming you don't exhaust your DPU allowance, your monthly bill will be the fixed cost plus the licensing fees. If you do exceed your DPU allowance, the excess DPU hours are charged at the on-demand rate.
 

You must also set a variable cost limit for your monthly subscription to DataSift. The variable cost limit is the sum of:

  • Licensing costs from your data sources and augmentations
  • Excess DPU costs if you run over your monthly DPU allowance 

As long as the combined total of your license costs and excess DPU costs are less than your set variable cost limit, you will be able to consume data normally.
 

But if you run over your variable cost limit, your streams will stop, and you will receive the following error message as part of your stream:

  {"error":"You need to have credits or a valid subscription to use the API."}
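
The monthly subscription rules above can be sketched as a small calculation. This Python example is illustrative only: the function names and the sample figures are hypothetical, and the on-demand rate of $0.20 per DPU hour is taken from the On Demand section and may change.

```python
# Sketch of a monthly subscription bill as described above.
ON_DEMAND_RATE = 0.20  # dollars per excess DPU hour (from the On Demand section)


def monthly_bill(fixed_price, dpu_allowance, dpu_hours_used, licensing_fees):
    """Return the variable cost and the total bill in dollars."""
    excess_hours = max(dpu_hours_used - dpu_allowance, 0)
    excess_cost = excess_hours * ON_DEMAND_RATE
    variable_cost = licensing_fees + excess_cost
    return {
        "variable_cost": round(variable_cost, 2),
        "total": round(fixed_price + variable_cost, 2),
    }


def streams_can_run(variable_cost, variable_cost_limit):
    """Streams keep running only while the variable cost is below the limit."""
    return variable_cost < variable_cost_limit


# Hypothetical month: 500 DPU hours over the allowance and $400 of licensing.
bill = monthly_bill(3000, 10000, 10500, 400)
print(bill)
print(streams_can_run(bill["variable_cost"], 2500))
```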

 

*Subject to change.

Whereas a rule's data processing rate is certain as soon as it is defined, its throughput is impossible to predict; it can only be estimated. You might want to run some sample executions to get a feel for the throughput cost of a stream.

Notifications

The DataSift billing system calculates the cost of using streams from the DPU rate and licensing costs. DataSift also allows you to monitor your usage by enabling notifications via email and the Dashboard. The notifications vary depending on the type of payment plan you are on.

On Demand

If you choose the On Demand plan, you will receive notifications if your credit balance runs low or falls to zero.

Monthly subscription

In a monthly subscription, you can set a variable cost limit on your account. You will receive notifications when you are close to and if you reach your variable cost limit. You can set or change your variable cost limit any time during the billing cycle.

The first notification is triggered when you have used up 80 percent of your variable cost limit. For example, if you set your variable cost limit to $2,500, you will receive the first notification when you have used up $2,000 on your account. You will receive the second notification when you reach your variable cost limit, at which point we will stop your streams. It is good practice to monitor your usage and ensure that your variable cost limit is always high enough for you to be certain that you will not have any problems for the duration of the month.
 

          

[Figure: Preview of notifications in Dashboard]

[Figure: Preview of notifications via email]
 

If you notice that you are close to your variable cost limit and then you raise it, you might be below 80 percent of the new limit or you might be above 80 percent of the new limit; it all depends on where you set your new limit.

For example, if you set the variable cost limit to $2,000 on your account, you receive the first notification when you have used up $1,600. Suppose that you receive that notification and you raise the variable cost limit to $2,500. There are two possible scenarios to consider:

  • If your usage is below 80 percent of the new variable cost limit (that is, below $2,000), you will receive both notifications: a warning when you reach 80 percent of the new limit and then a notification when you reach the limit itself.
  • If your usage is already above 80 percent of the new limit, you will receive only one notification, when you reach your new variable cost limit.
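
The two scenarios above can be checked with a few lines of Python. This is a sketch only: the function name and the notification labels are hypothetical, and the 80 percent threshold is the one described in this section.

```python
# Which notifications are still to come after a variable cost limit is set or raised?
WARNING_PERCENT = 80  # first notification threshold, from the text


def pending_notifications(spent, limit):
    """Return the notifications you can still expect for the current limit."""
    pending = []
    # Integer arithmetic avoids floating-point surprises at the boundary.
    if spent * 100 < WARNING_PERCENT * limit:
        pending.append("warning at 80 percent")
    if spent < limit:
        pending.append("limit reached")
    return pending


# Limit raised from $2,000 to $2,500 after the first warning at $1,600.
print(pending_notifications(1600, 2500))
# Same new limit, but usage has already passed $2,000.
print(pending_notifications(2100, 2500))
```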

 

Billing for Historics queries

You can use Historics queries if you are on a monthly subscription, subject to one-time activation by your account manager. The cost of running a Historics query depends on data processing usage plus licensing costs, and the original DPU complexity of the stream you are running the query on.

Data processing usage for Historics is calculated based on the duration and sample size of the output data. The duration of the query is determined using the timeframe of the query, that is the duration between the start date and time, and the end date and time of the query. The sample size of the output data can be either 100 percent or 10 percent. For all Historics queries, there is a premium on the DPU usage compared to usage for live streaming. DPU usage for the 100% sample size is 125% of what you would pay for live streaming of the same filter. Similarly, for the 10% sample size, the DPU usage is 40% of what you would pay for live streaming of the same filter.

Hence, when you create a Historics query, DataSift is able to calculate the DPU usage before the query is executed. This DPU usage information is displayed on the Confirm New Historic Query page. When running a Historics query through the Historics API, you need to hit the historics/prepare endpoint to create a Historics query and get the total DPU breakdown for your Historics query before it is executed. DPU usage charges are deducted from the monthly DPU allowance. 
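
The DPU side of a Historics query can be estimated ahead of time along the following lines. This Python sketch is not DataSift's implementation. It assumes that "what you would pay for live streaming" means the filter's hourly DPU rating multiplied by the number of hours the query covers, and it ignores any minimum charge. The function name is hypothetical; the historics/prepare endpoint remains the authoritative source for the actual figure.

```python
# Estimate Historics DPU usage from the live-streaming DPU rating of a filter.
# Multipliers come from the text: 125% for a 100% sample, 40% for a 10% sample.
SAMPLE_MULTIPLIER = {100: 1.25, 10: 0.40}


def historics_dpu_usage(live_dpus_per_hour, hours, sample_percent):
    """Return the estimated DPU usage for a Historics query."""
    if sample_percent not in SAMPLE_MULTIPLIER:
        raise ValueError("sample size must be 100 or 10 percent")
    live_usage = live_dpus_per_hour * hours  # assumed live-streaming equivalent
    return round(live_usage * SAMPLE_MULTIPLIER[sample_percent], 4)


# A 2 DPU filter run over a 24-hour timeframe.
print(historics_dpu_usage(2, 24, 100))
print(historics_dpu_usage(2, 24, 10))
```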

On the other hand, licensing costs are calculated based on the volume of data retrieved for a particular Historics query. For a given CSDL filter, licensing costs for a Historics query of 100 percent sample size will be more than for a Historics query of 10 percent sample size.

You can view usage statistics for Historics queries on the Billing page, including total licensing costs and DPU usage. You can also view the volume of data retrieved by a Historics query and the number of Historics hours used. Alternatively, you can hit the usage endpoint in the DataSift API, which will give you a more accurate figure for the number of objects processed.
 

          

Billing for Historics Preview

Historics Preview is available for all accounts, on any payment plan, be it Subscription or Pay As You Go. Each request has a fixed cost of 20 DPUs. There are no licensing fees charged for a Historics Preview since you will not actually receive any interactions matching your filter. You will only receive aggregate statistics for your selected filter.

The 20 DPUs are deducted from your account only after a complete and successful execution of your Historics Preview request. If your request gets interrupted while it is being processed, you won't get charged. You can only request a single Historics Preview per stream; if you request a new one, the previous request is overwritten.
 

Billing information

Find your DPU cost via the GUI

In DataSift's GUI you can check the DPU breakdown:

1. Select a stream

2. Click View Definition

The DPU breakdown appears below your CSDL code.

Find your DPU cost via the API

DataSift's REST API provides a dpu endpoint that gives the total DPU cost for a rule and the breakdown of its individual elements. 

    api.datasift.com/dpu

For Historics, DataSift's REST API provides a historics/prepare endpoint that gives the total DPU breakdown for a Historics query.

    api.datasift.com/historics/prepare

Find your throughput via the API

DataSift's REST API provides a usage endpoint that gives the number of objects processed.

    api.datasift.com/usage


Cost of operators

Some operators in CSDL have a fixed DPU cost while others have a variable cost.

For fixed-cost operators you simply multiply the number of times you use the operator in a stream by its DPU cost. For example, if you use the substr operator twice in a stream the cost is 0.2 DPUs.

Operator or Keyword              DPUs
contains                         variable - see below
substr                           0.1
contains_any                     variable - see below
contains_near                    0.2
exists                           0.1
in                               variable - see below
comparisons (==, > and so on)    0.1
regular expressions              variable - see below
geo_box                          0.1
geo_radius                       0.1
geo_polygon                      variable - see below
tag                              variable - see below


Regular Expressions

The DPU cost of a regular expression is calculated as:

          cost = the number of characters in the expression divided by 100.

The minimum charge for one regular expression is 0.1 so, for example, a regular expression that includes 10 characters costs 0.1 DPUs while a regular expression that includes 100 characters costs 1.0 DPUs.
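
As a quick illustration, the rule above translates into a one-line calculation. This Python sketch assumes the cost is the raw character count divided by 100 with no further rounding, which the text does not specify; the function name is hypothetical.

```python
# DPU cost of a regular expression: characters / 100, with a 0.1 DPU minimum.
def regex_dpu_cost(pattern):
    """Return the estimated DPU cost of a regular expression."""
    return max(0.1, len(pattern) / 100)


print(regex_dpu_cost("a" * 10))   # a 10-character expression
print(regex_dpu_cost("a" * 100))  # a 100-character expression
```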

geo_polygon 

The DPU cost of a geo_polygon depends on the number of vertices it has. To determine the DPU cost of any geo_polygon, divide the number of vertices by 30.

For example, a hexagon has 6 vertices so it has a DPU cost of 0.2. A triangle has 3 vertices so it has a DPU cost of 0.1.
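
The same arithmetic in Python, as a minimal sketch. The function name is hypothetical, and the text does not say how DataSift rounds results that are not exact multiples of 0.1, so no rounding is applied here.

```python
# DPU cost of a geo_polygon: number of vertices divided by 30.
def geo_polygon_dpu_cost(vertices):
    """Return the estimated DPU cost of a polygon with the given vertex count."""
    if vertices < 3:
        raise ValueError("a polygon needs at least three vertices")
    return vertices / 30


print(geo_polygon_dpu_cost(6))  # hexagon
print(geo_polygon_dpu_cost(3))  # triangle
```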

contains

The DPU cost for the contains operator is based on the number of values you match against and the way you use the operator.

Using the contains operator to find a phrase

    twitter.text contains "My dog ate my homework"

In this case, you can match against a phrase of up to seven words for a cost of 0.1 DPU. The cost increases by 0.1 DPU for every eight extra words you add to the matching phrase. Here are the first few DPU cost bands.

Maximum number of values    DPUs
7                           0.1
15                          0.2
23                          0.3
31                          0.4
39                          0.5
and so on...

For example this filter has just one word in the argument so it costs 0.1 DPU:

    twitter.text contains "iPad"

This filter has eight words in the argument so it costs 0.2 DPU:

    twitter.text contains "iPad is my favorite tablet device right now"
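
The cost bands for a phrase can be reproduced with integer division. The Python sketch below assumes that words are counted by splitting the phrase on whitespace and that the bands continue in steps of eight beyond the values listed in the table. Both are assumptions, and the function name is hypothetical; the DPU breakdown for the compiled stream is the authoritative figure.

```python
# Estimated DPU cost of "contains" with a phrase, following the bands above:
# up to 7 words cost 0.1, up to 15 cost 0.2, up to 23 cost 0.3, and so on.
def contains_phrase_dpu(phrase):
    """Return the estimated DPU cost of matching the given phrase."""
    words = len(phrase.split())  # assumption: words are whitespace-separated
    bands = words // 8 + 1
    return bands / 10


print(contains_phrase_dpu("iPad"))
print(contains_phrase_dpu("iPad is my favorite tablet device right now"))
```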

Using the contains operator to find individual words

    twitter.text contains "xxx" and
    twitter.text contains "yyy" and
    twitter.text contains "zzz"

In this case, matching against up to three values costs 0.1 DPU. The cost increases by 0.1 DPU for every four extra values you add. Here are the first few DPU cost bands.

Maximum number of values    DPUs
3                           0.1
7                           0.2
11                          0.3
15                          0.4
19                          0.5
and so on...
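
A table lookup is enough to reproduce these bands. The following Python sketch covers only the rows listed above and raises an error beyond them rather than guessing how the bands continue; the function name is hypothetical.

```python
from bisect import bisect_left

# Upper limits of the bands listed in the table above (0.1 DPU per band).
BAND_LIMITS = [3, 7, 11, 15, 19]


def separate_contains_dpu(value_count):
    """Return the DPU cost for a number of individual contains conditions."""
    if value_count < 1:
        raise ValueError("at least one value is required")
    band = bisect_left(BAND_LIMITS, value_count)
    if band >= len(BAND_LIMITS):
        raise ValueError("beyond the bands listed in the table")
    return (band + 1) / 10


print(separate_contains_dpu(3))  # the three-condition filter shown above
print(separate_contains_dpu(4))  # one more condition moves to the next band
```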

in/contains_any

The DPU cost for the in and contains_any operators is based on the number of values you match against. The following table shows the DPU cost for any filter that uses these operators.

For example, this filter matches against 10 values so it costs 0.2 DPUs.

    twitter.text contains_any "apple, microsoft, hp, dell, oracle, google, yahoo, ebay, amazon, facebook"
 

Maximum number of values    DPUs
9                           0.1
19                          0.2
29                          0.3
39                          0.4
...
100                         1
1,000                       2
10,000                      4
100,000                     8

The exact cost is determined using a sliding scale, so if you have 99 values in the command, the cost will be slightly lower than 1 DPU. Note that the table shows how we calculate DPU costs for a list of single keywords. In practice, you will often write filters that use the contains_any keyword with a list of phrases of varying length. For example:

    twitter.text contains_any "Yesterday, Yellow Submarine, The Long and Winding Road"

Since phrases take longer for DataSift to process than single keywords, the DPU cost is slightly higher. For example, a list of 30 single keywords with the contains_any operator incurs a DPU cost of 0.4. However, if you filter for 10 phrases, each of three words, the DPU cost is 0.5.
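
For lists of single keywords, the first rows of the table can be reproduced as follows. This Python sketch is deliberately limited: it handles only lists of up to 39 values, because the sliding scale used for larger lists is not specified here, and it does not model the extra cost of multi-word phrases. The function name is hypothetical.

```python
# Estimated DPU cost of in / contains_any for a comma-separated list of
# single keywords, using the first four bands of the table above.
def contains_any_dpu(argument):
    """Return the estimated DPU cost for a comma-separated list of keywords."""
    values = [v.strip() for v in argument.split(",") if v.strip()]
    count = len(values)
    if count == 0:
        raise ValueError("the list is empty")
    if count > 39:
        raise ValueError("the sliding scale above 39 values is not modelled")
    return (count // 10 + 1) / 10


keywords = "apple, microsoft, hp, dell, oracle, google, yahoo, ebay, amazon, facebook"
print(contains_any_dpu(keywords))
```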

We recommend that you check the DPU cost before you run a filter. The /compile endpoint returns a JSON object that includes the DPU cost.

Tags

Operators used inside a tag statement are normally charged at 10% of their usual DPU cost.  

For example, if the normal cost of a rule is 1 DPU, that same code inside a tag statement would cost 0.1 DPU.

If the normal cost is less than 1 DPU, there is no charge.
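
Expressed as code, the tag rule looks like this. The Python sketch takes the two statements above at face value (10% of the normal cost, and no charge when the normal cost is under 1 DPU); the function name is hypothetical.

```python
# DPU cost of operators used inside a tag statement.
def tag_dpu_cost(normal_cost):
    """Return the DPU cost of code inside a tag, given its normal DPU cost."""
    if normal_cost < 1:
        return 0.0  # no charge below 1 DPU
    return round(normal_cost * 0.10, 4)


print(tag_dpu_cost(1))    # normal cost of 1 DPU
print(tag_dpu_cost(0.5))  # normal cost below 1 DPU
```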


Understanding the Output Data

For each data source, discover how to fine tune the data you receive.

 

Feeds

Bitly

Blogs

Boards

Demographics

Facebook

NewsCred

Twitter

Videos

Augmentations

Gender

Interaction

Klout Score

Klout Profile

Klout Topics

Language

Links

Salience Sentiment

 

DailyMotion and YouTube belong to the Videos family of feeds. Amazon, Flickr, IMDb, Reddit, Topix, and 2channel belong to the Boards family.

Note This section describes the data that you can consume from DataSift. It does not describe the targets that you can filter against. Read Targets or Output Data? to learn more.

 

How does it work?

Objects flow into DataSift one by one. Each of them contains a rich variety of data, much more than you might expect. Think about a Tweet for a moment. As well as the 140-character payload that most people would say 'is' the Tweet, the object contains the author's screen name and a link to their photograph, the number of followers they have, the number of people they follow, and more than 40 other nuggets of information. Each Tweet that you send contains your entire Twitter bio hidden inside, for example.

In DataSift's filtering engine, these individual fields are called targets. For example, if you write this filter:

    twitter.text contains "hippopotamus"

you're instructing DataSift to deliver every Tweet that contains the word hippopotamus and to exclude everything that isn't a Tweet and every Tweet that doesn't mention a hippopotamus.

The targets and operators documentation describes filtering in detail, with examples of how to use every CSDL operator and every target. 

What data is in the output stream?

To use DataSift effectively, developers also need to understand the data that is actually delivered in the output stream and to know how to control what is delivered. This process is entirely separate from filtering.

Here are some key things to learn first. Don't worry, it's simple.

 

Activating data sources.  You're free to select which data sources you want to receive and which ones you don't want. For example, you might choose to take Twitter and Facebook only. Activate the sources on the Browse Data Sources page. However, you can filter on any piece of data, even if you have not activated it. Activating a source means that its data will be available in the output stream.

 

Understanding the data.  A small number of data fields are not available for filtering. For example, the Salience Entities data is available in the output stream (if you activate the Salience Entities source) but if you look at the Salience target page you will not find it listed.

 


Facebook Data

The Facebook feed delivers a full range of data concerning Facebook posts.

 

Note The Facebook targets page explains how to filter against this data and describes the content.

 

There's information about the author of each post including their name, Facebook id, and a link to their profile page on Facebook. 

The Facebook Open Graph information is available. 

There are details of how many times each post has been "Liked" by other Facebook users, and there's the content of the post, of course.

Components of this feed

There are five separate components in this feed and you can enable or disable them individually. While reading this table, look at the JSON output example below.

If you enable this item:    You receive this data:
Facebook author             The "author" element in "facebook"
Facebook to                 The "to" element in "facebook"
Facebook open graph         The "og" element in "facebook"
Facebook likes              The "likes" element in "facebook"
Facebook base               The "id", "message", "description", "caption", "type", "application", "source", "link", "name" elements in "facebook"
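
If you process the output programmatically, the table above can be turned into a lookup that tells you which keys to expect under "facebook" for a given set of enabled components. This is a sketch in Python: the short component names used as dictionary keys and the function name are hypothetical, while the element names are the ones listed in the table.

```python
# Elements of the "facebook" object delivered by each component (from the table).
COMPONENT_KEYS = {
    "author": ["author"],
    "to": ["to"],
    "open graph": ["og"],
    "likes": ["likes"],
    "base": ["id", "message", "description", "caption", "type",
             "application", "source", "link", "name"],
}


def expected_facebook_keys(enabled_components):
    """Return the set of "facebook" elements the enabled components can deliver."""
    keys = set()
    for component in enabled_components:
        keys.update(COMPONENT_KEYS[component])
    return keys


print(sorted(expected_facebook_keys(["author", "likes"])))
```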

Output Example

Here's an example, in JSON format, of the data you receive when you activate the Facebook feed.


Bitly Data

The bitly data source delivers one interaction each time someone follows a shortened bitly link.
 

Note The bitly targets page explains how to filter against this data and describes the content.

Output example

Here's an example, in JSON format, of the data you receive when you activate the bitly feed.


Output Example: Using bitly with the links augmentation

The following example displays the data you receive when you use the bitly target along with the links augmentation.


Twitter Data

The Twitter feed delivers a wide range of information about Tweets.

Note The Twitter targets page explains how to filter against this data and describes the content.

 

Output Example: a Tweet

Here's an example, in JSON format, of the data you receive when you activate the Twitter feed.

 

Output Example: A Retweet

This object is an example of the format of a Retweet after DataSift has normalized and augmented the content:

 

Output Example: Twitter Deletes

In order to comply with Twitter's Terms of Service, if you are storing data, you must understand how to handle delete messages. Please read our page on Twitter Deletes.

 

Output Example: Twitter User Status Messages

In order to comply with Twitter's Terms of Service with regard to respecting users' privacy settings, you must also handle any User Status Messages you receive as part of your stream. Please read our page on Twitter User Status Messages.


Boards Data

DataSift's Board data source provides data from a variety of message boards around the world. The highest-traffic boards such as 2channel.net have their own data source in DataSift whereas the Board data source collects posts from many lower-volume message boards.

 

Note The Boards targets page explains how to filter against this data and describes the content.

 

Output example

Here's an example, in JSON format, of the data you receive when you activate the Boards feed.


Blogs Data

The Blog data source combines material from a wide variety of sites, ranging from well-known hosts such as Blogger with very large numbers of active users to small single-user sites that run as blogs or incorporate a blog.

 

Note The Blogs targets page explains how to filter against this data and describes the content.

 

Output example

Here's an example, in JSON format, of the data you receive when you activate the Blog feed.


Videos Data

There are many video hosting sites apart from YouTube and, collectively, they hold a massive amount of data. The Video data source collects content from many of the lesser-known video hosting sites. Use this data source in conjunction with the YouTube and DailyMotion sources for maximum coverage.

Note The Videos targets page explains how to filter against this data and describes the content.

 

Output Example

Here's an example, in JSON format, of the data you receive when you activate the Videos feed.


Gender Data

The Gender Demographic augmentation is an in-house augmentation by DataSift. It analyzes an author's name and location to derive their likely gender.

The location is significant because usage of names might vary by country. For instance, Jan is likely to be a woman's name in Britain but a man's name in Scandinavian countries.

Note The Gender Demographic augmentation page explains how to filter against this data and describes the content.

 

Output Example

Here's an example, in JSON format, of the data you receive when you activate the Gender Demographic augmentation.


Interaction Data

The Interaction data source is an in-house augmentation by DataSift. Objects from all the feeds that come into DataSift are also available in the Interaction augmentation.

Note The Interaction augmentation page explains how to filter against this data and describes the content.

 

Here's an illustration. Suppose that a series of objects arrive in real time, one-by-one. The first happens to be a Tweet, the next is from Myspace, and the next is from Facebook. It's clear that objects coming from Twitter, from Myspace, and from Facebook have some fundamental differences. However, they have common features too. For example, there's the main payload of the message being conveyed, there's the author's name, and there might be geo information. DataSift identifies and collects these common features and makes them available in the Interaction augmentation.

In cases where you require only this type of core information, it is easier to extract it from the "interaction" family of values in the output object rather than to try to take the content of a Tweet from a Twitter object, and then the message from a Facebook object, and so on.
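
As an illustration, the core fields can be read from the "interaction" section of an output object without knowing which feed it came from. The Python sketch below uses a made-up sample object. The "content" key mirrors the interaction.content target, while the nested "author" and "name" keys are assumptions about the layout, so check them against real output before relying on them.

```python
import json

# Hypothetical output object; real objects carry many more fields.
raw = """
{
  "interaction": {
    "content": "I saw elephants today",
    "author": {"name": "Example Author"}
  },
  "twitter": {"text": "I saw elephants today"}
}
"""


def core_fields(obj):
    """Extract source-independent fields from the "interaction" section."""
    interaction = obj.get("interaction", {})
    return {
        "content": interaction.get("content"),
        "author_name": interaction.get("author", {}).get("name"),
    }


print(core_fields(json.loads(raw)))
```

The same function works unchanged whether the object originated from Twitter, Facebook, or another feed, because it never looks at the feed-specific sections.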

To receive data from a feed, you need to Activate it on the Browse Data Sources page. If you want to receive only the data that's automatically available in Interaction, keep the feed activated but disable all its individual components on the My Data Sources page.

Let's use Twitter as an example:

  1. Activate Twitter on the Browse Data Sources page.
  2. Go to the My Data Sources page.
  3. Click the down arrow on the right-hand side of the row for Twitter.
  4. Uncheck all four of its components: base, place, retweets, and user.

In this way, Twitter is activated but all of its components are disabled. Twitter data will be available and DataSift will deliver it to you in the interaction elements of the output object, but it will not bloat the output objects with all the Twitter elements. In the example below, we've indicated that the JSON object might or might not contain more than just the interaction values; it's entirely your choice.

A note about filtering

So far, we've been looking at the data that DataSift delivers to you, but the interaction augmentation also makes filtering easier. It is quicker to write a filter that uses one target rather than several. There is no need to write this CSDL:

    twitter.text contains "elephants"
    or myspace.content contains "elephants"
    or facebook.message contains "elephants"

You could write it in a shorter form like this:

    interaction.content contains "elephants"

See the Interaction augmentation page for details of the Interaction targets that you can include in a CSDL filter.

Output Example

Here's an example, in JSON format, of the data you receive when you activate the Interaction augmentation.


Klout Score Data

The Klout Score augmentation gives an author's main score from Klout.com.

Note The Klout augmentation page explains how to filter against this data and describes the content.

 

Output Example

Here's an example, in JSON format, of the data you receive when you activate the Klout Score augmentation.


Klout Profile Data

The Klout Profile augmentation gives an author's main Klout score from Klout.com plus a detailed breakdown of their profile on Klout. For example, it includes their network impact, a measure of the influence of an author's audience.

See the Klout augmentation page for details of the Klout targets that you can include in a CSDL filter.

Output Example

Here's an example, in JSON format, of the data you receive when you activate the Klout Profile augmentation.


Language Data

The Language augmentation analyzes a post to determine which language it is written in.

Note The Language augmentation page explains how to filter against this data and describes the content.

Output Example

Here's an example, in JSON format, of the data you receive when you activate the Language augmentation.


Links Meta Output Data

The Links Meta augmentation gives you detailed information about a web page. This includes metadata specific to Facebook Open Graph, Twitter Cards, Google News and standard Search Engines.
 

Note The Links augmentation page explains how to filter against this data and describes the content. 

Components of this feed

There are nine main components in this feed and you can enable or disable them individually. While reading this table, take a look at the JSON output examples below. 

If you enable this item:    You receive this data:
Links twitter_card          The "twitter" elements in "links.meta"
Links charset               The "charset" element in "links.meta"
Links lang                  The "lang" element in "links.meta"
Links description           The "description" element in "links.meta"
Links keywords              The "keywords" element in "links.meta"
Links opengraph             The "opengraph" elements in "links.meta"
Links news:keywords         The "newskeywords" element in "links.meta"
Links standout              The "standout" element in "links.meta"
Links content_type          The "content_type" element in "links.meta"

Of the components listed above:

  • The "twitter" elements represent metadata from Twitter Cards.
  • The "opengraph" elements represent metadata from Open Graph.
  • The "standout" and "newskeywords" elements represent metadata from Google News feeds.
  • The "charset", "lang", "description", "keywords", and "content_type" elements represent metadata from standard Search Engines.

Output Example: Open Graph

Here's an example, in JSON format, of the data you receive when you filter using the Links Meta augmentation for Open Graph.

 

Output Example: Twitter Cards

Here's an example of the data you receive when you filter using the Links Meta augmentation for Twitter Cards.

 

Output Example: Google News and Search Engines

Here's an example of the data you receive when you filter using the Links Meta augmentation for Google News or standard Search Engines.


Links Data

The Links augmentation gives details of hyperlinks that it finds in posts. It fully resolves all shortened and multiply shortened links.

Note The Links augmentation page explains how to filter against this data and describes the content.

 

Components of this feed

There are eighteen separate components in this feed, but not all of them have to be present in the JSON output (see the example near the end of this page). When they are, you will find them defined as keys of the links dictionary. The values of those keys may be lists of one or more items, or dictionaries. Some of the links components are grouped inside the meta key and its sub-keys. The following list gives you an overview of the output hierarchy. While reading it, look at the JSON output example below.

  • code
  • created_at
  • domain
  • hops
  • meta
    • charset (standard Search Engines)
    • content_type (standard Search Engines)
    • description (standard Search Engines)
    • keywords (standard Search Engines)
    • lang (standard Search Engines)
    • news:keywords (Google News)
    • opengraph (Open Graph)
    • standout (Google News)
    • twitter (Twitter Card)
  • normalized_url
  • retweet_count
  • title
  • url

Output Examples

The following examples of output illustrate differences in the JSON output you are likely to see depending on the type of data source used to generate them.

Here's a generic example, in JSON format, of the data you receive when you activate the Links augmentation.

Data source-specific augmentations will be listed as values of the meta key in the JSON output. The following examples show differences in output generated from Facebook Open Graph, Twitter Cards, Google News and standard Search Engines.
 

Output Example: Open Graph

Here's an example, in JSON format, of the data you receive when you filter using the Links Meta augmentation for Open Graph.

 

Output Example: Twitter Cards

Here's an example of the data you receive when you filter using the Links Meta augmentation for Twitter Cards.

 

Output Example: Google News and Search Engines

Here's an example of the data you receive when you filter using the Links Meta augmentation for Google News or standard Search Engines.


Salience Sentiment Data

The Salience Sentiment augmentation measures the positive or negative sentiment expressed in the content of a post or in its title. Salience Sentiment works with material written in English, French, Spanish, and Portuguese.

Note The Salience augmentation page explains how to filter against this data and describes the content.

 

Components of this feed

There are two separate components in this feed and you can enable or disable them individually. While reading this table, look at the JSON output example below.

If you enable this item:    You receive this data:
Content sentiment           The "content": {"sentiment"} element in "salience"
Title sentiment             The "title": {"sentiment"} element in "salience"

Output Example

Here's an example, in JSON format, of the data you receive when you activate the Salience Sentiment augmentation.


Salience Entities Data

The Salience Entities augmentation is an example of data that can be delivered in your output stream but is not filterable. The augmentation adds a list of the entities it finds in each post. An entity might be a job title, a product name, a company, a location, a person or a quote. It could also be a pattern of commonly found words that are typically followed by something that is easily predictable. For example: "The meeting will take place at...." is normally followed by a time of day.

DataSift's entities analysis is provided by Lexalytics. The entities engine independently analyzes the content and the title. With this source you will receive both.

Currently you can receive Salience Entities information in your output stream but you cannot filter on it. Take a look at Targets vs Output Data to learn about the difference between filtering on data and consuming that data.

See the Salience augmentation page for details of the other Salience targets that you can include in a CSDL filter.

Salience Entities works with material written in English, French, Spanish, and Portuguese.

 

Output Example

Here's an example, in JSON format, of the data you might receive if you activate this augmentation.


Demographics

The Demographics feed delivers high-level Tweet information with personal details such as the author's user id, screen name, and avatar removed. It offers an excellent source of detailed data for statistical analysis.

This chart illustrates what happens when you activate the Demographics data source while a stream is running. Assuming that you activated Twitter before you ran the stream, you will initially receive Tweets that include the authors' names, usernames, and so on, along with the Twitter biography details that authors provide on their Settings page in Twitter. The data will not include detailed demographic data.

 

 

If you now activate the Demographics data source, the content that your stream delivers will change immediately. The stream now contains detailed demographic information but interactions are anonymized; they no longer contain names, usernames, or links to a user's profile page on Twitter.
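A consumer that records a running stream may want to notice when the switch happens. Here is a rough Python sketch, assuming that anonymized interactions simply omit the identifying author fields (name, username, link, avatar) under interaction->author; that omission, and the function name, are assumptions for illustration only.

```python
IDENTIFYING_FIELDS = ("name", "username", "link", "avatar")


def looks_anonymized(interaction):
    """True when none of the identifying author fields carry a value."""
    author = interaction.get("interaction", {}).get("author", {})
    return not any(author.get(field) for field in IDENTIFYING_FIELDS)


# Hypothetical objects; the values are placeholders, not real output.
with_author = {"interaction": {"author": {"name": "Example Name", "username": "example"}}}
without_author = {"interaction": {"id": "placeholder-id"}}
print(looks_anonymized(with_author), looks_anonymized(without_author))
```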

You can switch back to Twitter any time you want, but remember that it is a global setting, applying to all of the streams currently running.

 

Warning

When you enable the Demographics data source, your Twitter data automatically ceases to arrive. To protect privacy, we ensure that Twitter and Demographics data are mutually exclusive. If you are running and recording a Twitter stream, be very careful. As soon as you activate Demographics, you will cease to receive Tweets. This applies to streams that are already running.

 

A word about ids

To preserve anonymity, we do not reveal Twitter users' ids when you choose to receive Demographics. Instead, we map each Twitter user id to a new id which is available in the hash_id element in the interaction section of the JSON output.

This process enables you to determine whether two messages are from the same author or not. Details of the way the new ids are generated are not available and there is no way to map the new id back to the original Twitter user id.

In this JSON output example, the id element, which is generated internally by DataSift, serves as a unique identifier for this interaction. It cannot be used to identify the author. The hash_id is derived from the original Twitter user id. For example, the Twitter user id for @DataSift is 155505157 but this user id would never appear in its original form in an anonymized Tweet. In case you're wondering, the hash_id shown above does not correspond to the @DataSift Twitter account.
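To check whether several messages come from the same author, you can group interactions by their hash_id. The Python sketch below assumes the hash_id sits directly inside the interaction section, as described above; the function name, ids, and hash_id values are invented placeholders.

```python
from collections import defaultdict


def group_by_author(interactions):
    """Group anonymized interactions by hash_id, skipping objects that lack one."""
    groups = defaultdict(list)
    for item in interactions:
        hash_id = item.get("interaction", {}).get("hash_id")
        if hash_id is not None:
            groups[hash_id].append(item)
    return dict(groups)


# Hypothetical data: these ids and hash_ids are not real DataSift values.
stream = [
    {"interaction": {"id": "a1", "hash_id": "h-001"}},
    {"interaction": {"id": "a2", "hash_id": "h-002"}},
    {"interaction": {"id": "a3", "hash_id": "h-001"}},
]
by_author = group_by_author(stream)
print({hash_id: len(items) for hash_id, items in by_author.items()})
```

Grouping tells you only that two messages share an author; it cannot recover who that author is.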

 

Output example: an anonymized Tweet

Here's an example, in JSON format, of the data you receive when you activate the Demographics data source.

 

Output Example: an Anonymized Retweet

The structure of an anonymized Retweet is slightly different. The JSON contains the Retweet family of elements.

 

Elements in the JSON Output

The output contains these elements.

Element                        Examples or Description

status
    relationship               single, married, parents, engaged, divorced
    work                       working, students, unemployed, retirees
type                           person, company
sex                            male, female
age_range
    start                      One of: 0, 17, 20, 25, 30, 35, 40, 50, 60, 70
    end                        One of: 16, 19, 24, 29, 34, 39, 49, 59, 69
location
    country
    us_state                   The state name is presented in full, not as a 2-character code
    city
accounts
    categories                 actors, musicians, TV celebrities, comedians, news, tech brands, ...
likes_and_interests            music, fashion, news, sports, ...
first_language
professions                    musicians, journalists, artists, programmers, marketers, ...
services_and_technologies      Blogger, Tumblr, Wordpress, iPhone, Blackberry, ...
twitter
    activity                   For example: 1-5 tweets/day
    accounts.large             Lady Gaga, Eminem, TwitPic, CNN News, ...
    accounts_followed          Number of accounts the author follows on Twitter
main_street
    shop_at                    Ikea, Costco, Walmart, Office Depot, Target, ...
    dressed_by                 Victoria's Secret, Burberry, Footlocker, Calvin Klein, Nike, ...
    eat_and_drink_at           McDonald's, Starbucks, Taco Bell, ...
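The age_range start and end values listed above appear to pair up into bands. The Python sketch below builds those bands by position; treating the final start value (70) as an open-ended band is an assumption, because the table gives no matching end value, and the names used are hypothetical.

```python
STARTS = [0, 17, 20, 25, 30, 35, 40, 50, 60, 70]
ENDS = [16, 19, 24, 29, 34, 39, 49, 59, 69]

# Pair each start with the end in the same position. The last band has no
# end value in the table, so it is treated as open-ended (an assumption).
AGE_BANDS = list(zip(STARTS, ENDS + [None]))


def age_band_label(start):
    """Return a readable label such as '25-29' for a given start value."""
    for band_start, band_end in AGE_BANDS:
        if band_start == start:
            return f"{band_start}+" if band_end is None else f"{band_start}-{band_end}"
    raise ValueError(f"unknown age_range start: {start}")


print([age_band_label(start) for start in STARTS])
```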

 


Klout Topics Data

The Klout Topics augmentation gives an author's main Klout score from Klout.com plus a list of the topics that Klout believes they have a degree of influence over.

See the Klout augmentation page for details of the Klout targets that you can include in a CSDL filter.

Output Example

Here's an example, in JSON format, of the data you receive when you activate the Klout Topics augmentation.


NewsCred Data

The NewsCred data source delivers news articles, images, and videos from more than 750 of the world's highest-quality sources, including leading financial and entertainment publications, in a fully license-compliant way.

 

Note: The NewsCred targets page explains how to filter against this data and describes the content.

 

Output examples

Image

Here's an example, in JSON format, of the data you receive when you activate the NewsCred data source. This object describes an image:

 

Article

This object describes an article:

 

Video

This object describes a video:


Reference: Stream Output Elements

Here's a list of all the output elements your JSON might contain:

 

2ch->content

2ch->contenttype

2ch->created_at

2ch->domain

2ch->id

2ch->link

2ch->thread

2ch->title

2ch->type

 

amazon->author->link

amazon->content

amazon->contenttype

amazon->created_at

amazon->domain

amazon->id

amazon->link

amazon->thread

amazon->title

amazon->type

 

bitly->cname

bitly->country

bitly->country_code

bitly->created_at

bitly->domain

bitly->geo_city

bitly->geo->latitude

bitly->geo->longitude

bitly->geo_region

bitly->geo_region_code

bitly->id

bitly->referring_domain

bitly->referring_url

bitly->share->hash

bitly->timezone

bitly->type

bitly->url

bitly->url_hash

bitly->user->agent

 

blog->author->link

blog->author->name

blog->content

blog->contenttype

blog->created_at

blog->domain

blog->id

blog->link

blog->post->created_at

blog->post->link

blog->post->title

blog->title

blog->type

 

board->author->age

board->author->gender

board->author->link

board->author->location

board->author->registered

board->content

board->contenttype

board->created_at

board->domain

board->id

board->link

board->review->recommendation

board->review->ticker

board->thread

board->title

board->type

 

dailymotion->author->link

dailymotion->author->name

dailymotion->category

dailymotion->content

dailymotion->contenttype

dailymotion->created_at

dailymotion->duration

dailymotion->id

dailymotion->tags

dailymotion->title

dailymotion->type

dailymotion->videolink

 

demographic->gender

 

facebook->application

facebook->author->avatar

facebook->author->id

facebook->author->link

facebook->author->name

facebook->caption

facebook->created_at

facebook->description

facebook->id

facebook->likes->count

facebook->link

facebook->message

facebook->name

facebook->og->

facebook->og->by

facebook->og->length

facebook->og->page

facebook->og->par

facebook->og->photos

facebook->og->published time

facebook->source

facebook->type

flickr->author->link

flickr->content

flickr->contenttype

flickr->created_at

flickr->domain

flickr->id

flickr->link

flickr->thread

flickr->title

flickr->type

 

imdb->author->link

imdb->content

imdb->contenttype

imdb->created_at

imdb->domain

imdb->id

imdb->link

imdb->thread

imdb->title

imdb->type

 

interaction->author->avatar

interaction->author->contributions

interaction->author->id

interaction->author->link

interaction->author->name

interaction->author->talk

interaction->author->username

interaction->content

interaction->contenttype

interaction->created_at

interaction->geo->latitude

interaction->geo->longitude

interaction->id

interaction->link

interaction->schema->version

interaction->source

interaction->title

interaction->type

 

klout->score

 

language->confidence

language->tag

 

links->meta->keywords

links->meta->opengraph->activity

links->meta->opengraph->author

links->meta->opengraph->cause

links->meta->opengraph->city

links->meta->opengraph->country

links->meta->opengraph->description

links->meta->opengraph->director

links->meta->opengraph->email

links->meta->opengraph->fax_number

links->meta->opengraph->geo->latitude

links->meta->opengraph->geo->longitude

links->meta->opengraph->image

links->meta->opengraph->locality

links->meta->opengraph->musician

links->meta->opengraph->non_profit

links->meta->opengraph->phone_number

links->meta->opengraph->region

links->meta->opengraph->site_name

links->meta->opengraph->sport

links->meta->opengraph->title

links->meta->opengraph->type

links->meta->opengraph->url

links->meta->opengraph->website

links->meta->twitter->app->googleplay->id

links->meta->twitter->app->googleplay->name

links->meta->twitter->app->googleplay->url

links->meta->twitter->app->ipad->id

links->meta->twitter->app->ipad->name

links->meta->twitter->app->ipad->url

links->meta->twitter->app->iphone->id

links->meta->twitter->app->iphone->name

links->meta->twitter->app->iphone->url

links->meta->twitter->card

links->meta->twitter->creator

links->meta->twitter->creator_id

links->meta->twitter->description

links->meta->twitter->image

links->meta->twitter->image_height

links->meta->twitter->image_width

links->meta->twitter->player

links->meta->twitter->player_height

links->meta->twitter->player_stream

links->meta->twitter->player_stream_content_type

links->meta->twitter->player_width

links->meta->twitter->site

links->meta->twitter->site_id

links->meta->twitter->title

links->meta->twitter->url

 

newscred->article->category

newscred->article->content

newscred->article->domain

newscred->article->fulltext

newscred->article->link

newscred->article->title

newscred->article->topics->name

newscred->id

newscred->image->attribution->text

newscred->image->caption

newscred->image->links->large

newscred->image->links->small

newscred->image->size->height

newscred->image->size->width

newscred->modified_at

newscred->published_at

newscred->source->circulation

newscred->source->company_type

newscred->source->country

newscred->source->domain

newscred->source->founded

newscred->source->frequency

newscred->source->id

newscred->source->link

newscred->source->name

newscred->source->owner

newscred->source->thumbnail

newscred->type

newscred->updated

newscred->video->category

newscred->video->domain

newscred->video->link

newscred->video->thumbnail

newscred->video->title

newscred->video->topics->name

 

 

reddit->author->link

reddit->content

reddit->contenttype

reddit->created_at

reddit->domain

reddit->id

reddit->link

reddit->thread

reddit->title

reddit->type

 

salience->content->entities->about

salience->content->entities->confident

salience->content->entities->evidence

salience->content->entities->label

salience->content->entities->name

salience->content->entities->sentiment

salience->content->entities->type

salience->content->sentiment

salience->content->topics->hits

salience->content->topics->name

salience->content->topics->score

salience->title->entities->about

salience->title->entities->confident

salience->title->entities->evidence

salience->title->entities->label

salience->title->entities->name

salience->title->entities->sentiment

salience->title->entities->type

salience->title->sentiment

salience->title->topics->hits

salience->title->topics->name

salience->title->topics->score

 

 

topix->author->link

topix->author->location

topix->author->registered

topix->content

topix->contenttype

topix->created_at

topix->domain

topix->id

topix->link

topix->thread

topix->title

topix->type

 

twitter->created_at

twitter->filter_level

twitter->geo->latitude

twitter->geo->longitude

twitter->id

twitter->in_reply_to_screen_name

twitter->in_reply_to_status_id

twitter->in_reply_to_user_id

twitter->lang

twitter->media->display_url

twitter->media->expanded_url

twitter->media->id

twitter->media->id_str

twitter->media->media_url

twitter->media->media_url_https

twitter->media->sizes->large->h

twitter->media->sizes->large->resize

twitter->media->sizes->large->w

twitter->media->sizes->medium->h

twitter->media->sizes->medium->resize

twitter->media->sizes->medium->w

twitter->media->sizes->small->h

twitter->media->sizes->small->resize

twitter->media->sizes->small->w

twitter->media->sizes->thumb->h

twitter->media->sizes->thumb->resize

twitter->media->sizes->thumb->w

twitter->media->source_status_id

twitter->media->source_status_id_str

twitter->media->type

twitter->media->url

twitter->place->attributes->locality

twitter->place->attributes->region

twitter->place->attributes->street_address

twitter->place->country

twitter->place->country_code

twitter->place->full_name

twitter->place->id

twitter->place->name

twitter->place->place_type

twitter->place->url

twitter->retweet->count

twitter->retweet->created_at

twitter->retweeted->created_at

twitter->retweeted->geo->latitude

twitter->retweeted->geo->longitude

twitter->retweeted->id

twitter->retweeted->place->attributes->locality

twitter->retweeted->place->attributes->region

twitter->retweeted->place->attributes->street_address

twitter->retweeted->place->country

twitter->retweeted->place->country_code

twitter->retweeted->place->full_name

twitter->retweeted->place->id

twitter->retweeted->place->name

twitter->retweeted->place->place_type

twitter->retweeted->place->url

twitter->retweeted->source

twitter->retweeted->user->created_at

twitter->retweeted->user->description

twitter->retweeted->user->favourites_count

twitter->retweeted->user->followers_count

twitter->retweeted->user->friends_count

twitter->retweeted->user->geo_enabled

twitter->retweeted->user->id

twitter->retweeted->user->id_str

twitter->retweeted->user->lang

twitter->retweeted->user->listed_count

twitter->retweeted->user->location

twitter->retweeted->user->name

twitter->retweeted->user->profile_image_url

twitter->retweeted->user->screen_name

twitter->retweeted->user->statuses_count

twitter->retweeted->user->time_zone

twitter->retweeted->user->url

twitter->retweeted->user->utc_offset

twitter->retweeted->user->verified

twitter->retweet->id

twitter->retweet->lang

twitter->retweet->media->display_url

twitter->retweet->media->expanded_url

twitter->retweet->media->id

twitter->retweet->media->id_str

twitter->retweet->media->media_url

twitter->retweet->media->media_url_https

twitter->retweet->media->sizes->large->h

twitter->retweet->media->sizes->large->resize

twitter->retweet->media->sizes->large->w

twitter->retweet->media->sizes->medium->h

twitter->retweet->media->sizes->medium->resize

twitter->retweet->media->sizes->medium->w

twitter->retweet->media->sizes->small->h

twitter->retweet->media->sizes->small->resize

twitter->retweet->media->sizes->small->w

twitter->retweet->media->sizes->thumb->h

twitter->retweet->media->sizes->thumb->resize

twitter->retweet->media->sizes->thumb->w

twitter->retweet->media->source_status_id

twitter->retweet->media->source_status_id_str

twitter->retweet->media->type

twitter->retweet->media->url

twitter->retweet->source

twitter->retweet->text

twitter->retweet->user->created_at

twitter->retweet->user->description

twitter->retweet->user->favourites_count

twitter->retweet->user->followers_count

twitter->retweet->user->friends_count

twitter->retweet->user->geo_enabled

twitter->retweet->user->id

twitter->retweet->user->id_str

twitter->retweet->user->lang

twitter->retweet->user->listed_count

twitter->retweet->user->location

twitter->retweet->user->name

twitter->retweet->user->profile_image_url

twitter->retweet->user->screen_name

twitter->retweet->user->statuses_count

twitter->retweet->user->time_zone

twitter->retweet->user->url

twitter->retweet->user->utc_offset

twitter->retweet->user->verified

twitter->source

twitter->status

twitter->text

twitter->user->created_at

twitter->user->description

twitter->user->favourites_count

twitter->user->followers_count

twitter->user->friends_count

twitter->user->geo_enabled

twitter->user->id

twitter->user->id_str

twitter->user->lang

twitter->user->listed_count

twitter->user->location

twitter->user->name

twitter->user->profile_image_url

twitter->user->screen_name

twitter->user->statuses_count

twitter->user->time_zone

twitter->user->url

twitter->user->utc_offset

twitter->user->verified

 

video->author->link

video->author->name

video->category

video->content

video->contenttype

video->created_at

video->duration

video->id

video->tags

video->title

video->type

video->videolink

 

wikipedia->author->contributions

wikipedia->author->link

wikipedia->author->talk

wikipedia->author->username

wikipedia->body

wikipedia->changetype

wikipedia->created_at

wikipedia->diff->from

wikipedia->diff->htmldiff

wikipedia->diff->to

wikipedia->id

wikipedia->langlinks->lang

wikipedia->langlinks->title

wikipedia->langlinks->url

wikipedia->links->link

wikipedia->links->namespace

wikipedia->links->ns

wikipedia->namespace

wikipedia->newlen

wikipedia->ns

wikipedia->oldlen

wikipedia->pageid

wikipedia->parentid

wikipedia->previousid

wikipedia->sections->anchor

wikipedia->sections->byteoffset

wikipedia->sections->fromtitle

wikipedia->sections->index

wikipedia->sections->level

wikipedia->sections->line

wikipedia->sections->number

wikipedia->sections->toclevel

wikipedia->title

wikipedia->type

 

youtube->author->link

youtube->author->name

youtube->category

youtube->commentslink

youtube->content

youtube->contenttype

youtube->created_at

youtube->duration

youtube->id

youtube->title

youtube->type

youtube->videolink
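Each entry in the list above names a path through the nested JSON object, with -> separating the levels. As a minimal Python sketch, assuming the interaction has been parsed into nested dictionaries, a small helper can resolve such a path; the helper name is hypothetical and the sample values are placeholders.

```python
def get_element(interaction, path):
    """Resolve a path such as 'twitter->user->followers_count'.

    Returns None if any level of the path is missing. Lists (for example,
    multiple entities or topics) are not traversed in this sketch.
    """
    current = interaction
    for key in path.split("->"):
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return None
    return current


# Hypothetical interaction with placeholder values.
sample = {"twitter": {"user": {"followers_count": 42}}, "klout": {"score": 50}}
print(get_element(sample, "twitter->user->followers_count"))
print(get_element(sample, "twitter->place->country"))
```

Returning None for a missing path suits this data, since most elements are present only when the relevant source or augmentation is active.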


Salience Topics Data

The Salience Topics data source gives the list of topics that Salience found in the content or title of a post. See the Salience augmentation page for details of the Salience targets that you can include in a CSDL filter. Note: this augmentation is available only for material written in English.

Read more about Salience Topics.

Output Example

Here's an example, in JSON format, of the data you receive when you activate the Salience Topics data source.


Wikipedia Data

The Wikipedia feed delivers information relating to edits and new contributions at Wikipedia.org.

Note: The Wikipedia targets page explains how to filter against this data and describes the content.

 

Output example

Here's an example, in JSON format, of the data you receive when you activate the Wikipedia feed.


Salience Topics

DataSift's topics analysis engine independently analyzes the content and the title of a post and derives a list of topics for each. The topics are:

Advertising      Disasters                Investing           Social Media
Agriculture      Economics                Labor               Software and Internet
Art              Education                Law                 Space
Automotive       Elections                Marriage            Sports
Aviation         Fashion                  Mobile Devices      Technology
Banking          Food                     Politics            Traditional Energy
Beverages        Hardware                 Real Estate         Travel
Biotechnology    Health                   Renewable Energy    Video Games
Business         Hotels                   Robotics            War
Crime            Intellectual Property    Science             Weather

Any post might contain multiple topics. For example, a Tweet such as "I'm flying to the K-12 Science Fair in Seattle this morning" might match against "Education" and "Science" and "Travel".

The Salience topics are derived from an analysis of the entire content in Wikipedia™. Each Salience topic is defined by a list of keywords. For example, Aviation is defined as "aviation, airplane, flying".

Here are some more examples:

  • Banking    banking, bank, mortgage, checking, savings   
  • Beverages    beverage, alcohol, soda   
  • Biotechnology    biotech, biotechnology, applied_biology, gene_therapy, genetic_engineering  
  • Business    business, management, executive, company, shareholder, mba
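To make the idea of a topic defined by a keyword list concrete, here is a deliberately naive Python sketch that uses only the keyword lists quoted above. It is not how the Salience engine works; the matching rule (any shared word, with adjacent words joined by an underscore) and the function name are assumptions for illustration.

```python
import re

# Keyword lists copied from the examples above; other topics are omitted
# because their keyword lists are not given here.
TOPIC_KEYWORDS = {
    "Aviation": ["aviation", "airplane", "flying"],
    "Banking": ["banking", "bank", "mortgage", "checking", "savings"],
    "Beverages": ["beverage", "alcohol", "soda"],
    "Biotechnology": ["biotech", "biotechnology", "applied_biology",
                      "gene_therapy", "genetic_engineering"],
    "Business": ["business", "management", "executive", "company",
                 "shareholder", "mba"],
}


def naive_topics(text):
    """Return the topics whose keyword list shares a term with the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Join adjacent words so that "gene therapy" can match "gene_therapy".
    terms = set(tokens) | {f"{a}_{b}" for a, b in zip(tokens, tokens[1:])}
    return sorted(topic for topic, keywords in TOPIC_KEYWORDS.items()
                  if terms & set(keywords))


print(naive_topics("I'm flying to the K-12 Science Fair in Seattle this morning"))
```

With only these five lists the sample Tweet matches Aviation alone; the Education, Science, and Travel matches mentioned earlier would need those topics' keyword lists, which are not reproduced here.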

To learn more, take a look at Salience's documentation on topics.

 


Trends Data

The Trends augmentation gives details of trending topics on Twitter.

 

The Trends augmentation page explains how to filter against this data and describes the content.

 

Output Example

Here's an example, in JSON format, of the data you receive when you activate the Trends augmentation.


Historics Archive Schema

As DataSift's platform and Historics product have developed, the Historical Archive's schema has evolved from unaugmented Tweets alone to the wide range of augmented data sources you have access to today. We have broken this evolving schema down by the dates on which it changed.

The existing schema versions are:


Historical Archive Schema - August 2012 to October 2012

Timeframe

August 2012 to October 2012

Data Sources

  • Twitter
  • Facebook
  • Demographics

Augmentations

  • Gender
  • Interaction
  • Klout
  • Language
  • Links
  • Salience Sentiment
  • Salience Entities and Topics
  • Trends

Historical Archive Schema - November 2012 to present

Timeframe

November 2012 to present

Data Sources

  • Twitter
  • Facebook
  • Demographics

Augmentations

  • Gender
  • Interaction
  • Klout
  • Language
  • Links
  • Salience Sentiment
  • Salience Entities and Topics
  • Trends

Historical Archive Schema - Pre December 2011

Timeframe

Pre December 2011

Data Sources

  • Twitter

Augmentations

  • Partial

Notes

This early portion of our Historics archive contains data that is only partially augmented. So, if you are filtering purely on an augmentation, you may not receive results. For example, if the interactions in the timeframe you are querying have not had the links augmentation applied, this filter will not return data:

 

This portion of the archive does not include the following:

  • Retweets
  • Retweet counts
  • Replies
  • Follower count
  • Friend count
  • Favorites count
  • Listed count
  • User verified statuses
  • Complete Twitter Places information (only the place ID is available)

Historical Archive Schema - December 2011 to April 2012

Timeframe

December 2011 to April 2012

Data Sources

  • Twitter

Augmentations

  • Gender
  • Interaction
  • Klout Score (the Klout.class and Klout.slope targets were deprecated from Mar 24, 2012)
  • Klout Profile (available from Feb 14, 2012)
  • Klout Topics (available from Mar 24, 2012)
  • Language
  • Links
  • Salience Sentiment (French, Portuguese, and Spanish support added from Apr 20, 2012)
  • Salience Entities and Topics (available from Feb 14, 2012)
  • Trends

Notes

During this period the schema was subject to change. Many new augmentations were added, and some were modified. Please ensure that your DataSift stream consumer can handle interactions that contain unexpected fields.
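One way to cope with a changing schema is to read only the sections you know about and set aside anything else, instead of failing on it. The Python sketch below assumes each interaction arrives as one JSON object; the set of known sections is taken from the output element reference, and the function name and sample line are hypothetical.

```python
import json

# Top-level sections this consumer understands; anything else is reported
# rather than treated as an error.
KNOWN_SECTIONS = {"interaction", "twitter", "klout", "language",
                  "links", "salience", "demographic"}


def split_sections(raw_line):
    """Parse one interaction and separate known sections from unexpected ones."""
    data = json.loads(raw_line)
    known = {key: value for key, value in data.items() if key in KNOWN_SECTIONS}
    unexpected = sorted(set(data) - KNOWN_SECTIONS)
    return known, unexpected


# Hypothetical line: "new_section" stands in for a field added by a schema change.
raw = '{"interaction": {"id": "x"}, "klout": {"score": 50}, "new_section": {"a": 1}}'
known, unexpected = split_sections(raw)
print(sorted(known), unexpected)
```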


Historical Archive Schema - July 2012

Timeframe

July 2012

Data Sources

  • Twitter

Augmentations

  • Gender
  • Interaction
  • Klout
  • Language
  • Links
  • Salience Sentiment
  • Salience Entities and Topics
  • Trends

Notes

During this period the schema was subject to change. Please ensure that your DataSift stream consumer can handle interactions that contain unexpected fields.


Client Libraries

Programming APIs

We offer client libraries for the most popular programming languages.

See each library's page to find out how to download it, check which features it supports, and follow step-by-step examples of how to benefit from our code.

Our engineering team is always looking at new candidate languages. If your favorite language does not yet have a DataSift client library, contact Support for the latest news.

Community-built libraries

We're happy to see that the DataSift development community has already started to add to the set of libraries we provided at launch.

Other code examples


Targets or Output Data?

A key point to understand is that targets are not the same as output data:

  • A target is something that you can filter on
  • Output data is something that you receive from DataSift

You will encounter three different cases: when data is available in both a target and in the output, when it is available only as a target, and when it is available only in the output.

 

What fields appear in output streams?

Take a look at our complete list of data elements in your output streams.

 

Both

The most common case is when information is available both as a target and in the output. For example, when you filter for people on Twitter who have more than 1,000,000 followers, the output that you receive will include their exact follower count.

    twitter.user.followers_count > 1000000

In this CSDL code, twitter.user.followers_count is the target. 

You can see the output element, in JSON format, in the following code snippet:

In this example, your CSDL filter tells you nothing about the content that you will receive. The filter simply finds all Tweets posted by people with a large number of followers.
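To see the relationship from the consuming side, the Python sketch below reads the follower count back out of a received interaction and re-applies the same threshold locally. The target uses dots (twitter.user.followers_count) while the output reference lists the element as twitter->user->followers_count; the function names and the sample values are hypothetical.

```python
def followers_count(interaction):
    """Read the twitter->user->followers_count output element, if present."""
    return interaction.get("twitter", {}).get("user", {}).get("followers_count")


def matches_filter(interaction, threshold=1_000_000):
    """Re-apply the example filter's condition to a received interaction."""
    count = followers_count(interaction)
    return count is not None and count > threshold


# Hypothetical interactions with made-up follower counts.
popular = {"twitter": {"user": {"followers_count": 2_500_000}}}
modest = {"twitter": {"user": {"followers_count": 120}}}
print(matches_filter(popular), matches_filter(modest))
```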

Target only

You will see cases where information is available as a target so you can filter on it, but it is not in the output. For example:

    interaction.sample < 10

Output data only

Finally, you will find cases where information is not available as a target, so it is not filterable, but it does appear in the output. For example, the created_at field is often present in the output data but it is not a target.

If you enable the Salience augmentation, you will find that the Salience Entities data is delivered but you cannot filter against it.


Things to Look at First


DataSift is very simple to use. You simply build filters to find the information you want.

If you're new to the product, we suggest you read these pages first.

 

1. The filters are written in our programming language. It's called the Curated Stream Definition Language (CSDL).

2. The key feature of CSDL is its set of operators.

3. The filters take information from social media sites and augmentations. Together these are called targets.

4. You can write applications that access DataSift through our APIs.

5. You can test your filters in our user interface before you start to explore the APIs.

6. Please read our Understanding Billing guide and Billing FAQ.

7. Please read Targets vs Output Data to understand the difference between filtering data and consuming data.

 
