DataSift is the complete real-time social media solution for enterprises looking for comprehensive data that delivers actionable insight.
DataSift is Big Data. We have 100% of the Twitter Firehose always available in real time.
Our media curation platform is cloud-based and highly scalable. It delivers a variety of metadata including enriched augmentations such as geo location, social influence, and sentiment analysis in one place, so you will never miss a thing. Data is delivered through our APIs or through our extensive analytics partnerships, and our flexible pricing means that whether you're a large enterprise or a single developer you can get access to the data that you need.
DataSift filters for information as it is posted. For example, you could filter for:
You can aggregate data to monitor streams of messages from more than one social media site simultaneously, or you can exclude individual sites.
We augment our real-time streams with third-party solutions such as Peer Index, Klout, and Lexalytics. Then our users can filter that data or have it delivered to them.
Here's a full list of the information your filters can target.
You can also filter against the DataSift Historics archive, a large body of content gathered from a variety of social media sites. Historics is useful when you want to turn the clock back and filter against data from the past.
It uses the same filtering language that you use when you're looking at live data. For Twitter, it works 100 times faster than live streaming and it offers 100% coverage.
The REST API enables developers to access DataSift's core functionality. Using DataSift's simple, powerful programming language, you can access the social media sites you want to monitor. The REST API provides easy ways to test and compile your code. You can run one or more streams at the same time and export data in real time.
The Streaming API offers data from all our sources in near real time. This API is for those developers building applications that do the heavy lifting, capturing continuous streams of data with no defined end and making sure that nothing is missed. If you're building an application for a major data-mining task, the Streaming API is the place to start. It also allows you to run multiple streams at the same time, filtering for two or more different sets of information simultaneously.

DataSift is very simple to use. Build filters to find the information you want in real time.
1. The filters are written in our programming language. It's called the Curated Stream Definition Language (CSDL).
2. The key feature of CSDL is its set of operators.
3. The filters take information from social media sites and augmentations. Together these are called targets.
4. You can write applications that access DataSift through our APIs.
5. You can test your filters in our user interface before you start to explore the APIs.
6. Please read our Understanding Billing guide and Billing FAQ.
7. Please read Targets vs Output Data to understand the difference between filtering data and consuming data.
We've rewritten this page but the earlier version is still available if you need it.
Are you trying to set up Push or Historics? You're in the right place. Read this page and then go to Client Library: Push and Historics.
The best way to understand how our client libraries work is to take a high-level view to understand the general concepts and then dip down into your language of choice. All the client libraries follow the same basic principles that we describe here. Class and method names may vary slightly from library to library. This document is meant as a general overview of the concepts shared across the different language bindings.
If you need code examples right away:
We're gradually building our low-level documentation for each client library. This section is still in beta but you're very welcome to look:
All interaction with the API libraries begins with an instance of the User class. Most other objects you will create while using the API require you to pass a User object, to which you have given your DataSift login and an API key.
All other objects that use the DataSift API require a User object for authentication. The User class provides methods for most ways you might want to create those other objects.
A Definition object represents a stream definition. These are roughly equivalent to the streams as you see them on the DataSift web interface, except that you cannot store them in your DataSift account for later retrieval.
A Historic object represents a Historics query. These are also roughly equivalent to the queries you can create via the DataSift web interface.
This object creates and manages connections to the DataSift streaming server. The default connection type used is HTTP, but some of the libraries also support WebSockets. This object is roughly equivalent to the code that runs in the browser when you're receiving data from a stream in the DataSift API.
When you create a StreamConsumer you supply a way to receive events from the stream. Events include connection, interactions, errors, status messages and more.
All of the libraries throw exceptions (or their language's equivalent) when errors occur. It's important that you catch and handle all possible errors for every call you make.
This section is in pseudocode but you can translate it almost directly into your language of choice.
We start by creating a User object.
We now have a choice depending on what we actually want to do. We can create a StreamConsumer directly from the User object if we have a stream hash that we want to consume. For the sake of example, let's say that we don't have the hash; instead, we have some CSDL that we want to use. To do that we need to create a Definition object to represent that CSDL.
The first thing we should do is check that the CSDL we've supplied is valid. We do this using the validate() method on the Definition object. As mentioned above, if the validate() method encounters an error (for example, if compiling the CSDL fails) it will throw an exception, so we make sure to catch that. However, we must also make sure we catch other errors that may occur. The example below uses a generic catch block to handle things like authentication or connectivity problems, but in your production code you should always catch specific exceptions.
Exception handling will be omitted from the remainder of this document, but please make sure you are correctly handling all exceptions that may be thrown by the library in your code, otherwise your program may terminate without warning while you're sleeping, and you'll miss out on some of the lovely data you want to consume.
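The validation step described above might look like the following minimal Python sketch. The `Definition`, `CompileError`, and `APIError` names are hypothetical stand-ins defined inside the snippet, not the classes of any actual client library. A real validate() call contacts the DataSift API, whereas this stand-in only rejects empty CSDL so that the example can run offline. The CSDL string in the usage note is purely illustrative.

```python
class APIError(Exception):
    """Hypothetical base class for authentication or connectivity failures."""


class CompileError(APIError):
    """Hypothetical error raised when CSDL fails to compile."""


class Definition:
    """Stand-in for a client library Definition object (illustration only)."""

    def __init__(self, csdl):
        self.csdl = csdl

    def validate(self):
        # A real library would send the CSDL to the API. This toy check
        # only rejects empty input so that the example runs offline.
        if not self.csdl.strip():
            raise CompileError("CSDL is empty")


def check_csdl(csdl):
    """Return (ok, message), catching the specific error before the general one."""
    try:
        Definition(csdl).validate()
    except CompileError as exc:
        return False, f"CSDL did not compile: {exc}"
    except APIError as exc:
        return False, f"API problem: {exc}"
    return True, "CSDL is valid"
```

Calling `check_csdl('interaction.content contains "datasift"')` takes the success path, while `check_csdl("")` takes the `CompileError` branch. Listing the specific exception first keeps compile failures distinguishable from authentication or connectivity problems.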
Now that we know our CSDL compiles properly we can move on to creating a context for consuming data via a streaming connection. We start by creating our event handler.
For most libraries there is a reference class (or interface) which your event handler must implement. This defines the methods that must exist and the parameters they take. The most notable exception to this is the Ruby library which currently uses blocks rather than an event handler class.
The first method is onConnect() which gets called when a connection is successfully established with the DataSift streaming server.
The opposite of this event is getting disconnected. There's an event for that, too.
Let's take a moment here to look at the parameter being passed to these two events. All events will get the StreamConsumer object which is raising the event as the first parameter. This enables the handlers to make calls on the StreamConsumer.
The data coming down the streaming connection consists of a mixture of interaction objects, status messages, warnings and errors. We have events that handle each of these.
Status messages trigger the onStatus() method.
Status messages can contain additional information and this will be passed in the info HashMap. For example, a status of type "progress" for Historic queries will contain the percentage complete in info['progress'].
Errors and warnings trigger the following methods.
The data being received (interactions) triggers one of two methods: onInteraction() and onDeleted(). The data passed for a deletion notification is in the same format as a normal interaction, but it contains only the data you need to identify the interaction that has been deleted, so that you can delete it from your own storage systems.
Note that properly handling delete notifications is required for you to remain compliant with some of the licenses you have signed.
Interactions trigger the onInteraction() method.
And that completes the EventHandler class.
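Pulling the handler methods described above together, an event handler could be sketched in Python as follows. This is an illustration under assumptions, not the interface of any particular client library: the snake_case method names, the `id` field on interactions, and the `replay` helper are all hypothetical. Only the general shape follows the text: the consumer is passed as the first parameter, and a "progress" status carries `info['progress']`.

```python
class EventHandler:
    """Records events in a list; method names are hypothetical, not a real library API."""

    def __init__(self):
        self.log = []

    def on_connect(self, consumer):
        self.log.append("connected")

    def on_disconnect(self, consumer):
        self.log.append("disconnected")

    def on_status(self, consumer, status_type, info):
        # For Historics queries, a "progress" status carries the percentage complete.
        if status_type == "progress":
            self.log.append(f"progress {info['progress']}%")
        else:
            self.log.append(f"status {status_type}")

    def on_warning(self, consumer, message):
        self.log.append(f"warning: {message}")

    def on_error(self, consumer, message):
        self.log.append(f"error: {message}")

    def on_interaction(self, consumer, interaction):
        # "id" is an assumed field name used only for this illustration.
        self.log.append(f"stored {interaction['id']}")

    def on_deleted(self, consumer, interaction):
        # Remove the matching item from your own storage here.
        self.log.append(f"deleted {interaction['id']}")


def replay(handler, events, consumer=None):
    """Feed (method_name, args) pairs to the handler, roughly as a consumer would."""
    for name, args in events:
        getattr(handler, name)(consumer, *args)
    return handler.log
```

The `replay` helper exists only so the handler can be exercised without a live connection; in a real application the StreamConsumer raises these events.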
Now that we have an event handler ready to receive events we can get a StreamConsumer from our Definition object.
The first parameter is the type of consumer we want. Most of the libraries only support HTTP streaming at this time, but some also support WebSockets. The second parameter is an instance of our EventHandler class.
We can now start to consume data. In most of the libraries this call will not return, so if your program needs to do other things while connected to the stream you'll need to wrap your usage of the API library in a thread.
The library will now compile the definition if necessary, connect to the streaming server, and start consuming data.
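Because the consume call usually blocks, one way to keep the rest of a program responsive is to run it in a worker thread and pass interactions back through a queue. The sketch below uses only the Python standard library. `blocking_consume` is a hypothetical stand-in for a library's consume call, and it delivers three fake interactions and then returns, whereas a real stream has no defined end.

```python
import queue
import threading


def blocking_consume(on_interaction):
    """Hypothetical stand-in for a consume() call that blocks while streaming."""
    for n in range(3):  # a real stream would keep going indefinitely
        on_interaction({"id": n})


def consume_in_background(received):
    """Run the blocking call in a daemon thread and push interactions onto a queue."""
    worker = threading.Thread(
        target=blocking_consume, args=(received.put,), daemon=True
    )
    worker.start()
    return worker


received = queue.Queue()
worker = consume_in_background(received)

# The main thread is free to do other work here.
worker.join(timeout=5)

# Drain whatever the worker delivered.
items = []
while not received.empty():
    items.append(received.get())
```

A queue is used because it is safe to share between threads; the handler only enqueues data, and the main thread decides when to process it.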
The above discussion focused on consuming a single definition. Most of the libraries support consuming multiple definitions through the same stream connection. When doing that the event handler methods will get passed the hash of the stream which matched the interaction in addition to the other parameters.
StreamConsumer object they are passed.
We're happy to see that the DataSift development community has already started to add to the set of libraries we provided at launch.
Download a copy of the Java library from our GitHub Java page. There are two ways to do this.
git clone https://www.github.com/datasift/datasift-java.git
These steps test the REST API (api.datasift.com) and then the streaming API (stream.datasift.com).
| Endpoint | Supported? |
| --- | --- |
| https://api.datasift.com/validate | Yes |
| https://api.datasift.com/compile | Yes |
| https://api.datasift.com/stream | Yes |
| https://api.datasift.com/dpu | Yes |
| https://api.datasift.com/usage | Yes |
| https://stream.datasift.com | Yes |
| ws://websocket.datasift.com | No * |
* Note that there are two endpoints for the streaming API:
The DataSift client libraries use HTTP streaming by default for most languages. There is no technical restriction on hitting the websockets API endpoint. The choice of protocol makes no difference to the data objects that are returned.
Step 1. Make sure you look at the Streaming API section of the PHP code example. It shows how to call the API.
Step 2. Add your username and API key to the code.
Step 3. Look at the format of an object from the source you want to work with. Identify the fields that you want to save to a database.
Step 4. Create a table in your database. Create columns to represent the data you will store.
Step 5. Execute this on your MySQL server to create the sample table structure used in the example.
The PHP code will extract information from the stream and save it into the newly created table.
Here's some sample SQL code to create the table:
Download a copy of the library from our GitHub Node.JS page. There are two ways to do this.
git clone https://github.com/datasift/NodeJS-Consumer.git
These steps test the streaming API (stream.datasift.com).
| Endpoint | Supported? |
| --- | --- |
| http://api.datasift.com/validate | No |
| http://api.datasift.com/compile | No |
| http://api.datasift.com/stream | No |
| http://api.datasift.com/dpu | No |
| http://api.datasift.com/usage | No |
| http://stream.datasift.com | Yes |
| ws://websocket.datasift.com | No * |
* Note that there are two endpoints for the streaming API:
The DataSift client libraries use HTTP streaming by default for most languages. There is no technical restriction on hitting the websockets API endpoint. The choice of protocol makes no difference to the data objects that are returned.
Download a copy of the library from our GitHub PHP page. There are two ways to do this.
git clone https://www.github.com/datasift/datasift-php.git
These steps test the REST API (api.datasift.com) and then the streaming API (stream.datasift.com).
| Endpoint | Supported? |
| --- | --- |
| http://api.datasift.com/validate | Yes |
| http://api.datasift.com/compile | Yes |
| http://api.datasift.com/stream | Yes |
| http://api.datasift.com/dpu | Yes |
| http://stream.datasift.com/usage | Yes |
| http://stream.datasift.com | Yes |
| ws://websocket.datasift.com | No * |
* Note that there are two endpoints for the streaming API:
The DataSift client libraries use HTTP streaming by default for most languages. There is no technical restriction on hitting the websockets API endpoint. The choice of protocol makes no difference to the data objects that are returned.
Download a copy of our Ruby client library from GitHub. There are two ways to do this.
git clone https://www.github.com/datasift/datasift-ruby.git
These steps test the REST API (api.datasift.com) and then the streaming API (stream.datasift.com).
| Endpoint | Supported? |
| --- | --- |
| http://api.datasift.com/validate | Yes |
| http://api.datasift.com/compile | Yes |
| http://api.datasift.com/stream | Yes |
| http://api.datasift.com/dpu | Yes |
| http://api.datasift.com/usage | Yes |
| http://stream.datasift.com | Yes |
| ws://websocket.datasift.com | No * |
* Note that there are two endpoints for the streaming API:
The DataSift client libraries use HTTP streaming by default for most languages. There is no technical restriction on hitting the websockets API endpoint. The choice of protocol makes no difference to the data objects that are returned.
On top of concentrating the raw data of multiple social media channels into a single stream, DataSift adds value to each data object it channels by augmenting it with other useful data. For example the standing of the author may be available, or the sentiment or language of the message. This enables you to write more incisive filters and receive more valuable results.
There are different types of augmentation:
DataSift uses data and analysis from the following providers to bring this extra insight:
Not all data sources receive all types of augmentation. Language, Salience and Links are available for all sources; the other augmentations are currently available only for Twitter.
You access DataSift's APIs in two ways:
This is the recommended approach for production environments.
You need to include a header that contains: "Auth: username:apikey"
Most client libraries already support this authorization system.
Accessing a DataSift API endpoint using GET/POST parameters is the simpler approach. It is sometimes useful for testing. Here's an example that calls the REST API to compile a CSDL filter:
https://api.datasift.com/compile?csdl=<csdl>&username=<username>&api_key=<api_key>
Note that when you hit the Streaming API using the WebSocket protocol, you must authenticate in the URI using GET parameters; there is no other option. In this case, the URI format is:
ws://websocket.datasift.com/<hash>?username=<username>&api_key=<api_key>
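The two authentication styles above can be sketched with Python's standard library. The helper names (`auth_header`, `compile_url`, `websocket_url`) are hypothetical. The header name and the URL formats are taken directly from the text above. URL-encoding the parameters is an addition that becomes necessary as soon as the CSDL contains spaces or quotes.

```python
from urllib.parse import quote, urlencode


def auth_header(username, api_key):
    """Header-based authentication, following the 'Auth: username:apikey' format."""
    return {"Auth": f"{username}:{api_key}"}


def compile_url(csdl, username, api_key):
    """GET-parameter authentication for the REST API /compile endpoint."""
    query = urlencode({"csdl": csdl, "username": username, "api_key": api_key})
    return f"https://api.datasift.com/compile?{query}"


def websocket_url(stream_hash, username, api_key):
    """WebSocket URI, where credentials can only be passed as GET parameters."""
    query = urlencode({"username": username, "api_key": api_key})
    return f"ws://websocket.datasift.com/{quote(stream_hash, safe='')}?{query}"
```

These functions only build strings; sending the request is left to whichever HTTP or WebSocket client you prefer. Keep in mind that credentials placed in a URL can end up in logs, which is one reason the header approach suits production use.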
The acronym "API" stands for "Application Programming Interface". An API is a defined way for a program to accomplish a task, usually by retrieving or modifying data. In DataSift's case, we provide API methods for all the core functionality. Programmers use the DataSift API to build applications that work with our platform. Programs talk to the API using HTTP or websockets.
NOTE: your API key and DataSift username are both case sensitive.
To use the DataSift API, the first thing you need is your API key.
1. Log in to DataSift.
2. Go to the Dashboard or to the Settings page.
3. Click the Copy to Clipboard icon under "Developer API Key".
Note that DataSift does not display the API key until you have purchased credits. If you have no credits yet, you cannot make an API call.
1. Log in to DataSift.
2. Click on the Streams tab.
3. Select a stream.
4. Click the green "Use stream" button.
In DataSift, an interaction is one object from a source website. For example, an interaction from Twitter is a Tweet, including all the meta information such as the name of the author, the number of followers the author has, the number of people the author follows, and so on. It can also include augmentation information such as the language in which the content is written and the sentiment, positive or negative, conveyed in the message. An interaction is typically delivered to your client application as a JSON object.
Filters sit at the very heart of DataSift's engine. You write them in our CSDL programming language. You can think of a filter as the logic that decides which input objects DataSift will deliver to you and which ones it will discard.
A stream is the output of a filter. The terms are quite close. In fact, you might hear some developers refer to filters as streams.
A target is an individual field of information supplied by one of our social media partners such as Twitter, by a third-party augmentation such as Klout, or by additional processing performed by DataSift itself, such as language analysis or gender detection. For example:
A post is a generic DataSift term for a message from one of our social media partners. Posts can be Tweets, blog entries, blog comments, Myspace content, and so on.
CSDL is our programming language, the Curated Stream Definition Language. Every DataSift developer learns CSDL because it is the language you use to write filters. It's a very simple, compact compiled language that runs exceptionally quickly and is easy to learn.
There is currently no limit to the number of streams you can create through our Streaming or REST APIs. You can create a maximum of 1,000 streams through our GUI. For full details on our API usage policy, please see API Rate Limiting.
JSON (JavaScript Object Notation) is the default format for the data that our APIs return. Wikipedia introduces JSON as "a lightweight text-based open standard designed for human-readable data interchange."
To use JSONP you need to understand how to:
You define your JavaScript callback function on your web page and name it as your src parameter when you call a DataSift API. The callback function needs to be able to process standard DataSift objects but also be able to handle the other types of message that it might receive, such as ticks or error messages. A tick simply indicates that a connection is open but receiving no data. Ticks look like this:
{"tick":1336057708,"status":"initialised","message":"Waiting for data"}
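As a rough illustration of the same idea outside the browser, the Python sketch below separates ticks from other messages. It assumes only what the sample above shows, namely that a tick is a JSON object with a `tick` key. Everything else is lumped together as "data", and the `classify` helper name is hypothetical.

```python
import json


def classify(raw_message):
    """Parse one JSON message and label it as a tick or as other data."""
    message = json.loads(raw_message)
    if isinstance(message, dict) and "tick" in message:
        # Ticks only signal that the connection is open; there is nothing to store.
        return "tick", message
    return "data", message


kind, tick = classify(
    '{"tick":1336057708,"status":"initialised","message":"Waiting for data"}'
)
```

A fuller handler would also recognize error messages, whose exact format is not shown here.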
Your API call needs to include the name of the callback function. For example, suppose that you're using this API call with JSON and, to keep things simple, suppose that your username is "me" and your api_key is 888:
http://stream.datasift.com/usage?username=me&api_key=888
To move from JSON to JSONP, with a callback function called xyz, you would change this API call to this:
http://stream.datasift.com/usage.jsonp?username=me&api_key=888&callback=xyz
Notice that we changed the endpoint from usage to usage.jsonp to instruct DataSift to use JSONP.
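The same rewrite can be done programmatically. This short Python sketch (the `to_jsonp` helper name is hypothetical) appends `.jsonp` to the endpoint path and adds the `callback` parameter, using only the standard library.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_jsonp(url, callback):
    """Turn a JSON API URL into its JSONP form with the given callback name."""
    parts = urlsplit(url)
    # Keep the existing parameters in order and add the callback at the end.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("callback", callback))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path + ".jsonp", urlencode(query), parts.fragment)
    )


jsonp_url = to_jsonp("http://stream.datasift.com/usage?username=me&api_key=888", "xyz")
```

Here `jsonp_url` matches the JSONP call shown above.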
We've created a PHP example to illustrate how to write code that works with a database.
They're held on GitHub, the code-sharing site. Here's what they say about their site: "GitHub has grown into an application used by over a million people to store over two million code repositories, making GitHub the largest code host in the world."
Our Client Libraries page has the up-to-date list of the libraries we offer.
We make all our development announcements on Twitter. Just follow @DataSiftDev. We'd love you to follow @DataSift, too.
We are constantly striving to build the best platform to unlock the power and data of social media. From time to time it may be necessary to disrupt our services to perform maintenance and upgrades. You can check on the status of DataSift, and check any scheduled work, on the DataSift Status Dashboard. Any changes to the platform status are also announced on @DataSiftAPI.
The rate limit is designed to ensure that everyone plays fair. Essentially, you can use the Streaming API and the /stream endpoint of the REST API without hitting limits but for activities such as compiling or validating CSDL code through the REST API, DataSift applies limits. If you find yourself hitting the limits, you might have to wait for up to one hour. Each REST API endpoint has its own rate limit cost.
If you're working in the GUI, it's because you edited the stream. If you're compiling using the /compile endpoint of the REST API, it's because you sent new CSDL. The bottom line here is that if you change the CSDL, the hash will change.
The total cost of running a stream depends on the complexity of the stream (which we measure in DPUs), how long you run the stream for, and the number of output objects it produces.
For example, each Tweet that we deliver to you costs $0.0001. If you create a stream that costs 1 DPU and run it for 6 hours, and receive 4,000 Tweets, the cost will be:
1 * 6 * 0.20 + 0.0001 * 4000 = $1.60
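The calculation above can be written as a small Python function. The prices are the ones quoted in this document ($0.20 per DPU hour and $0.0001 per Tweet) and may change; the `stream_cost` name and its parameters are hypothetical.

```python
def stream_cost(dpus, hours, objects, dpu_price=0.20, price_per_object=0.0001):
    """Total cost in dollars: processing (DPUs x hours) plus per-object licensing."""
    processing = dpus * hours * dpu_price
    licensing = price_per_object * objects
    # Round to cents to hide floating-point noise.
    return round(processing + licensing, 2)


example = stream_cost(dpus=1, hours=6, objects=4000)  # 1.20 processing + 0.40 licensing
```

For real billing code, `decimal.Decimal` would be a safer choice than floats.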
You cannot predict the license fee in advance because it is impossible to predict how many messages users will post. News stories, by their very nature, often come as complete surprises. However, you might decide to make an estimate of the traffic a stream will generate by running a test, either via DataSift's API or GUI, for a few minutes and extrapolating those results.
The cost of running a Historics query depends on data processing usage plus the licensing costs. Data processing usage is calculated based on the duration of the Historics query and the sample size of the output data; it is deducted from the monthly DPU usage. Licensing costs depend on the volume of data retrieved.
For example, suppose you create a Historics query on a simple stream of Tweets that costs 0.1 DPU, covering a timeframe of one month, with a 10 percent sample size for the output data. The data processing usage for this Historics query is calculated to be 288 DPUs, which is deducted from your monthly DPU allowance. According to the usage statistics, the query retrieves a total of 1,212,194 augmentations and sources. Because Twitter charges $0.10 for every 1,000 Tweets that we retrieve for the query, the licensing cost of the Historics query will be approximately $121.21.
DataSift charges licensing fees on behalf of our partner sites such as Twitter. The license fee that you pay is exactly proportional to the number of objects your stream produces or, in the case of Historics queries, the number of objects retrieved by your query.
If you want to create very highly targeted streams, you should typically expect to receive a low volume of data and so your license fees will be very low. For example, a filter for Tweets about hippopotamuses sent by authors with an unusually high Klout score within a radius of 20 miles of San Diego Zoo isn't going to generate very much output, even on days when the hippos do something exceptional.
On the other hand, a filter that looks for any mention of, say, music will probably generate substantially more output and cost more in license fees.
A DPU is a Data Processing Unit, a reflection of the computational complexity for the processing that you perform on the DataSift platform. A higher number represents a more complex stream. We measure DPUs on a per-hour basis because running a stream for five hours costs five times as much as running it for one hour.
It costs 20 US cents. If you purchase a subscription, you will benefit from a discount.
It depends on the capacity of our data archive at the time of creating a Historics query, as well as the timeframe and sample size of your Historics query. The timeframe of the query is the duration between the start date and time, and the end date and time of the query. The sample size of the output data can be either 100 percent or 10 percent of all the available data.
When you create a Historics query, it needs to access our data archive in order to retrieve output data for a selected timeframe. Our data archive could be very busy when multiple Historics queries are running at the same time. If the data archive is running over capacity, your query will be queued; that is, it will have to wait for access until other queries accessing the data archive have been executed. Although the queuing process takes a little time, keep in mind that Historics queries, once they are running, retrieve data 30 times faster than a real-time filter.
Once your Historics query has access to our data archive, it then depends on the timeframe of your query and the sample size of the output data. A Historics query with a shorter timeframe and a sample size of 10 percent is likely to execute more quickly than a query with a longer timeframe and a sample size of 100 percent.
Yes, you will be charged even if you stop a Historics query midway. You will be charged for the licensing costs of the volume of data retrieved until you stopped the query. The data processing usage until you stopped the query will also be deducted from your monthly DPU usage.
You purchase credits on the DataSift platform. Credits are priced in US dollars.
A credit costs $1. One credit is equivalent to 5 DPUs so the effective price of one DPU is 20 cents.
DataSift believes customers should only pay for what they consume. DataSift is a cloud platform, allowing you to consume only what you need and retain the flexibility to scale, either up or down, whenever necessary. The DPU amount that you pay is determined by the complexity of the rules you create.
Applications need to handle dynamic loads to survive. DataSift provides dynamic vertical scaling to handle unexpected data spikes as well as horizontal build out to support application growth over time.
Visit our Optimization page first. If you need further guidance, please contact our Support team.
Via the REST API:
For Streams, hit the /dpu endpoint to find details of DPUs for a stream.
For Streams, hit the /usage endpoint to find how many objects DataSift has delivered to you.
For Historics queries, hit the historics/prepare endpoint to get details of DPUs for your Historics query.
Via the DataSift GUI, visit our Billing page.
The minimum DPU charge is currently 1 DPU per hour, no matter how simple your stream is.
If you run just one stream that costs just 0.1 DPU, the total charge is 1 DPU per hour, which equates to 20 cents. In other words, the minimum DPU cost to use the platform is 20 cents per hour.
However, if you run ten streams, and they all cost 0.1 DPU, DataSift will still charge only 1 DPU per hour for all ten.
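The minimum-charge rule can be expressed in a few lines of Python. This is a sketch under the rules stated above: the minimum applies to the combined DPU total of the streams you run at the same time, and a DPU is priced at $0.20 per hour. The function names are hypothetical.

```python
MIN_DPU_PER_HOUR = 1.0  # minimum hourly charge quoted above
DPU_PRICE = 0.20        # dollars per DPU hour quoted above


def hourly_dpu_charge(stream_dpus):
    """DPUs billed per hour for a set of streams running simultaneously."""
    return max(sum(stream_dpus), MIN_DPU_PER_HOUR)


def hourly_processing_cost(stream_dpus):
    """Hourly processing cost in dollars, rounded to cents."""
    return round(hourly_dpu_charge(stream_dpus) * DPU_PRICE, 2)
```

With these definitions, one 0.1 DPU stream and ten 0.1 DPU streams are both billed at 1 DPU per hour, while a single 1.5 DPU stream is billed at its own rate.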
You can pay a fixed price for the processing but not for the licensing. We offer a range of subscriptions which include prepaid DPUs. The license cost of the content is variable and depends on the number of objects your stream returns. Clients who choose to prepay for DPUs benefit from a discount.
A Historics Preview is charged at a fixed cost of 20 DPUs per request. 20 DPUs are deducted from your account after a complete and successful execution of your request.
Applications succeed or fail based on performance. DataSift dynamically distributes workload into under-utilized CPU and memory resources to provide best-in-class service delivery with very low latency. DataSift consistently performs faster than competitors. A new Tweet, for example, is likely to be available on our platform 1 to 2 seconds after it appears on Twitter.
Yes! We are able to offer the most reliable and comprehensive SLA to our customers because we host our own dedicated cloud infrastructure. It is built on a massive-scale Service Orientated Architecture giving our customers peace of mind that we will continue running when others fail.
For more details, please take a look at our Terms and Conditions.
Billing details are explained in full on our Terms page. The cost of a stream depends on how many operators you include. Some operators are more expensive than others. All streams have a fixed cost; some have a variable cost too because some data suppliers charge for their content. Billing for Historics also works in the same way. The following page will provide you with detailed information on how our billing system works.
Don't forget to take a look at our Billing FAQ too.
You can compile and preview streams free of charge through the website GUI.
There is a charge to use streams through our APIs. The cost of using a stream via the API is a function of two variables:
1. Data processing effort required to execute the rule. Each rule is assigned an hourly data processing effort, measured in data processing units (DPUs), according to an analysis of its complexity. The simplest rule incurs an hourly cost of 0.1 DPU. However, note that DataSift's minimum charge rate is 1 DPU per hour. Therefore, you can run ten 0.1 DPU streams simultaneously for the same overall DPU cost as one.
2. Interaction throughput of the rule. The interaction throughput of a rule is the number of data objects it delivers. The cost of accepting a data object depends on the object's source and the licensing agreement we have with the provider. For example, each accepted Tweet costs $0.0001*. That means that if you accept 1,000* Tweets, the cost will be $0.10. Note that in order to receive data objects, you must sign the license agreements for a number of data sources, including Twitter, on the license page of the website.
*Subject to change.
There are two types of payment plan which differ in the approach to charging for the data processing cost, allowing you to optimize according to your usage pattern:
1. On Demand. Each DPU is charged at a fixed rate of $0.20 per hour* so, for example, a rule rated at 1.5 DPU is charged at $0.30* per hour. Note that DataSift's minimum charge is $0.20 per hour, so a 0.5 DPU rule would cost $0.20 per hour. If you use DataSift's multistream capability and run 10 streams simultaneously, and all those streams are rated at 0.1 DPU, the total is 1 DPU and so the total cost to run the streams is $0.20 per hour. Whenever you want, you can buy credits in increments of $10 which allow you to run streams. As your streams run we continuously compute the combined DPU and throughput cost and reduce your credit balance. If your balance drops to zero, your streams stop until you top up your balance.
2. Monthly subscription. You agree to buy a fixed number of DPU hours per month for a fixed price. As your streams run they consume your fixed DPU allowance and, separately, incur a variable licensing fee. The licensing fees are calculated depending on the licensing agreement we have with the provider. Assuming you don't exhaust your DPU allowance, your monthly bill will be the fixed cost plus the licensing fees. If you do exceed your DPU allowance, the excess DPU hours are charged at the on-demand rate. You must also set a variable cost limit for your monthly subscription to DataSift. The variable cost limit is the sum of:
As long as the combined total of your license costs and excess DPU costs are less than your set variable cost limit, you will be able to consume data normally. But if you run over your variable cost limit, your streams will stop, and you will receive the following error message as part of your stream:
*Subject to change.
A rule's data processing rate is known as soon as the rule is defined, but its throughput is impossible to predict; it can only be estimated. You might want to run some sample executions to get a feel for the throughput cost of a stream.
The DataSift billing system calculates the cost of using streams from the DPU rate and licensing costs. DataSift also allows you to monitor your usage by enabling notifications via email and the Dashboard. The notifications vary depending on the type of payment plan you are on.
If you choose the On Demand plan, you will receive notifications if your credit balance runs low or falls to zero.
In a monthly subscription, you can set a variable cost limit on your account. You will receive notifications when you are close to and if you reach your variable cost limit. You can set or change your variable cost limit any time during the billing cycle.
The first notification is triggered when you have used up 80 percent of your variable cost limit. For example, if you set your variable cost limit to $2,500, you will receive the first notification when you have used up $2,000 on your account. You will receive the second notification when you reach your variable cost limit, at which point we will stop your streams. It is good practice to monitor your usage and ensure that your variable cost limit is always high enough for you to be certain that you will not have any problems for the duration of the month.
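As a rough illustration of the two thresholds, the Python sketch below classifies a variable spend against a variable cost limit. The function name and the returned labels are made up for this example; only the 80 percent and 100 percent thresholds come from the text.

```python
from decimal import Decimal

WARNING_FRACTION = Decimal("0.8")  # first notification at 80 percent of the limit

def notification_state(variable_spend, variable_cost_limit):
    """Return "ok", "warning" or "limit reached" (illustrative labels)."""
    spend = Decimal(variable_spend)
    limit = Decimal(variable_cost_limit)
    if spend >= limit:
        return "limit reached"  # second notification; streams stop
    if spend >= limit * WARNING_FRACTION:
        return "warning"        # first notification
    return "ok"

print(notification_state("2000", "2500"))
print(notification_state("1600", "2000"))
print(notification_state("1600", "2500"))
```

The last two calls use the same spend with different limits: $1,600 is at the 80 percent mark of a $2,000 limit but below it once the limit is $2,500.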

Preview of notifications in Dashboard

Preview of notifications via email
If you notice that you are close to your variable cost limit and then you raise it, you might be below 80 percent of the new limit or you might be above 80 percent of the new limit; it all depends on where you set your new limit.
For example, if you set the variable cost limit to $2,000 on your account, you receive the first notification when you have used up $1,600. Suppose that you receive that notification and you raise the variable cost limit to $2,500. There are two possible scenarios to consider:
You can use Historics queries if you are on a monthly subscription, subject to one-time activation by your account manager. The cost of running a Historics query depends on data processing usage plus licensing costs, and the original DPU complexity of the stream you are running the query on.
Data processing usage for Historics is calculated based on the duration and sample size of the output data. The duration of the query is determined using the timeframe of the query, that is the duration between the start date and time, and the end date and time of the query. The sample size of the output data can be either 100 percent or 10 percent. For all Historics queries, there is a premium on the DPU usage compared to usage for live streaming. DPU usage for the 100% sample size is 125% of what you would pay for live streaming of the same filter. Similarly, for the 10% sample size, the DPU usage is 40% of what you would pay for live streaming of the same filter.
Hence, when you create a Historics query, DataSift is able to calculate the DPU usage before the query is executed. This DPU usage information is displayed on the Confirm New Historic Query page. When running a Historics query through the Historics API, you need to hit the historics/prepare endpoint to create a Historics query and get the total DPU breakdown for your Historics query before it is executed. DPU usage charges are deducted from the monthly DPU allowance.
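The sample-size multipliers can be sketched in Python as follows. This is only an estimate under one assumption: that live-streaming usage for the same filter equals its DPU rating multiplied by the number of hours in the query timeframe. The authoritative figure is the one returned by historics/prepare; the function below is hypothetical.

```python
from decimal import Decimal

# Multipliers from the text: 125% for a 100 percent sample, 40% for a 10 percent sample.
HISTORICS_MULTIPLIER = {100: Decimal("1.25"), 10: Decimal("0.40")}

def historics_dpu_hours(filter_dpu, hours, sample_percent):
    """Estimate DPU hours for a Historics query.

    Assumption: live usage = filter DPU rating x hours in the timeframe.
    """
    if sample_percent not in HISTORICS_MULTIPLIER:
        raise ValueError("sample size must be 100 or 10 percent")
    live_usage = Decimal(filter_dpu) * Decimal(hours)
    return live_usage * HISTORICS_MULTIPLIER[sample_percent]

# A filter rated at 1.5 DPU, queried over a 24-hour timeframe.
print(historics_dpu_hours("1.5", 24, 100))
print(historics_dpu_hours("1.5", 24, 10))
```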
On the other hand, licensing costs are calculated based on the volume of data retrieved for a particular Historics query. For a given CSDL filter, licensing costs for a Historics query of 100 percent sample size will be more than for a Historics query of 10 percent sample size.
You can view usage statistics for Historics queries on the Billing page. You can view total licensing costs and the DPU usage for your Historics queries. You can also view the volume of data retrieved by a Historics query and the number of Historics hours used. Alternatively, you can hit the usage endpoint in the DataSift API, which will give you a more accurate figure for the number of objects processed.

Historics Preview is available for all accounts, on any payment plan, be it Subscription or Pay As You Go. Each request has a fixed cost of 20 DPUs. There are no licensing fees charged for a Historics Preview since you will not actually be receiving any interactions matching your filter. You will only receive aggregate statistics for your selected filter.
The 20 DPUs are deducted from your account only after a complete and successful execution of your Historics Preview request. If your request gets interrupted while it is being processed, you won't get charged. You can only request a single Historics Preview per stream; if you request a new one, the previous request is overwritten.
In DataSift's GUI you can check the DPU breakdown:
1. Select a stream
2. Click View Definition
The DPU breakdown appears below your CSDL code.
DataSift's REST API provides a dpu endpoint that gives the total DPU cost for a rule and the breakdown of its individual elements.
api.datasift.com/dpu
For Historics, DataSift's REST API provides a historics/prepare endpoint that gives the total DPU breakdown for a Historics query.
api.datasift.com/historics/prepare
DataSift's REST API provides a usage endpoint that gives the number of objects processed.
api.datasift.com/usage
Some operators in CSDL have a fixed DPU cost while others have a variable cost.
For fixed-cost operators you simply multiply the number of times you use the operator in a stream by its DPU cost. For example, if you use the exists operator twice in a stream the cost is 0.2 DPUs.
| Operator or Keyword | DPUs |
|---|---|
| contains | variable - see below |
| substr | 0.1 |
| contains_any | variable - see below |
| contains_near | 0.2 |
| exists | 0.1 |
| in | variable - see below |
| comparisons (==, > and so on) | 0.1 |
| regular expressions | variable - see below |
| geo_box | 0.1 |
| geo_radius | 0.1 |
| geo_polygon | variable - see below |
| tag | variable - see below |
The DPU cost of a regular expression is calculated as:
cost = the number of characters in the expression divided by 100.
The minimum charge for one regular expression is 0.1 so, for example, a regular expression that includes 10 characters costs 0.1 DPUs while a regular expression that includes 100 characters costs 1.0 DPUs.
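A minimal Python sketch of this rule is shown below. It assumes the cost is not rounded beyond the 0.1 minimum, which the text does not state either way; the function name is illustrative.

```python
from decimal import Decimal

MINIMUM_REGEX_COST = Decimal("0.1")

def regex_dpu_cost(pattern):
    """DPU cost of one regular expression: characters / 100, minimum 0.1."""
    return max(MINIMUM_REGEX_COST, Decimal(len(pattern)) / 100)

print(regex_dpu_cost("a" * 10))   # 10 characters: the minimum applies
print(regex_dpu_cost("a" * 100))  # 100 characters
```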
The DPU cost of a geo_polygon depends on the number of vertices it has. To determine the DPU cost of any geo_polygon, divide the number of vertices by 30.
For example, a hexagon has 6 vertices so it has a DPU cost of 0.2. A triangle has 3 vertices so it has a DPU cost of 0.1.
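The same division can be written as a small Python helper. The name is hypothetical, and the check for at least three vertices is an added safeguard rather than something the text specifies.

```python
from decimal import Decimal

def geo_polygon_dpu_cost(vertex_count):
    """DPU cost of a geo_polygon: number of vertices divided by 30."""
    if vertex_count < 3:
        raise ValueError("a polygon needs at least three vertices")
    return Decimal(vertex_count) / 30

print(geo_polygon_dpu_cost(6))  # hexagon
print(geo_polygon_dpu_cost(3))  # triangle
```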
The DPU cost for the contains operator is based on the number of values you match against and the way you use the operator.
twitter.text contains "My dog ate my homework"
In this case, you can match against up to seven values for a cost of 0.1 DPU. The cost increases by 0.1 DPU for every eight extra words you add to the matching phrase. Here are the first few DPU cost bands.
| Maximum number of values | DPUs |
|---|---|
| 7 | 0.1 |
| 15 | 0.2 |
| 23 | 0.3 |
| 31 | 0.4 |
| 39 | 0.5 |
| and so on... |
For example this filter has just one word in the argument so it costs 0.1 DPU:
twitter.text contains "iPad"
This filter has eight words in the argument so it costs 0.2 DPU:
twitter.text contains "iPad is my favorite tablet device right now"
twitter.text contains "xxx" and
twitter.text contains "yyy" and
twitter.text contains "zzz"
In this case, matching against up to three values costs 0.1 DPU. The cost increases by 0.1 DPU for every four extra values you add. Here are the first few DPU cost bands.
| Maximum number of values | DPUs |
|---|---|
| 3 | 0.1 |
| 7 | 0.2 |
| 11 | 0.3 |
| 15 | 0.4 |
| 19 | 0.5 |
| and so on... |
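Both contains tables follow the same banded pattern, which the Python sketch below reproduces. It assumes the bands continue at the same width beyond the rows listed ("and so on..."), and all function names are illustrative. The /dpu endpoint remains the authoritative source for a real filter.

```python
from decimal import Decimal

def banded_dpu(count, band_width):
    """0.1 DPU per band; the first band holds band_width - 1 values,
    every later band holds band_width values."""
    if count < 1:
        raise ValueError("at least one value is required")
    return Decimal(count // band_width + 1) / 10

def contains_phrase_dpu(phrase):
    # Single contains with a phrase: bands end at 7, 15, 23, ... words.
    return banded_dpu(len(phrase.split()), 8)

def contains_chain_dpu(values):
    # Several contains conditions joined with "and": bands end at 3, 7, 11, ... values.
    return banded_dpu(len(values), 4)

print(contains_phrase_dpu("iPad"))
print(contains_phrase_dpu("iPad is my favorite tablet device right now"))
print(contains_chain_dpu(["xxx", "yyy", "zzz"]))
```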
The DPU cost for the in and contains_any operators is based on the number of values you match against. The following table shows the DPU cost for any filter that uses these operators.
For example, this filter matches against 10 values so it costs 0.2 DPUs.
twitter.text contains_any "apple, microsoft, hp, dell, oracle, google, yahoo, ebay, amazon, facebook"
| Maximum number of values | DPUs |
|---|---|
| 9 | 0.1 |
| 19 | 0.2 |
| 29 | 0.3 |
| 39 | 0.4 |
| ... | |
| 100 | 1 |
| 1,000 | 2 |
| 10,000 | 4 |
| 100,000 | 8 |
The exact cost is determined using a sliding scale, so if you have 99 values in the command, the cost will be slightly lower than 1 DPU. Note that the table shows how we calculate DPU costs for a list of single keywords. In practice, you will often write filters that use the contains_any keyword with a list of phrases of varying length. For example:
twitter.text contains_any "Yesterday, Yellow Submarine, The Long and Winding Road"
Since phrases take longer for DataSift to process than single keywords, the DPU cost is slightly higher. For example, a list of 30 single keywords with the contains_any operator incurs a DPU cost of 0.4. However, if you filter for 10 phrases, each of three words, the DPU cost is 0.5.
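For lists of single keywords, the first rows of the table can be reproduced with the Python sketch below. It deliberately refuses lists of more than 39 keywords and lists that contain phrases, because the text gives no exact formula for the sliding scale or for the phrase surcharge. The function name is hypothetical.

```python
from decimal import Decimal

def contains_any_dpu(argument):
    """DPU cost for in / contains_any with up to 39 single keywords.

    argument is the comma-separated list as written in the CSDL filter.
    """
    keywords = [k.strip() for k in argument.split(",") if k.strip()]
    if not keywords:
        raise ValueError("at least one keyword is required")
    if len(keywords) > 39:
        raise ValueError("beyond the listed bands; the sliding scale is not specified")
    if any(" " in k for k in keywords):
        raise ValueError("phrases cost more than single keywords; use the API to check")
    # Bands end at 9, 19, 29, 39 keywords.
    return Decimal(len(keywords) // 10 + 1) / 10

print(contains_any_dpu(
    "apple, microsoft, hp, dell, oracle, google, yahoo, ebay, amazon, facebook"
))
```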
We recommend that you check the DPU cost before you run a filter. The /compile endpoint returns a JSON object that includes the DPU cost.
Operators used inside a tag statement are normally charged at 10% of their usual DPU cost.
For example, if the normal cost of a rule is 1 DPU, that same code inside a tag statement would cost 0.1 DPU.
If the normal cost is less than 1 DPU, there is no charge.
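The tag discount can be sketched in Python as follows, taking the three statements above literally. The text says operators are "normally" charged this way, so treat the result as an estimate; the function name is illustrative.

```python
from decimal import Decimal

def tag_dpu(normal_dpu):
    """Estimated DPU cost of code placed inside a tag statement."""
    normal = Decimal(normal_dpu)
    if normal < 1:
        return Decimal("0")          # below 1 DPU there is no charge
    return normal * Decimal("0.1")   # otherwise 10% of the usual cost

print(tag_dpu("1"))
print(tag_dpu("0.5"))
```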
For each data source, discover how to fine tune the data you receive.
Feeds
Augmentations
DailyMotion and YouTube belong to the Videos family of feeds. Amazon, Flickr, IMDb, Reddit, Topix, and 2channel belong to the Boards family.
Note: This section describes the data that you can consume from DataSift. It does not describe the targets that you can filter against. Read Targets or Output Data? to learn more.
Objects flow into DataSift one by one. Each of them contains a rich variety of data, much more than you might expect. Think about a Tweet for a moment. As well as the 140-character payload that most people would say 'is' the Tweet, the object contains the author's screen name and a link to their photograph, the number of followers they have, the number of people they follow, and more than 40 other nuggets of information. Each Tweet that you send contains your entire Twitter bio hidden inside, for example.
In DataSift's filtering engine, these individual fields are called targets. For example, if you write this filter:
twitter.text contains "hippopotamus"
you're instructing DataSift to deliver every Tweet that contains the word hippopotamus and to exclude everything that isn't a Tweet and every Tweet that doesn't mention a hippo.
The targets and operators documentation describes filtering in detail, with examples of how to use every CSDL operator and every target.
To use DataSift effectively, developers also need to understand the data that is actually delivered in the output stream and to know how to control what is delivered. This process is entirely separate from filtering.
Here are some key things to learn first. Don't worry, it's simple.
Activating data sources. You're free to select which data sources you want to receive and which ones you don't want. For example, you might choose to take Twitter and Facebook only. Activate the sources on the Browse Data Sources page. However, you can filter on any piece of data, even if you have not activated it. Activating a source means that its data will be available in the output stream.
Understanding the data. A small number of data fields are not available for filtering. For example, the Salience Entities data is available in the output stream (if you activate the Salience Entities source) but if you look at the Salience target page you will not find it listed.
The Facebook feed delivers a full range of data concerning Facebook posts.
Note: The Facebook targets page explains how to filter against this data and describes the content.
There's information about the author of each post including their name, Facebook id, and a link to their profile page on Facebook.
The Facebook Open Graph information is available.
There are details of how many times each post has been "Liked" by other Facebook users, and there's the content of the post, of course.
There are five separate components in this feed and you can enable or disable them individually. While reading this table, look at the JSON output example below.
| If you enable this item: | You receive this data: |
|---|---|
| Facebook author | The "author" element in "facebook" |
| Facebook to | The "to" element in "facebook" |
| Facebook open graph | The "og" element in "facebook" |
| Facebook likes | The "likes" element in "facebook" |
| Facebook base | The "id", "message", "description", "caption", "type", "application", "source", "link", "name" elements in "facebook" |
Here's an example, in JSON format, of the data you receive when you activate the Facebook feed.
The bitly data source delivers one interaction each time someone follows a shortened bitly link.
Note: The bitly targets page explains how to filter against this data and describes the content.
Here's an example, in JSON format, of the data you receive when you activate the bitly feed.
The following example displays the data you receive when you use the bitly target along with the links augmentation.
The Twitter feed delivers a wide range of information about Tweets.
Note: The Twitter targets page explains how to filter against this data and describes the content.
Here's an example, in JSON format, of the data you receive when you activate the Twitter feed.
This object is an example of the format of a Retweet after DataSift has normalized and augmented the content:
In order to comply with Twitter's Terms of Service, if you are storing data, you must understand how to handle delete messages. Please read our page on Twitter Deletes.
In order to comply with Twitter's Terms of Service with regard to respecting users' privacy settings, you must also handle any User Status Messages you receive as part of your stream. Please read our page on Twitter User Status Messages.
DataSift's Board data source provides data from a variety of message boards around the world. The highest-traffic boards such as 2channel.net have their own data source in DataSift whereas the Board data source collects posts from many lower-volume message boards.
Note: The Boards targets page explains how to filter against this data and describes the content.
Here's an example, in JSON format, of the data you receive when you activate the Boards feed.
The Blog data source combines material from a wide variety of sites, ranging from well-known hosts such as Blogger with very large numbers of active users to small single-user sites that run as blogs or incorporate a blog.
Note: The Blogs targets page explains how to filter against this data and describes the content.
Here's an example, in JSON format, of the data you receive when you activate the Blog feed.
There are many video hosting sites apart from YouTube and, collectively, they hold a massive amount of data. The Video data source collects content from many of the lesser-known video hosting sites. Use this data source in conjunction with the YouTube and DailyMotion sources for maximum coverage.
Note: The Videos targets page explains how to filter against this data and describes the content.
Here's an example, in JSON format, of the data you receive when you activate the Videos feed.
The Gender Demographic augmentation is an in-house augmentation by DataSift. It analyzes an author's name and location to derive their likely gender.
The location is significant because usage of names might vary by country. For instance, Jan is likely to be a woman's name in Britain but a man's name in Scandinavian countries.
Note: The Gender Demographic augmentation page explains how to filter against this data and describes the content.
Here's an example, in JSON format, of the data you receive when you activate the Gender Demographic augmentation.
The Interaction data source is an in-house augmentation by DataSift. Objects from all the feeds that come into DataSift are also available in the Interaction augmentation.
Note: The Interaction augmentation page explains how to filter against this data and describes the content.
Here's an illustration. Suppose that a series of objects arrive in real time, one-by-one. The first happens to be a Tweet, the next is from Myspace, and the next is from Facebook. It's clear that objects coming from Twitter, from Myspace, and from Facebook have some fundamental differences. However, they have common features too. For example, there's the main payload of the message being conveyed, there's the author's name, and there might be geo information. DataSift identifies and collects these common features and makes them available in the Interaction augmentation.
In cases where you require only this type of core information, it is easier to extract it from the "interaction" family of values in the output object rather than to try to take the content of a Tweet from a Twitter object, and then the message from a Facebook object, and so on.
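As a sketch of that approach in Python, the function below reads the core fields from the "interaction" section of an output object and ignores any source-specific sections. The field names (type, content, author.name) appear in DataSift's list of output elements; the sample object itself is invented for illustration and is far smaller than a real one.

```python
import json

# Hypothetical, heavily trimmed output object (not real DataSift output).
sample = """
{
  "interaction": {
    "type": "twitter",
    "content": "I saw elephants today",
    "author": {"name": "Example Author"}
  },
  "twitter": {"text": "I saw elephants today"}
}
"""

def core_fields(raw_json):
    """Extract source, content and author name from the interaction section."""
    interaction = json.loads(raw_json).get("interaction", {})
    return {
        "source": interaction.get("type"),
        "content": interaction.get("content"),
        "author": interaction.get("author", {}).get("name"),
    }

print(core_fields(sample))
```

Because the function never touches the "twitter" section, the same code would handle an object from any other source that carries an "interaction" section.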
To receive data from a feed, you need to Activate it on the Browse Data Sources page. If you want to receive only the data that's automatically available in Interaction, keep the feed activated but disable all its individual components on the My Data Sources page.
Let's use Twitter as an example:
In this way, Twitter is activated but all of its components are disabled. Twitter data will be available and DataSift will deliver it to you in the interaction elements of the output object, but it will not bloat the output objects with all the Twitter elements. In the example below, we've indicated that the JSON object might or might not contain more than just the interaction values; it's entirely your choice.
So far, we've been looking at the data that DataSift delivers to you, but the interaction augmentation also makes filtering easier. It is quicker to write a filter that uses one target rather than several. There is no need to write this CSDL:
twitter.text contains "elephants"
or myspace.content contains "elephants"
or facebook.message contains "elephants"
You could write it in a shorter form like this:
interaction.content contains "elephants"
See the Interaction augmentation page for details of the Interaction targets that you can include in a CSDL filter.
Here's an example, in JSON format, of the data you receive when you activate the Interaction augmentation.
The Klout Score augmentation gives an author's main score from Klout.com.
Note: The Klout augmentation page explains how to filter against this data and describes the content.
Here's an example, in JSON format, of the data you receive when you activate the Klout Score augmentation.
The Klout Profile augmentation gives an author's main Klout score from Klout.com plus a detailed breakdown of their profile on Klout. For example, it includes their network impact, a measure of the influence of an author's audience.
See the Klout augmentation page for details of the Klout targets that you can include in a CSDL filter.
Here's an example, in JSON format, of the data you receive when you activate the Klout Profile augmentation.
The Language augmentation analyzes a post to determine which language it is written in.
Note: The Language augmentation page explains how to filter against this data and describes the content.
Here's an example, in JSON format, of the data you receive when you activate the Language augmentation.
The Links Meta augmentation gives you detailed information about a web page. This includes metadata specific to Facebook Open Graph, Twitter Cards, Google News and standard Search Engines.
Note: The Links augmentation page explains how to filter against this data and describes the content.
There are nine main components in this feed and you can enable or disable them individually. While reading this table, take a look at the JSON output examples below.
| If you enable this item: | You receive this data: |
|---|---|
| Links twitter_card | The "twitter" elements in "links.meta" |
| Links charset | The "charset" element in "links.meta" |
| Links lang | The "lang" element in "links.meta" |
| Links description | The "description" element in "links.meta" |
| Links keywords | The "keywords" element in "links.meta" |
| Links opengraph | The "opengraph" elements in "links.meta" |
| Links news:keywords | The "newskeywords" element in "links.meta" |
| Links standout | The "standout" element in "links.meta" |
| Links content_type | The "content_type" element in "links.meta" |
Of the components listed above:
Here's an example, in JSON format, of the data you receive when you filter using the Links Meta augmentation for Open Graph.
Here's an example of the data you receive when you filter using the Links Meta augmentation for Twitter Cards.
Here's an example of the data you receive when you filter using the Links Meta augmentation for Google News or standard Search Engines.
The Links augmentation gives details of hyperlinks that it finds in posts. It fully resolves all shortened and multiply shortened links.
Note: The Links augmentation page explains how to filter against this data and describes the content.
There are eighteen separate components in this feed, but not all of them have to be present in the JSON output (see the example near the end of this page). When they are, you will find them defined as keys of the links dictionary. The values of those keys may be lists of one or more items or dictionaries. Some of the links components are grouped inside the meta key and its sub-keys. The following list gives you an overview of the output hierarchy. While reading it, look at the JSON output example below.
The following examples of output illustrate differences in the JSON output you are likely to see depending on the type of data source used to generate them.
Here's a generic example, in JSON format, of the data you receive when you activate the Links augmentation.
Data source-specific augmentations will be listed as values of the meta key in the JSON output. The following examples show differences in output generated from Facebook Open Graph, Twitter Cards, Google News and standard Search Engines.
Here's an example, in JSON format, of the data you receive when you filter using the Links Meta augmentation for Open Graph.
Here's an example of the data you receive when you filter using the Links Meta augmentation for Twitter Cards.
Here's an example of the data you receive when you filter using the Links Meta augmentation for Google News or standard Search Engines.
The Salience Sentiment augmentation measures the positive or negative sentiment expressed in the content of a post or in its title. Salience Sentiment works with material written in English, French, Spanish, and Portuguese.
Note: The Salience augmentation page explains how to filter against this data and describes the content.
There are two separate components in this feed and you can enable or disable them individually. While reading this table, look at the JSON output example below.
| If you enable this item: | You receive this data: |
|---|---|
| Content sentiment | The "content":{"sentiment"} element in "salience" |
| Title sentiment | The "title":{"sentiment"} element in "salience" |
Here's an example, in JSON format, of the data you receive when you activate the Salience Sentiment augmentation.
The Salience Entities augmentation is an example of data that can be delivered in your output stream but is not filterable. The augmentation adds a list of the entities it finds in each post. An entity might be a job title, a product name, a company, a location, a person or a quote. It could also be a pattern of commonly found words that are typically followed by something that is easily predictable. For example: "The meeting will take place at...." is normally followed by a time of day.
DataSift's entities analysis is provided by Lexalytics. The entities engine independently analyzes the content and the title. With this source you will receive both.
Currently you can receive Salience Entities information in your output stream but you cannot filter on it. Take a look at Targets vs Output Data to learn about the difference between filtering on data and consuming that data.
See the Salience augmentation page for details of the other Salience targets that you can include in a CSDL filter.
Salience Entities works with material written in English, French, Spanish, and Portuguese.
Here's an example, in JSON format, of the data you might receive if you activate this augmentation.
The Demographic feed delivers high-level Tweet information with personal details such as the author's user-id, screen-name and avatar removed. It offers an excellent source of detailed data for statistical analysis.
This chart illustrates what happens when you activate the Demographics data source while a stream is running. Assuming that you activated Twitter before you ran the stream, you will initially receive Tweets which include the authors' name, username, and so on, along with the Twitter biography details that authors provide on their Settings page in Twitter. The data will not include detailed demographic data.

If you now activate the Demographics data source, the content that your stream delivers will change immediately. The stream now contains detailed demographic information but interactions are anonymized; they no longer contain names, usernames, or links to a user's profile page on Twitter.
You can switch back to Twitter any time you want, but remember that it is a global setting, applying to all of the streams currently running.
When you enable the Demographic data source, your Twitter data automatically ceases to arrive. To protect privacy, we ensure that Twitter and Demographics data are mutually exclusive. If you are running and recording a Twitter stream, be very careful. As soon as you activate Demographics, you will cease to receive Tweets. This applies to streams that are already running.
To preserve anonymity, we do not reveal Twitter users' ids when you choose to receive Demographics. Instead, we map each Twitter user id to a new id which is available in the hash_id element in the interaction section of the JSON output.
This process enables you to determine whether two messages are from the same author or not. Details of the way the new ids are generated are not available and there is no way to map the new id back to the original Twitter user id.
In this JSON output example, the id element, which is generated internally by DataSift, serves as a unique identifier for this interaction. It cannot be used to identify the author. The hash_id is derived from the original Twitter user id. For example, the Twitter user id for @DataSift is 155505157 but this user id would never appear in its original form in an anonymized Tweet. In case you're wondering, the hash_id shown above does not correspond to the @DataSift Twitter account.
Here's an example, in JSON format, of the data you receive when you activate the Twitter feed.
The structure of an anonymized Retweet is slightly different. The JSON contains the Retweet family of elements.
The output contains these elements.
| Element | Examples or Description |
|---|---|
| status | single, married, parents, engaged, divorced |
| status work | working, students, unemployed, retirees |
| type | person, company |
| sex | male, female |
| age_range start end | start can be: 0, 17, 20, 25, 30, 35, 40, 50, 60, 70. end can be: 16, 19, 24, 29, 34, 39, 49, 59, 69. |
| location country | |
| location us_state | The state name is presented in full, not as a 2-character code |
| location city | |
| accounts categories | actors, musicians, TV celebrities, comedians, news, tech brands, ... |
| likes_and_interests | music, fashion, news, sports, ... |
| first_language | |
| professions | musicians, journalists, artists, programmers, marketers, ... |
| services_and_technologies | Blogger, Tumblr, Wordpress, iPhone, Blackberry, ... |
| twitter activity | For example: 1-5 tweets/day |
| twitter accounts.large | Lady Gaga, Eminem, TwitPic, CNN News, ... |
| twitter accounts_followed | Number of accounts the author follows on Twitter |
| main_street shop_at | Ikea, Costco, Walmart, Office Depot, Target, ... |
| main_street dressed_by | Victoria's Secret, Burberry, Footlocker, Calvin Klein, Nike, ... |
| main_street eat_and_drink_at | McDonald's, Starbucks, Taco Bell, ... |
The Klout Topics augmentation gives an author's main Klout score from Klout.com plus a list of the topics that Klout believes they have a degree of influence over.
See the Klout augmentation page for details of the Klout targets that you can include in a CSDL filter.
Here's an example, in JSON format, of the data you receive when you activate the Klout Topics augmentation.
The NewsCred data source delivers news articles, images, and videos from more than 750 of the world's highest-quality sources, including leading financial and entertainment publications, in a fully license-compliant way.
Note: The NewsCred targets page explains how to filter against this data and describes the content.
Here's an example, in JSON format, of the data you receive when you activate the NewsCred data source. This object describes an image:
This object describes an article:
This object describes a video:
Here's a list of all the output elements your JSON might contain:
2ch->content
2ch->contenttype
2ch->created_at
2ch->domain
2ch->id
2ch->link
2ch->thread
2ch->title
2ch->type
amazon->author->link
amazon->content
amazon->contenttype
amazon->created_at
amazon->domain
amazon->id
amazon->link
amazon->thread
amazon->title
amazon->type
bitly->cname
bitly->country
bitly->country_code
bitly->created_at
bitly->domain
bitly->geo_city
bitly->geo->latitude
bitly->geo->longitude
bitly->geo_region
bitly->geo_region_code
bitly->id
bitly->referring_domain
bitly->referring_url
bitly->share->hash
bitly->timezone
bitly->type
bitly->url
bitly->url_hash
bitly->user->agent
blog->author->link
blog->author->name
blog->content
blog->contenttype
blog->created_at
blog->domain
blog->id
blog->link
blog->post->created_at
blog->post->link
blog->post->title
blog->title
blog->type
board->author->age
board->author->gender
board->author->link
board->author->location
board->author->registered
board->content
board->contenttype
board->created_at
board->domain
board->id
board->link
board->review->recommendation
board->review->ticker
board->thread
board->title
board->type
dailymotion->author->link
dailymotion->author->name
dailymotion->category
dailymotion->content
dailymotion->contenttype
dailymotion->created_at
dailymotion->duration
dailymotion->id
dailymotion->tags
dailymotion->title
dailymotion->type
dailymotion->videolink
demographic->gender
facebook->application
facebook->author->avatar
facebook->author->id
facebook->author->link
facebook->author->name
facebook->caption
facebook->created_at
facebook->description
facebook->id
facebook->likes->count
facebook->link
facebook->message
facebook->name
facebook->og->
facebook->og->by
facebook->og->length
facebook->og->page
facebook->og->par
facebook->og->photos
facebook->og->published time
facebook->source
facebook->type
flickr->author->link
flickr->content
flickr->contenttype
flickr->created_at
flickr->domain
flickr->id
flickr->link
flickr->thread
flickr->title
flickr->type
imdb->author->link
imdb->content
imdb->contenttype
imdb->created_at
imdb->domain
imdb->id
imdb->link
imdb->thread
imdb->title
imdb->type
interaction->author->avatar
interaction->author->contributions
interaction->author->id
interaction->author->link
interaction->author->name
interaction->author->talk
interaction->author->username
interaction->content
interaction->contenttype
interaction->created_at
interaction->geo->latitude
interaction->geo->longitude
interaction->id
interaction->link
interaction->schema->version
interaction->source
interaction->title
interaction->type
klout->score
language->confidence
language->tag
links->meta->keywords
links->meta->opengraph->activity
links->meta->opengraph->author
links->meta->opengraph->cause
links->meta->opengraph->city
links->meta->opengraph->country
links->meta->opengraph->description
links->meta->opengraph->director
links->meta->opengraph->email
links->meta->opengraph->fax_number
links->meta->opengraph->geo->latitude
links->meta->opengraph->geo->longitude
links->meta->opengraph->image
links->meta->opengraph->locality
links->meta->opengraph->musician
links->meta->opengraph->non_profit
links->meta->opengraph->phone_number
links->meta->opengraph->region
links->meta->opengraph->site_name
links->meta->opengraph->sport
links->meta->opengraph->title
links->meta->opengraph->type
links->meta->opengraph->url
links->meta->opengraph->website
links->meta->twitter->app->googleplay->id
links->meta->twitter->app->googleplay->name
links->meta->twitter->app->googleplay->url
links->meta->twitter->app->ipad->id
links->meta->twitter->app->ipad->name
links->meta->twitter->app->ipad->url
links->meta->twitter->app->iphone->id
links->meta->twitter->app->iphone->name
links->meta->twitter->app->iphone->url
links->meta->twitter->card
links->meta->twitter->creator
links->meta->twitter->creator_id
links->meta->twitter->description
links->meta->twitter->image
links->meta->twitter->image_height
links->meta->twitter->image_width
links->meta->twitter->player
links->meta->twitter->player_height
links->meta->twitter->player_stream
links->meta->twitter->player_stream_content_type
links->meta->twitter->player_width
links->meta->twitter->site
links->meta->twitter->site_id
links->meta->twitter->title
links->meta->twitter->url
newscred->article->category
newscred->article->content
newscred->article->domain
newscred->article->fulltext
newscred->article->link
newscred->article->title
newscred->article->topics->name
newscred->id
newscred->image->attribution->text
newscred->image->caption
newscred->image->links->large
newscred->image->links->small
newscred->image->size->height
newscred->image->size->width
newscred->modified_at
newscred->published_at
newscred->source->circulation
newscred->source->company_type
newscred->source->country
newscred->source->domain
newscred->source->founded
newscred->source->frequency
newscred->source->id
newscred->source->link
newscred->source->name
newscred->source->owner
newscred->source->thumbnail
newscred->type
newscred->updated
newscred->video->category
newscred->video->domain
newscred->video->link
newscred->video->thumbnail
newscred->video->title
newscred->video->topics->name
reddit->author->link
reddit->content
reddit->contenttype
reddit->created_at
reddit->domain
reddit->id
reddit->link
reddit->thread
reddit->title
reddit->type
salience->content->entities->about
salience->content->entities->confident
salience->content->entities->evidence
salience->content->entities->label
salience->content->entities->name
salience->content->entities->sentiment
salience->content->entities->type
salience->content->sentiment
salience->content->topics->hits
salience->content->topics->name
salience->content->topics->score
salience->title->entities->about
salience->title->entities->confident
salience->title->entities->evidence
salience->title->entities->label
salience->title->entities->name
salience->title->entities->sentiment
salience->title->entities->type
salience->title->sentiment
salience->title->topics->hits
salience->title->topics->name
salience->title->topics->score
topix->author->link
topix->author->location
topix->author->registered
topix->content
topix->contenttype
topix->created_at
topix->domain
topix->id
topix->link
topix->thread
topix->title
topix->type
twitter->created_at
twitter->filter_level
twitter->geo->latitude
twitter->geo->longitude
twitter->id
twitter->in_reply_to_screen_name
twitter->in_reply_to_status_id
twitter->in_reply_to_user_id
twitter->lang
twitter->media->display_url
twitter->media->expanded_url
twitter->media->id
twitter->media->id_str
twitter->media->media_url
twitter->media->media_url_https
twitter->media->sizes->large->h
twitter->media->sizes->large->resize
twitter->media->sizes->large->w
twitter->media->sizes->medium->h
twitter->media->sizes->medium->resize
twitter->media->sizes->medium->w
twitter->media->sizes->small->h
twitter->media->sizes->small->resize
twitter->media->sizes->small->w
twitter->media->sizes->thumb->h
twitter->media->sizes->thumb->resize
twitter->media->sizes->thumb->w
twitter->media->source_status_id
twitter->media->source_status_id_str
twitter->media->type
twitter->media->url
twitter->place->attributes->locality
twitter->place->attributes->region
twitter->place->attributes->street_address
twitter->place->country
twitter->place->country_code
twitter->place->full_name
twitter->place->id
twitter->place->name
twitter->place->place_type
twitter->place->url
twitter->retweet->count
twitter->retweet->created_at
twitter->retweeted->created_at
twitter->retweeted->geo->latitude
twitter->retweeted->geo->longitude
twitter->retweeted->id
twitter->retweeted->place->attributes->locality
twitter->retweeted->place->attributes->region
twitter->retweeted->place->attributes->street_address
twitter->retweeted->place->country
twitter->retweeted->place->country_code
twitter->retweeted->place->full_name
twitter->retweeted->place->id
twitter->retweeted->place->name
twitter->retweeted->place->place_type
twitter->retweeted->place->url
twitter->retweeted->source
twitter->retweeted->user->created_at
twitter->retweeted->user->description
twitter->retweeted->user->favourites_count
twitter->retweeted->user->followers_count
twitter->retweeted->user->friends_count
twitter->retweeted->user->geo_enabled
twitter->retweeted->user->id
twitter->retweeted->user->id_str
twitter->retweeted->user->lang
twitter->retweeted->user->listed_count
twitter->retweeted->user->location
twitter->retweeted->user->name
twitter->retweeted->user->profile_image_url
twitter->retweeted->user->screen_name
twitter->retweeted->user->statuses_count
twitter->retweeted->user->time_zone
twitter->retweeted->user->url
twitter->retweeted->user->utc_offset
twitter->retweeted->user->verified
twitter->retweet->id
twitter->retweet->lang
twitter->retweet->media->display_url
twitter->retweet->media->expanded_url
twitter->retweet->media->id
twitter->retweet->media->id_str
twitter->retweet->media->media_url
twitter->retweet->media->media_url_https
twitter->retweet->media->sizes->large->h
twitter->retweet->media->sizes->large->resize
twitter->retweet->media->sizes->large->w
twitter->retweet->media->sizes->medium->h
twitter->retweet->media->sizes->medium->resize
twitter->retweet->media->sizes->medium->w
twitter->retweet->media->sizes->small->h
twitter->retweet->media->sizes->small->resize
twitter->retweet->media->sizes->small->w
twitter->retweet->media->sizes->thumb->h
twitter->retweet->media->sizes->thumb->resize
twitter->retweet->media->sizes->thumb->w
twitter->retweet->media->source_status_id
twitter->retweet->media->source_status_id_str
twitter->retweet->media->type
twitter->retweet->media->url
twitter->retweet->source
twitter->retweet->text
twitter->retweet->user->created_at
twitter->retweet->user->description
twitter->retweet->user->favourites_count
twitter->retweet->user->followers_count
twitter->retweet->user->friends_count
twitter->retweet->user->geo_enabled
twitter->retweet->user->id
twitter->retweet->user->id_str
twitter->retweet->user->lang
twitter->retweet->user->listed_count
twitter->retweet->user->location
twitter->retweet->user->name
twitter->retweet->user->profile_image_url
twitter->retweet->user->screen_name
twitter->retweet->user->statuses_count
twitter->retweet->user->time_zone
twitter->retweet->user->url
twitter->retweet->user->utc_offset
twitter->retweet->user->verified
twitter->source
twitter->status
twitter->text
twitter->user->created_at
twitter->user->description
twitter->user->favourites_count
twitter->user->followers_count
twitter->user->friends_count
twitter->user->geo_enabled
twitter->user->id
twitter->user->id_str
twitter->user->lang
twitter->user->listed_count
twitter->user->location
twitter->user->name
twitter->user->profile_image_url
twitter->user->screen_name
twitter->user->statuses_count
twitter->user->time_zone
twitter->user->url
twitter->user->utc_offset
twitter->user->verified
video->author->link
video->author->name
video->category
video->content
video->contenttype
video->created_at
video->duration
video->id
video->tags
video->title
video->type
video->videolink
wikipedia->author->contributions
wikipedia->author->link
wikipedia->author->talk
wikipedia->author->username
wikipedia->body
wikipedia->changetype
wikipedia->created_at
wikipedia->diff->from
wikipedia->diff->htmldiff
wikipedia->diff->to
wikipedia->id
wikipedia->langlinks->lang
wikipedia->langlinks->title
wikipedia->langlinks->url
wikipedia->links->link
wikipedia->links->namespace
wikipedia->links->ns
wikipedia->namespace
wikipedia->newlen
wikipedia->ns
wikipedia->oldlen
wikipedia->pageid
wikipedia->parentid
wikipedia->previousid
wikipedia->sections->anchor
wikipedia->sections->byteoffset
wikipedia->sections->fromtitle
wikipedia->sections->index
wikipedia->sections->level
wikipedia->sections->line
wikipedia->sections->number
wikipedia->sections->toclevel
wikipedia->title
wikipedia->type
youtube->author->link
youtube->author->name
youtube->category
youtube->commentslink
youtube->content
youtube->contenttype
youtube->created_at
youtube->duration
youtube->id
youtube->title
youtube->type
youtube->videolink
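The arrow notation in the list above can be read as a path through the nested JSON objects that arrive in an output stream. The following is a minimal sketch, assuming each interaction has already been parsed into a Python dictionary. The helper name `get_path` and the sample values are hypothetical and are not part of any DataSift client library.

```python
from typing import Any, Optional


def get_path(interaction: dict, path: str, default: Optional[Any] = None) -> Any:
    """Follow an arrow-separated path such as 'twitter->user->screen_name'.

    Returns `default` when any step of the path is missing, because not
    every interaction carries every element.
    """
    node: Any = interaction
    for key in path.split("->"):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Hypothetical interaction; the values are invented for illustration.
sample = {
    "interaction": {"type": "twitter"},
    "twitter": {"user": {"screen_name": "example_user", "followers_count": 42}},
}

print(get_path(sample, "twitter->user->followers_count"))
print(get_path(sample, "klout->score", default="n/a"))
```

Returning a default instead of raising an error keeps the consumer running when an augmentation is not present on a particular interaction.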
The Salience Topics data source gives the list of topics that Salience found in the content or title of a post. See the Salience augmentation page for details of the Salience targets that you can include in a CSDL filter. Note: this augmentation is available only for material written in English.
Read more about Salience Topics.
Here's an example, in JSON format, of the data you receive when you activate the Salience Topics data source.
The Wikipedia feed delivers information relating to edits and new contributions at Wikipedia.org.
Note: The Wikipedia targets page explains how to filter against this data and describes the content.
Here's an example, in JSON format, of the data you receive when you activate the Wikipedia feed.
DataSift's topics analysis engine independently analyzes the content and the title of a post and derives a list of topics for each. The topics are:
| Advertising | Disasters | Investing | Social Media |
| Agriculture | Economics | Labor | Software and Internet |
| Art | Education | Law | Space |
| Automotive | Elections | Marriage | Sports |
| Aviation | Fashion | Mobile Devices | Technology |
| Banking | Food | Politics | Traditional Energy |
| Beverages | Hardware | Real Estate | Travel |
| Biotechnology | Health | Renewable Energy | Video Games |
| Business | Hotels | Robotics | War |
| Crime | Intellectual Property | Science | Weather |
Any post might contain multiple topics. For example, a Tweet such as "I'm flying to the K-12 Science Fair in Seattle this morning" might match "Education", "Science", and "Travel".
The Salience topics are derived from an analysis of the entire content of Wikipedia™. Each Salience topic is defined by a list of keywords. For example, Aviation is defined as "aviation, airplane, flying".
Here are some more examples:
To learn more, take a look at Salience's documentation on topics.
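To make the idea of keyword-defined topics concrete, here is a deliberately naive sketch of matching a post against keyword lists. It is not Salience's actual algorithm, which is not described here. Only the Aviation keywords come from the text above; the Weather keywords and the function name `match_topics` are hypothetical.

```python
import re
from typing import Dict, List

# "Aviation" keywords are taken from the text; "Weather" keywords are
# hypothetical and exist only to show a post matching more than one topic.
TOPIC_KEYWORDS: Dict[str, List[str]] = {
    "Aviation": ["aviation", "airplane", "flying"],
    "Weather": ["rain", "storm", "forecast"],
}


def match_topics(text: str) -> List[str]:
    """Return every topic with at least one keyword present in the text."""
    words = set(re.findall(r"[a-z0-9-]+", text.lower()))
    return sorted(
        topic for topic, keywords in TOPIC_KEYWORDS.items() if words & set(keywords)
    )


print(match_topics("Flying through a storm in a tiny airplane"))
```

Because each topic is checked independently, a single post can match several topics at once, which mirrors the behavior described above.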
The Trends augmentation gives details of trending topics on Twitter.
Here's an example, in JSON format, of the data you receive when you activate the Trends augmentation.
As DataSift's platform and Historics product have developed, the schema of the Historical Archive has evolved. It originally held only unaugmented Tweets; today it covers the wide range of augmented data sources you have access to. We have broken this evolving schema down by the dates on which it changed.
The existing schema versions are:
August 2012 to October 2012
November 2012 to present
Pre-December 2011
This early portion of our Historics archive contains data that is only partially augmented. So, if you are filtering purely on an augmentation, you may not receive results. For example, if the interactions in the timeframe you are querying have not had the links augmentation applied, this filter will not return data:
This Archive does not include the following:
December 2011 to April 2012
During this period the schema was subject to change. Many new augmentations were added, and some were modified. Please ensure that your DataSift stream consumer can handle unexpected fields.
July 2012
During this period the schema was subject to change. Please ensure that your DataSift stream consumer can handle unexpected fields.
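One way to handle unexpected fields is to read only the parts of each object that your application knows about and to record anything else, rather than failing on it. The sketch below assumes that interactions arrive as JSON strings. The names `KNOWN_FIELDS` and `parse_interaction` and the field `new_augmentation` are hypothetical.

```python
import json
from typing import Any, Dict

# Top-level fields this hypothetical consumer knows how to process.
KNOWN_FIELDS = ("interaction", "twitter", "links")


def parse_interaction(raw: str) -> Dict[str, Any]:
    """Parse one JSON interaction, tolerating missing and unexpected fields."""
    data = json.loads(raw)
    # Keep only the fields we understand; absent ones are simply skipped.
    known = {key: data[key] for key in KNOWN_FIELDS if key in data}
    # Record, but do not fail on, fields added by later schema versions.
    unknown = sorted(set(data) - set(KNOWN_FIELDS))
    return {"known": known, "unknown_fields": unknown}


# Invented payload: "new_augmentation" stands in for a field the consumer
# has never seen, and "links" is absent, as it may be in older archive data.
raw = '{"interaction": {"content": "hello"}, "new_augmentation": {"x": 1}}'
result = parse_interaction(raw)
print(result["known"]["interaction"]["content"])
print(result["unknown_fields"])
```

Logging the unknown field names, instead of discarding them silently, makes it easier to notice when the schema of the period you are querying differs from what you expected.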
We offer client libraries for the most popular programming languages.
For each library, you can find out how to download it, check which features it supports, and see step-by-step examples of how to benefit from our code.
Our engineering team is always looking at new candidate languages. If your favorite language does not yet have a DataSift client library, contact Support for the latest news.
We're happy to see that the DataSift development community has already started to add to the set of libraries we provided at launch.
A key point to understand is that targets are not the same as output data:
You will encounter three different cases: when data is available in both a target and in the output, when it is available only as a target, and when it is available only in the output.
Take a look at our complete list of data elements in your output streams.
The most common case is when information is available both as a target and in the output. For example, when you filter for people on Twitter who have more than 1,000,000 followers, the output that you receive will include their exact follower count.
twitter.user.followers_count > 1000000
In this CSDL code, twitter.user.followers_count is the target.
You can see the output element, in JSON format, in the following code snippet:
In this example, your CSDL filter tells you nothing about the content that you will receive. The filter simply finds all Tweets posted by people with a large number of followers.
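As an illustration of how the target relates to the output element, the sketch below reads the follower count from a parsed output object. The JSON payload is a hypothetical, heavily trimmed example with invented values; a real interaction contains many more fields.

```python
import json

# Hypothetical, trimmed output object; the values are invented.
raw_output = (
    '{"twitter": {"text": "Hello world", '
    '"user": {"screen_name": "example_user", "followers_count": 1250000}}}'
)

interaction = json.loads(raw_output)

# The CSDL target twitter.user.followers_count corresponds to this
# nested element in the output.
followers = interaction["twitter"]["user"]["followers_count"]

# Re-checking the filter condition on the client side, for illustration only.
matches_filter = followers > 1000000
print(followers, matches_filter)
```

The filter decides which interactions are delivered; the consumer then reads whatever values it needs, such as the exact follower count, from the delivered object.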
You will see cases where information is available as a target so you can filter on it, but it is not in the output. For example:
interaction.sample < 10
Finally, you will find cases where information is not available as a target, so it is not filterable, but it does appear in the output. For example, the created_at field is often present in the output data but it is not a target.
If you enable the Salience augmentation, you will find that the Salience Entities data is delivered but you cannot filter against it.

DataSift is simple to use: you build filters to find the information you want.
If you're new to the product, we suggest you read these pages first.
1. The filters are written in our programming language. It's called the Curated Stream Definition Language (CSDL).
2. The key feature of CSDL is its set of operators.
3. The filters take information from social media sites and augmentations. Together these are called targets.
4. You can write applications that access DataSift through our APIs.
5. You can test your filters in our user interface before you start to explore the APIs.
6. Please read our Understanding Billing guide and Billing FAQ.
7. Please read Targets vs Output Data to understand the difference between filtering data and consuming data.