Introduction

DataSift allows you to filter in real time for what you need from the torrent of data flooding out of social media sites. You can sift by specifying conditions on any of the attributes present in the raw data objects, including all their metadata. You can refine the process by adding conditions on an array of supplementary features relating to profile, context, and other analysis functions such as sentiment and language. Furthermore, you can combine elementary conditions or complete rules through logical composition, to whatever degree of complexity you require.

To make the expression of your filter clear and concise, we have developed our own curation language, the Curated Stream Definition Language (CSDL). With CSDL you can express powerful, precise filters, and you can write your own rules to enrich the selected data with your own feature tags.

Here's an example, just for illustration, of a complex filter. Imagine that you want to look at information from Twitter that mentions the iPad. Suppose you want to include content written in English or Spanish but exclude any other languages, select only content written within 100 kilometers of New York City, and exclude Tweets that have been retweeted fewer than five times. You can write all of that in just four lines of CSDL.
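
The filter described above might look roughly like the following sketch. The target names (twitter.text, language.tag, twitter.geo, twitter.retweet.count), the coordinates chosen for New York City, and the assumption that the geo_radius radius is given in kilometers are illustrative, so check them against the CSDL reference before relying on them.

```csdl
// Illustrative sketch only: target names, coordinates, and radius unit are assumptions.
twitter.text contains "iPad"
AND language.tag in "en,es"
AND twitter.geo geo_radius "40.7128,-74.0060:100"
AND twitter.retweet.count >= 5
```

Each line is one condition, and the AND operators require all four to hold for an object to enter the stream.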

Filtering

CSDL allows you to:

  • define filtering conditions that specify what information your stream will include
  • augment the objects in the stream with data from third-party sites (for example, adding the author's gender)
  • augment the objects in the stream based on analysis of the content (for example, detecting positive or negative sentiment)
  • augment the objects in the result stream with metadata tags according to your own rules (for example, adding an "Apple" tag to objects that include mention of "iPad" or "Mac").

A simple filtering condition usually has three elements:

    target + operator + argument

The target shows DataSift where to look for the information you need. For example, the target might be the 140-character string containing a Tweet, the sentiment, positive or negative, expressed in some Myspace content, or the language in which a post is written.

The operator defines the comparison DataSift should make. For example, a filter might search for Tweets from authors who have more than 100 followers. The operator here would be greater than. It might search for a Tweet that mentions football. The operator in this case would be contains. Or it might search for messages that include geographical information. This would employ the exists operator.
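
The three operator examples above could be written along these lines. This is a sketch: the target names (twitter.user.followers_count, twitter.text, interaction.geo) are assumptions to verify against the CSDL target reference, and the three conditions are shown as separate filters, not as one filter.

```csdl
// Authors whose follower count is greater than 100 (numeric target, numeric argument)
twitter.user.followers_count > 100

// Tweets that mention football (string target, string argument)
twitter.text contains "football"

// Messages that carry geographical information (exists takes no argument)
interaction.geo exists
```

The last condition shows why a condition only usually has three elements: exists checks that the target is present and needs no argument.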

More complex filters can be built by combining simple filters using the logical operators: AND, OR, and NOT.
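
As a hedged sketch of logical composition, the conditions below are combined with AND, OR, and NOT, with parentheses grouping the OR clause. The target names are assumptions, not taken from this page.

```csdl
// Football or soccer content, from authors with more than 100 followers,
// excluding anything written in Spanish.
(twitter.text contains "football" OR twitter.text contains "soccer")
AND twitter.user.followers_count > 100
AND NOT language.tag == "es"
```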

Once a result stream has been defined through a filter, you can reuse it as part of the definition of another stream by using the stream keyword to combine it with further filtering conditions.
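
A minimal sketch of stream reuse follows. The 32-character identifier is a made-up placeholder standing in for the identifier of a stream you have already defined, and language.tag is an assumed target name.

```csdl
// Start from an existing stream (placeholder identifier) and keep only English content.
stream "0123456789abcdef0123456789abcdef"
AND language.tag == "en"
```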

The argument is the value that the operator compares the target against. It can be any value, as long as the type of the argument matches the type of the target.

Note: CSDL allows you to create large and powerful filters. Each one can be up to 1 MB.

Tagging

Tagging, enabled by DataSift VEDO, allows you to add your own metadata to interactions. For example, brokers monitoring a set of stocks could filter for the names of any of the companies involved and add the appropriate stock ticker as a tag.

Adding tags based on conditional rules gives you the power to code business rules directly into your CSDL code. Since you are free to choose your own tags, you can build taxonomies to fit your business, specific to the problem you're working on. Tags allow you to deliver interactions with custom metadata to your applications.
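
The stock ticker scenario could be sketched as follows, assuming the tag and return statement forms and the interaction.content target. The company names and tickers are chosen purely for illustration.

```csdl
// Tagging rules: attach a ticker to any interaction that mentions the company.
tag "AAPL" { interaction.content contains "Apple" }
tag "MSFT" { interaction.content contains "Microsoft" }

// Filter: decides which interactions enter the stream at all.
return {
    interaction.content contains_any "Apple, Microsoft"
}
```

In this sketch the tag rules do not filter anything themselves. The return block selects the interactions, and each tag rule then labels those that match its own condition.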