Language Guide

Introduction

DataSift allows you to filter in real time for what you need from the torrent of data flooding out of social media sites. You can sift by specifying conditions on any of the attributes present in the raw data objects, including all their metadata. You can refine the process by adding conditions on an array of supplementary features relating to profile, context, and other analysis functions such as sentiment and language. Furthermore, you can combine elementary conditions or complete rules by logical composition to whatever degree of complexity you require.

To make the expression of your filter clear and concise we have developed our own curation language called the Curated Stream Definition Language, CSDL. Using the CSDL you can express powerful filters of unlimited precision and you can even express your own rules to enrich your selected data with your own feature tags.

Here's an example, just for illustration, of a complex filter that you could build with only four lines of CSDL code: imagine that you want to look at information from Twitter that mentions the iPad. Suppose you want to include content written in English or Spanish but exclude any other languages, select only content written within 100 kilometers of New York City, and exclude Tweets that have been retweeted fewer than five times. You can write that in just four lines of CSDL!
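
As a rough sketch only, the filter described above might look something like the following. The target names language.tag, interaction.geo, and twitter.retweet.count, the geo_radius argument format (latitude,longitude:radius), and the coordinates used for New York City are assumptions for illustration rather than details given in this guide, so check them against the target and operator reference before relying on them.

    interaction.content contains "iPad"
    AND language.tag in "en,es"                              // assumed target name
    AND interaction.geo geo_radius "40.7142,-74.0064:100"    // assumed target, format, and coordinates
    AND twitter.retweet.count >= 5                           // assumed target name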

DataSift's programming language, the Curated Stream Definition Language, allows you to:

  • define filtering conditions that specify what information your stream will include
  • augment the objects in the stream with data from third-party sites (for example, adding the author's gender)
  • augment the objects in the stream based on analysis of the content (for example, detecting positive or negative sentiment)
  • augment the objects in the result stream with metadata tags according to your own rules (for example, adding an "Apple" tag to objects that include mention of "iPad" or "Mac").

A simple filtering condition usually has three elements.

    target + operator + argument

Target

The target shows DataSift where to look for the information you need. For example, the target might be the 140-character string containing a Tweet, the sentiment, positive or negative, expressed in some Myspace content, or the language in which a post is written.

Operator

The operator defines the comparison DataSift should make. For example, it might search for Tweets from authors who have at least 100 followers; the operator here would be greater than or equal to. It might search for a Tweet that mentions football; the operator in this case would be contains. Or it might search for messages that include geographical information, which would employ the exists operator.

More complex filters can be built by combining simple filters using the logical operators: AND, OR, and NOT.

Once a result stream has been defined through a filter, you can reuse it as part of the definition of another stream by using the stream keyword to combine it with further filtering conditions.

To add your own metadata tags, use the tag keyword.

Argument

The argument can be any value, as long as the type of the argument matches the type of the target.
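
To make the three elements concrete, here is a sketch of the conditions described in the Operator section. The target names twitter.user.followers_count and interaction.geo are assumptions for illustration; interaction.content is used elsewhere in this guide. Each line is a separate condition, not one filter.

    // target                     operator   argument
    twitter.user.followers_count  >=         100           // assumed target name
    interaction.content           contains   "football"
    interaction.geo               exists                   // assumed target name; exists takes no argument

Simple conditions like these can then be joined with the logical operators, for instance:

    interaction.content contains "football" AND twitter.user.followers_count >= 100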


Advanced Features

CSDL comes with a range of advanced features that allow much more complex filtering. You can include parent streams in a definition, and you can have DataSift tag content for you, which removes the burden of doing it on the client side.

There's only one limit, and most developers are unlikely to run into it: the CSDL code in your filter must not exceed 1 MB. Fortunately, there's an easy workaround that employs the stream keyword. If your definition would exceed 1 MB, you can have it call one or more other streams. In this way, you can distribute your code across many chunks of CSDL code.

Here are some additional keywords and techniques you can employ to optimize streams, reduce costs, and create even more complex and powerful filters.

  

Tag keyword  

Use the tag keyword to add metadata to the output objects coming from a stream. Advanced CSDL developers employ the tag keyword very frequently because it makes subsequent analysis easy.

 

Stream keyword

Use the stream keyword to include an existing stream in a new stream definition or merge two or more streams.

 

Regular expressions

Use regular expressions to create super-powerful stream definitions.

 

Selecting data sources  

You can select or exclude individual social media sites in a variety of ways. 

 

Optimization

You can optimize your CSDL code to run more efficiently and to minimize costs.

 

Sampling data

You can sample data rather than drinking from the entire firehose. For instance, you can create a filter that samples just a percentage of the input objects flowing into DataSift. This approach is particularly useful if you're performing statistical analysis where, for example, just 10 percent of the data might be enough to form a representative sample.
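
A minimal sketch of a 10 percent sample. It assumes a target named interaction.sample (a name not given in this guide) that assigns each incoming object a value between 0 and 100:

    // Assumed target: interaction.sample, taken to range from 0 to 100
    interaction.content contains "football" AND interaction.sample < 10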

 

 


Selecting or Excluding Data Sources

There are two simple ways to choose which sites to accept data from: either name the sources explicitly or use the Common target.

Name the Sources

You can name the source explicitly; for example:
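
A sketch that takes content only from Twitter, assuming twitter.text is the target that holds the text of a Tweet:

    twitter.text contains "apple"    // twitter.text is an assumed target name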

This works for multiple sources but it can become cumbersome:
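
A sketch with two sources, which already needs one condition per source. Both twitter.text and facebook.message are assumed target names:

    // Assumed target names: twitter.text, facebook.message
    twitter.text contains "apple" OR facebook.message contains "apple"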

Use a Common Target

Include sources with interaction.type:
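
A sketch, assuming that interaction.type holds the lowercase name of the source, such as "twitter" or "facebook":

    // Source names are assumed values of interaction.type
    interaction.content contains "apple" AND interaction.type in "twitter,facebook"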

Exclude sources with interaction.type:
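
A sketch that drops Twitter content, assuming that interaction.type holds the lowercase name of the source:

    // "twitter" is an assumed value of interaction.type
    interaction.content contains "apple" AND interaction.type != "twitter"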


Tag keyword

The tag keyword in CSDL allows you to add metadata to objects in DataSift. Imagine you are running a stream to look for news about new car launches. You can add an "Audi" tag to all the objects that mention Audi, "Jeep" to all the objects that mention Jeeps, and so on.

Tags make subsequent processing easier. Just as augmentations enrich objects based on information supplied by third-party sites, tags enrich objects according to rules you define in your CSDL code.

To define a tag, use the tag keyword followed by the tag's name in quotes, then write the condition in braces, like this:

    tag "Audi" { interaction.content contains "Audi" }

Then enclose the body of the filter in a return statement:

    return {
       interaction.content contains_any "Detroit auto show,
                                         New York auto show,
                                         Geneva motor show"
    }

There is no limit to the number of tags you can use.

You can also add tags to existing streams or streams defined by other people. For instance:

    tag "APPLE" { stream "f0596f03644177b8bf5b59708a08bfe8" }

    return {
        stream "f0596f03644177b8bf5b59708a08bfe8"
    }

However, tags do not cascade. If that original stream had tags, they will not be present in your output stream. The only tag that you will see in output objects is the "APPLE" tag shown in the code above. In other words, only the top-level tags will be available for further processing.

Tags are part of the interaction object in a Tweet.

When you add tags to your CSDL, they can increase your billing for that stream slightly.

Examples

1.   Filter for posts that mention "Apple" and add multiple tags.
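
A sketch of what such a stream might look like. The tag names and keywords are illustrative choices:

    // Illustrative tag names and keywords
    tag "iPad" { interaction.content contains "iPad" }
    tag "Mac"  { interaction.content contains "Mac" }

    return {
        interaction.content contains "Apple"
    }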

2.   Tags are also useful when you merge several streams into one master stream. You can add a tag to each input object to identify which stream they originated from.
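
A sketch of this pattern. The two hashes are placeholders for streams you have already created, and the tag names are made up:

    // Placeholder hashes; substitute the hashes of your own streams
    tag "cars"  { stream "<hash of first stream>" }
    tag "bikes" { stream "<hash of second stream>" }

    return {
        stream "<hash of first stream>" OR stream "<hash of second stream>"
    }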

3.   Tags do not cascade. For example, we wrote this stream and DataSift gave it a hash of f2625cb00dcbbe4121b654d9d3e1f3aa.
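
A stream matching the description below could be sketched as follows. This is a reconstruction and not necessarily the exact code behind that hash:

    // Reconstructed from the prose description; keywords as listed there
    tag "Steve Jobs" { interaction.content contains "Steve Jobs" }

    return {
        interaction.content contains_any "Apple, Mac, iPad, Steve Jobs"
    }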

This stream finds all objects that mention Apple, Mac, iPad, or Steve Jobs. For the ones that contain Steve Jobs, it adds a "Steve Jobs" tag.

We can use the stream in a new stream and add new tags like this:
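
A sketch of such a stream, following the tag and stream syntax shown earlier in this section:

    // Adds a "test" tag to everything the earlier stream returns
    tag "test" { stream "f2625cb00dcbbe4121b654d9d3e1f3aa" }

    return {
        stream "f2625cb00dcbbe4121b654d9d3e1f3aa"
    }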

However, because tags do not cascade, the only tag in the output will be "test".


Stream keyword

Purpose

The stream keyword allows you to:

  • include an existing stream in a new stream definition.
  • merge two or more streams.

The syntax is:

    stream <hash>

where <hash> is the DataSift hash that identifies the stream to be included.

Synonym

    rule <hash>

Examples

1.  Take an existing stream and add an extra rule:
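
A sketch that reuses the stream hash from the tag examples in this guide and narrows it to Twitter content. The extra condition is illustrative, and it assumes that interaction.type holds the lowercase name of the source:

    // "twitter" is an assumed value of interaction.type
    stream "f0596f03644177b8bf5b59708a08bfe8" AND interaction.type == "twitter"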

2.   Merge three streams into one master stream and then add tags to identify which objects come from which streams.
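
A sketch with placeholder hashes and made-up tag names; substitute the hashes of your own streams:

    // Placeholder hashes and illustrative tag names
    tag "stream A" { stream "<hash A>" }
    tag "stream B" { stream "<hash B>" }
    tag "stream C" { stream "<hash C>" }

    return {
        stream "<hash A>" OR stream "<hash B>" OR stream "<hash C>"
    }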


Optimization

CSDL stream definitions can be complex. There are numerous ways to optimize them to increase speed and reduce costs.

Combining Multiple Filters into one "in" Statement

If you are testing several values against a single target and you are not doing anything very complex, it may be worth combining them together into an in statement to reduce costs and processing overheads. The in operator is heavily optimized.

For example, this stream definition:
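
A hypothetical definition of that kind, assuming interaction.type holds lowercase source names:

    // Source names are assumed values of interaction.type
    interaction.type == "twitter" OR interaction.type == "facebook" OR interaction.type == "youtube"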

could be much cheaper to run by using an in statement instead:
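
A sketch of an in statement that tests interaction.type against three values at once. The source names are assumed values:

    // One condition replaces three equality tests joined by OR
    interaction.type in "twitter,facebook,youtube"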

Or you can use the operator to combine multiple Twitter ids like this:
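
A sketch, assuming a target named twitter.user.id and a square-bracket list for numeric values. The ids are made up:

    // Assumed target name and list syntax; ids are illustrative
    twitter.user.id in [1234, 5678, 9012]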

Combining Multiple Filters into one "contains_any" Statement

It is possible to optimize multiple contains operations combined with OR operators into a contains_any operation.

For example, this stream definition:
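
A hypothetical definition of that kind, with illustrative keywords:

    // Three contains conditions joined by OR
    interaction.content contains "Audi" OR interaction.content contains "Jeep" OR interaction.content contains "Ford"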

can be optimized like this:
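
A sketch of a single contains_any condition over a comma-separated list of illustrative keywords:

    // One contains_any condition covers all three keywords
    interaction.content contains_any "Audi, Jeep, Ford"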


Commenting

You can add comments to your CSDL code using a style similar to the syntax used in C.

Comment out entire lines like this:
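
A sketch, assuming the double-slash comment form:

    // Find mentions of football
    interaction.content contains "football"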

Comment out part of a line like this:
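
A sketch, assuming the double-slash comment form, in which everything after the slashes is ignored:

    interaction.content contains "football" // everything after the slashes is a comment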

Or add a mid-line comment like this: 
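
A sketch, assuming the slash-star comment form:

    interaction.content /* the text of the post */ contains "football"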

Note: in most cases a stream's hash changes whenever you edit the CSDL for that stream. However, adding comments to CSDL does not cause the hash to change because comments are normalized out when the stream is compiled.

CSDL accepts comments anywhere it accepts whitespace. Don't insert comments within quoted strings, because the CSDL parser disables whitespace skipping there.

Whitespace in the CSDL compiler

Our definition of whitespace is derived from the isspace() function in C. We use:

  • space
  • tab
  • newline
  • form feed
  • carriage return

Word Matching using Contains/Any

DataSift tokenizes each interaction target. It ignores white space but treats every punctuation symbol (as defined in ispunct()) as a separate word.

For example, if the input is:

    This is,    a test    

The output is:

    <This> <is> <,> <a> <test>

Using this technique, we can match words without allowing punctuation to affect the boundary of what is considered a word, while still allowing you to include punctuation in a filter when you want to, because the punctuation is not stripped from the text.


Tokenization and the CSDL Engine

This page is for advanced DataSift users. The filtering engine that runs your CSDL code is tuned for efficiency in order to handle massive volumes of data. To accomplish this, it performs some preliminary steps, temporarily stripping all @mentions and URLs from posts and then tokenizing the remaining content into "chunks" before it performs any filtering.

Let's look at these processes in turn.

@Mentions

DataSift's filtering engine ignores all mentions of Twitter users in Tweets. Therefore, it would see this Tweet:

    If you like @DataSift you have to follow @DataSiftNews and @DataSiftDev every day.

as this:

    If you like you have to follow and every day.

Of course, this means that you cannot write a CSDL filter like this to look for mentions of @DataSift:
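
A sketch of a filter that would fail for this reason:

    // Will not match: the mention is stripped before filtering
    interaction.content contains "@DataSift"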

 

Instead, use the twitter.mentions target like this:
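
A sketch, assuming the screen name is given without the leading @ sign and that the in operator applies to this target:

    // Assumed argument form: screen name without the @ sign
    twitter.mentions in "DataSift"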

 

 

URLs

URLs are also invisible to the filtering engine. DataSift sees this Tweet:

    If you like @News surf to http://nytimes.com.

as this:

    If you like surf to.

Consequently, you cannot write a CSDL filter like this one for URLs:
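
A sketch of a filter that would fail for this reason:

    // Will not match: the URL is stripped before filtering
    interaction.content contains "nytimes.com"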

 

Instead, you need to use a Links Augmentation target like this:
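
A sketch, assuming the Links Augmentation exposes a target named links.domain. That name is not given in this guide, so check the target reference for the exact names available:

    // Assumed target name: links.domain
    links.domain == "nytimes.com"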

 

 

Cashtags

DataSift uses 'cashtags' to allow you to filter for stock ticker symbols; for example:
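
A sketch using Apple's ticker symbol as an illustrative value:

    // $AAPL is an illustrative cashtag
    interaction.content contains "$AAPL"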

This works because we have chosen to treat the $ symbol as an alphabetical symbol, not punctuation.

 

Chunking

The engine also separates the characters it sees into "chunks". Characters can be:

  • alphanumeric
  • whitespace
  • punctuation

Let's look at alphanumeric and whitespace characters first. Every time DataSift finds one or more whitespace characters between alphanumeric strings, it starts a new chunk. For example, it would split this string into three chunks:

    The Polar Express

The chunks are:

  • The
  • Polar
  • Express

It also treats punctuation as a chunk so, if we add a period at the end of our text, DataSift would see four chunks:

    The Polar Express.

Now the chunks are:

  • The
  • Polar
  • Express
  • .

DataSift groups alphanumeric characters together but treats punctuation characters individually, so this Tweet:

    The Polar Express...

comprises six chunks; the final three of them are simply periods:

  • The
  • Polar
  • Express
  • .
  • .
  • .

Punctuation is:

  Name                              Symbol
  Exclamation mark                  !
  Double quotes                     "
  Hash                              #
  Percent                           %
  Ampersand                         &
  Single quote                      '
  Open parenthesis                  (
  Close parenthesis                 )
  Asterisk                          *
  Plus                              +
  Comma                             ,
  Dash                              -
  Period                            .
  Forward slash                     /
  Colon                             :
  Semicolon                         ;
  Less than                         <
  Equals                            =
  Greater than                      >
  Question mark                     ?
  At                                @
  Open square bracket               [
  Backslash                         \
  Close square bracket              ]
  Caret                             ^
  Underscore                        _
  Backtick or grave accent          `
  Open curly bracket                {
  Pipe                              |
  Close curly bracket               }
  Tilde                             ~
  Left double quote                 “
  Reversed double prime quote       ‵‵
  CJK reversed double prime quote   〝
  Right double quote                ”
  Double prime quote                ″
  CJK double prime quote            〞
  Low double prime quote            〟
  CJK ditto mark                    〃
  CJK left corner bracket           「
  Left ceiling                      ⌈
  CJK right corner bracket          」
  Right floor                       ⌋
  CJK left white corner bracket     『
  CJK right white corner bracket    』

 

How does chunking affect my filters?

Suppose that you want to search for Tweets that contain "50%"; that is, the number 50 followed immediately by a percent sign. The string would be represented by two chunks:

  • 50
  • %

Thus, a filter such as this one:
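
A sketch of such a filter, using the interaction.content target seen elsewhere in this guide:

    // Matches the chunk 50 followed by the chunk %
    interaction.content contains "50%"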

would match "50%" but it would also match "50 %" (note the inclusion of a space) because DataSift recognizes chunks even if they have whitespace between them.

In other words, the filter would match any of these Tweets:

 

    I scored 50%

    I bought 50 %AWESOME%

    50 %25

 

So the filter would return false positives. How can we get around this problem? How can we construct a better filter? Using contains and substr together we can build a robust, reliable filter like this:
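
A sketch of the combined filter, assuming substr is written like the other string operators. The contains condition requires 50 to be a whole chunk followed by a % chunk, and the substr condition requires the two to sit next to each other in the raw text:

    // contains checks the chunks; substr checks that no whitespace separates them
    interaction.content contains "50%" AND interaction.content substr "50%"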

Why can't we use substr alone? Because substr matches raw substrings, so it would return any Tweet that contained 50%, 150%, 250%, and so on.
