Language Guide

Introduction

DataSift allows you to filter in real time for what you need from the torrent of data flooding out of social media sites. You can sift by specifying conditions on any of the attributes present in the raw data objects, including all their metadata. You can refine the process by adding conditions on an array of supplementary features relating to profile, context, and other analysis functions such as sentiment and language. Furthermore you can combine elementary conditions or complete rules by logical composition to whatever degree of complexity you require.

To make the expression of your filter clear and concise we have developed our own curation language called the Curated Stream Definition Language, CSDL. Using the CSDL you can express powerful filters of unlimited precision and you can even express your own rules to enrich your selected data with your own feature tags.

Here's an example, just for illustration, of a complex filter that you could build with only four lines of CSDL code: imagine that you want to look at information from Twitter that mentions the iPad. Suppose you want to include content written in English or Spanish but exclude any other languages, select only content written within 100 kilometers of New York City, and exclude Tweets that have been retweeted fewer than five times. You can write that in just four lines of CSDL!

Filtering

CSDL allows you to:

  • define filtering conditions that specify what information your stream will include
  • augment the objects in the stream with data from third-party sites (for example, adding the author's gender)
  • augment the objects in the stream based on analysis of the content (for example, detecting positive or negative sentiment)
  • augment the objects in the result stream with metadata tags according to your own rules (for example, adding an "Apple" tag to objects that include mention of "iPad" or "Mac").

A simple filtering condition usually has three elements.

    target + operator + argument

The target shows DataSift where to look for the information you need. For example, the target might be the 140-character string containing a Tweet, the sentiment, positive or negative, expressed in some Myspace content, or the language in which a post is written.

The operator defines the comparison DataSift should make. For example, it might search for Tweets from authors who have more than 100 followers. The operator here would be greater than. It might search for a Tweet that mentions football. The operator in this case would be contains. Or it might search for messages that include geographical information. This would employ the exists operator.

More complex filters can be built by combining simple filters using the logical operators: AND, OR, and NOT.

Once a result stream has been defined through a filter, you can reuse it as part of the definition of another stream by using the stream keyword to combine it with further filtering conditions.

The argument can be any value, as long as the type of the argument matches the type of the target.
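
As a rough illustration of the target + operator + argument structure, the following Python sketch models a condition as a small function over a dictionary. It is not DataSift's implementation and it is not CSDL: the condition helper, the operator table, the field names, and the sample object are all assumptions made for this example, and contains is simplified to a case-insensitive substring test.

```python
import operator

# Hypothetical operator table; the names mirror the operators mentioned above.
OPERATORS = {
    "greater than": operator.gt,
    "contains": lambda value, arg: arg.lower() in value.lower(),
    "exists": lambda value, arg: True,  # reaching the operator means the target is present
}

def condition(target, op, argument=None):
    """Build a predicate from a target, an operator name, and an argument."""
    test = OPERATORS[op]

    def check(obj):
        if target not in obj:  # in this model a missing target never matches
            return False
        return test(obj[target], argument)

    return check

# Three simple conditions ...
popular = condition("followers", "greater than", 100)
football = condition("content", "contains", "football")
has_geo = condition("geo", "exists")

# ... combined with the logical operators AND and NOT.
def combined(obj):
    return popular(obj) and football(obj) and not has_geo(obj)

tweet = {"content": "Football tonight!", "followers": 250}
print(combined(tweet))  # True
```

Each predicate takes one object and returns True or False, so larger filters are just Boolean combinations of smaller ones.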

 

Note: CSDL allows you to create large and powerful filters. Each one can be up to 1 MB.

 

Tagging

Tagging, enabled by DataSift VEDO, allows you to add your own metadata to interactions. For example, brokers monitoring a set of stocks could filter for the names of any of the companies involved and add the appropriate stock ticker as a tag.

Adding tags based on conditional rules gives you the power to code business rules directly into your CSDL code. Since you are free to choose your own tags, you can build taxonomies to fit your business, specific to the problem you're working on. Tags allow you to deliver interactions with custom metadata to your applications.

 


Advanced Features

CSDL comes with a range of advanced features that allow much more complex filtering through the ability to include parent streams and also use our data processing to perform 'tagging' of content to remove the burden of doing it on the client side.

There's only one limit and most developers are unlikely to run into it: the CSDL code in your filter must not exceed 1 MB. Fortunately, there's an easy workaround that employs the stream keyword. If your stream definition grows beyond 1 MB, you can have it call one or more other streams. In this way, you can distribute your code across many chunks of CSDL.

Here are some additional keywords and techniques you can employ to optimize streams, reduce costs, and create even more complex and powerful filters.

  

Tag keyword  

Use the tag keyword to add metadata to the output objects coming from a stream. Advanced CSDL developers employ the tag keyword very frequently because it makes subsequent analysis easy.

 

Stream keyword

Use the stream keyword to include an existing stream in a new stream definition or merge two or more streams.

 

Regular expressions

Use regular expressions to create super-powerful stream definitions.

 

Selecting data sources  

You can select or exclude individual social media sites in a variety of ways. 

 

Optimization

You can optimize your CSDL code to run more efficiently and to minimize costs.

 

Sampling data

You can sample data rather than drinking from the entire firehose. For instance, you can create a filter that samples just a percentage of the input objects flowing into DataSift. This approach is particularly useful if you're performing statistical analysis where, for example, just 10 percent of the data might be enough to form a representative sample.
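
The idea behind percentage sampling can be sketched in a few lines of Python. This is only a model of the concept, not the way DataSift implements sampling: the sample function and its parameters are hypothetical, and each object is simply given a random number between 0 and 100 and kept when that number falls below the requested percentage.

```python
import random

def sample(objects, percent, rng=None):
    """Keep roughly `percent` percent of the objects, chosen at random."""
    rng = rng or random.Random()
    return [obj for obj in objects if rng.random() * 100 < percent]

# A fixed seed makes this run repeatable; a live stream would not be seeded.
kept = sample(range(100_000), 10, random.Random(42))
print(len(kept) / 100_000)  # close to 0.10
```

Because every object is tested independently, the fraction kept is only approximately 10 percent, which is usually acceptable for statistical work.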

 


Selecting or Excluding Data Sources

There are two simple ways to choose which sites to accept data from: either name the sources explicitly or use the Common target.

Name the Sources

You can name the source explicitly; for example:

This works for multiple sources but it can become cumbersome:

Use a Common Target

Include sources with interaction.type:

Exclude sources with interaction.type:


Stream keyword

Purpose

The stream keyword allows you to:

  • include an existing stream in a new stream definition.
  • merge two or more streams.

The syntax is:

    stream <hash>

where <hash> is the DataSift hash that identifies the stream to be included.

Examples

1.  Take an existing stream and add an extra rule:

2.   Merge three streams into one master stream and then add tags to identify which objects come from which streams.


Optimization

CSDL stream definitions can be complex. There are numerous ways to optimize them to increase speed and reduce costs.

Combining Multiple Filters into one "in" Statement

If you are testing several values against a single target and you are not doing anything very complex, it may be worth combining them together into an in statement to reduce costs and processing overheads. The in operator is heavily optimized.

For example, this stream definition:

could be much cheaper to run by using an in statement instead:

Or you can use the in operator to combine multiple Twitter IDs like this:

Combining Multiple Filters into one "contains_any" Statement

It is possible to optimize multiple contains operations combined with OR operators into a contains_any operation.

For example, this stream definition:

can be optimized like this:


Classic Punctuation Character Set

The Classic punctuation character set is:


Extended Punctuation Character Set

The Extended punctuation character set contains all members of the Unicode punctuation families of characters: Pc, Pd, Pe, Pf, Pi, Po, Ps, Sk, Sm, and So.


CSDL Notes

Using != and NOT

CSDL offers two different methods of negation. One uses the != operator:

    twitter.source != "web"

The other uses the NOT logical operator:

    NOT twitter.source == "web"

 

At first glance, these appear to perform identically. However, there is one important difference. The twitter.source target is not always populated and we need to consider what happens in DataSift's filtering engine if this target is missing. It's easy to check which targets are always populated and which ones might sometimes be unpopulated. Just go to the documentation for the target you're using and look in the top right corner of the page:

 

Inside the filtering engine, an operator returns a value of True when it finds a match or False otherwise. If, overall, the result is True for an interaction, we deliver that interaction to you. Suppose, for a particular Tweet, twitter.source exists and contains the value "iPhone". In this case:

    This filter:                   Returns this value internally:
    twitter.source != "web"        True
    twitter.source == "web"        False
    NOT twitter.source == "web"    True

Now, let's look at what happens when twitter.source is unpopulated: 

    This filter:                   Returns this value internally:
    twitter.source != "web"        False
    twitter.source == "web"        False
    NOT twitter.source == "web"    True

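Since twitter.source does not exist in this interaction, neither != nor == can find a match, so both return False. The NOT operator then inverts the False returned by == and delivers True. Clearly, the two filters are different. So, for targets that might sometimes be unpopulated, it's best to filter using the != operator, unless you also want to receive interactions in which the target is missing.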

This behavior is not restricted to the != operator, of course. You would see the same problem with the contains operator, for example. The twitter.user.description target holds a user's 160-character biography but it can be blank. The following filter appears to match every Tweet from a user who does not include the word "data" in their bio but, in fact, it also matches Tweets from users who leave their bio blank:

    NOT twitter.user.description contains "data"
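
The two truth tables above can be reproduced with a small Python model. This is a sketch of the behavior described on this page, not DataSift's filtering engine: MISSING is a hypothetical stand-in for an unpopulated target, and the function names are chosen only for this example.

```python
MISSING = object()  # stands in for an unpopulated target

def equals(value, argument):
    return value is not MISSING and value == argument

def not_equals(value, argument):
    # Like every operator in this model, != cannot match a missing target.
    return value is not MISSING and value != argument

def contains(value, argument):
    return value is not MISSING and argument in value

def logical_not(result):
    # NOT simply inverts whatever the inner condition returned.
    return not result

for source in ("iPhone", MISSING):
    print(not_equals(source, "web"), logical_not(equals(source, "web")))
# prints:
# True True
# False True
```

The second line of output is the case to watch: with the target missing, the != form rejects the interaction while the NOT form accepts it. The same applies to NOT combined with contains.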


Commenting

You can add comments to your CSDL code using a style similar to the syntax used in C.

Comment out entire lines like this:

Comment out part of a line like this:

Or add a mid-line comment like this: 

Note: in most cases a stream's hash changes whenever you edit the CSDL for that stream. However, adding comments to CSDL does not cause the hash to change because comments are normalized out when the stream is compiled.

CSDL accepts comments anywhere it accepts whitespace. Don't insert comments within quoted strings because the CSDL parser disables whitespace skipping there.

Whitespace in the CSDL compiler

Our definition of whitespace is derived from the isspace() function in C. We use:

  • space
  • tab
  • newline
  • form feed
  • carriage return

Word Matching using Contains/Any

DataSift tokenizes each interaction target. It ignores white space but treats every punctuation symbol (as defined in ispunct()) as a separate word.

For example, if the input is:

    This is,    a test    

The output is:

    <This> <is> <,> <a> <test>

Using this technique we can match words without allowing punctuation to affect the boundary of what is considered a word, while still allowing you to include punctuation in a filter when you want to, because the punctuation is not stripped from the text.
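
The tokenization described above can be approximated in Python. This is a minimal sketch rather than DataSift's tokenizer: the tokenize and contains_word helpers are hypothetical, Python's string.punctuation is used because it lists the same ASCII characters that ispunct() accepts in the C locale, and matching is made case-insensitive purely for convenience.

```python
import string

def tokenize(text, punctuation=string.punctuation):
    """Split text into words, emitting each punctuation character as its own word."""
    tokens, word = [], ""
    for ch in text:
        if ch.isspace() or ch in punctuation:
            if word:               # a separator ends the word being built
                tokens.append(word)
                word = ""
            if ch in punctuation:  # punctuation is kept, whitespace is dropped
                tokens.append(ch)
        else:
            word += ch
    if word:
        tokens.append(word)
    return tokens

def contains_word(text, word):
    """Whole-word match against the token list (case-insensitive in this sketch)."""
    return word.lower() in [t.lower() for t in tokenize(text)]

print(tokenize("This is,    a test    "))       # ['This', 'is', ',', 'a', 'test']
print(contains_word("This is, a test", "is"))   # True
print(contains_word("This is, a test", "tes"))  # False
```

The comma after "is" does not stop "is" from matching as a word, and a partial word such as "tes" does not match at all.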


Tagging and Scoring

Tagging and scoring, enabled by DataSift VEDO, allow you to add your own metadata to interactions. As a simple example, brokers monitoring a set of stocks could filter for the names of any of the companies involved and add the appropriate stock ticker as a tag.

You can view pre-built examples of tags in our library. They're ready to use.

Tagging brings significant benefits because you can:

  • use the full power of CSDL, adding tags based on conditional rules. This allows you to code business rules directly into your CSDL code.
  • choose tags that are meaningful to your business and to the specific situation you are building a filter for. In this way you can build taxonomies to fit your business.
  • deliver interactions with custom metadata to your applications.
  • use scoring to create powerful decision-making logic.
  • reuse definitions across projects; write once, use anywhere.

Tagging takes place immediately after filtering in DataSift. It's very easy to use and there are many different ways you can employ it.

    This page:                 Explains how to:
    Applying tags              Add tags to a filter.
    Applying scores            Add scores to interactions and aggregate the results of the scores.
    Adding namespaces          Structure your tagging into namespaces.
    Reusing tag definitions    Reuse and combine tag definitions.

 


Adding Namespaces

Tags can belong to namespaces, which are analogous to folders in a file system: they provide a way for you to add structure and grouping, they can fit inside each other as many levels deep as you want, and you can call them whatever you choose.

Here's an example that does not have any namespaces yet:

You can modify it like this to place all the tags in a namespace called "device":

If an input object looks like this:

The output will look like this:

 

For reasons of brevity, these JSON objects are shown here as fragments with only the relevant elements visible.

Using multiple levels of namespacing

You can use as many levels of namespace as you want.

For this input:

The output will be:

 

Adding a namespace when you include tags

The previous example showed how to specify a namespace at the time you defined tags. You can also add a namespace to a set of tag definitions when you include them in a filter. This tag definition does not have any namespaces yet:

The hash of this tag definition is 7fe07f1488f9bfb0c251a1a79a4ec6fa. To add these predefined tags to a namespace called "device.model" at the time you include the tag definitions:

It does not matter whether we add it at the time we define the tags or when we include them. The resulting JSON is the same.
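
To see why the result is the same either way, here is a small Python model of how dotted namespaces could map onto nested output. It is a sketch only: the qualify and place helpers, the "iPhone" tag value, and the exact shape of the nested dictionary are assumptions made for illustration, not DataSift's actual output format.

```python
def qualify(include_namespace, defined_namespace):
    """Join the namespace given at include time with the one given at definition time."""
    return ".".join(part for part in (include_namespace, defined_namespace) if part)

def place(tree, namespace, value):
    """Store a tag value under a dotted namespace, creating nested levels as needed."""
    node = tree
    *parents, leaf = namespace.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node.setdefault(leaf, []).append(value)
    return tree

# Namespace set when the tags are defined ...
defined = place({}, qualify("", "device.model"), "iPhone")
# ... or added when the tag definitions are included.
included = place({}, qualify("device.model", ""), "iPhone")

print(defined)              # {'device': {'model': ['iPhone']}}
print(defined == included)  # True
```

In this model the namespace supplied at include time always goes in front, so qualify("device", "model") yields "device.model".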

 

Adding a namespace to imported tags that already have namespaces

This tag definition belongs to a namespace called "model":

You can include it in a namespace called "device" like this:

 

The JSON output combines both namespaces. DataSift places tags into the "model" namespace and places that in the "device" namespace. Again the output is:

 

Note the order of the namespacing here: tag.device.model.

In other words, it's just as if you had defined your original tags like this:

 


Applying Scores

When you apply tags to interactions, you're adding text labels as metadata. Alternatively, you can use scoring to add numerical values, again based on one or more conditional tests. For example, here are four tag definitions in the same namespace:

 

All of these scoring rules affect the final value of tag.sales.lead_score, either adding to it or subtracting from it. For instance, suppose an interaction contains this text:

    I might buy one, but they look quite expensive

The first of these scoring rules matches "buy" so we add 20 to the score. The second rule does not match but the third matches "expensive" and so we subtract 5. Since the last rule does not match anything, the overall score is 15.
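
The arithmetic in this walkthrough can be checked with a short Python sketch. The rule list is partly hypothetical: only the +20 for "buy" and the -5 for "expensive" come from the description above, while the second and fourth rules, the lead_score function, and the simple word-splitting are invented for illustration.

```python
# Hypothetical rules: (score delta, keyword). Only the +20 for "buy" and the
# -5 for "expensive" come from the text; the other two rules are invented.
RULES = [
    (20, "buy"),
    (30, "order"),
    (-5, "expensive"),
    (-20, "refund"),
]

def lead_score(text):
    """Sum the deltas of every rule whose keyword appears as a word in the text."""
    words = text.lower().replace(",", " ").split()
    return float(sum(delta for delta, keyword in RULES if keyword in words))

print(lead_score("I might buy one, but they look quite expensive"))  # 15.0
```

The result is returned as a float to mirror the point that scores are stored as floating point numbers.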

In this case, all the scores are integer values but they are stored as floating point numbers so you can use decimals in your CSDL. We support accuracy to 13 significant figures, although numbers are held internally to greater accuracy to ensure there are no rounding errors.

 

Understanding the output

If we apply tags to these two input objects:

The output looks like this:

For reasons of brevity, these JSON objects are shown here as fragments with only the relevant elements visible.


Applying Tags

Tagging allows you to add your own metadata to interactions before they reach your application.

To add tags to any CSDL filter, first wrap the filtering code in a return statement. Then add the tagging code. The tagging code has to appear before the return statement; otherwise the filter will not compile. The tags can be any string values you choose:

 

Defining tags

You can apply any tag to an interaction just once. For example, this code is fine:

But this code will not compile because it attempts to assign a value to the same tag twice:

Instead, the compiler will insist on code such as this, which assigns the tag value only once:

 

Understanding the output

For the simple tagging described on this page, if the input contains these two interactions:

 

The output looks like this. Notice that the tags appear as an array of strings beneath interaction.tags:

For reasons of brevity, these JSON objects are shown here as fragments with only the relevant elements visible.

 

How many tags can I use?

You can currently use up to 10,000 tags in each of your filters. It does not matter whether you define them locally or include them from external definitions. We may adjust the limit from time to time, either up or down, depending on the load on DataSift's platform.

 


Reusing Tag Definitions

You can define your tags separately from your filters. For example, you can define a set of tags once and then reuse them in multiple filters, or use several different sets of tags in one filter.

Including tags in a new filter

To include a set of tags in a filter:

  1. Create standalone tag definitions using the tag keyword.
  2. Compile the tags.
  3. Make a note of the hash for that definition.
  4. Write your filter.
  5. Use the tags keyword to include them in the CSDL code that defines a filter.
  6. Compile the filter, either in DataSift's user interface or via an API call to the /compile endpoint.

 

Here's an example of a tags definition:

There is no return statement here and no filtering logic. These tags are self-contained and they will work with any filter.

The hash for this tag definition is a8979ef1255a7d511998553f2f22819d and we can include it in another filter using the tags keyword.

The tags keyword has to appear before the return statement.

If you change the tag definition in any way, its hash will change. At that point, you can either continue to use the old hash, and hence the old definition of the tags, or use our new tag definitions by updating the hash in the tags command:

    tags "15068d2476bedd162530702700d63547"

Make sure that any definitions you include using the tags keyword contain nothing but tag definitions; you cannot use this keyword to include code that has a filter definition.

 

Including tags into a filter that already has tags

You can use the tags keyword to include tags in a filter that already has tags defined locally. DataSift applies both sets of tags to the interactions.

 

 

Including tags in multiple filters

You can include a tag definition in any number of filters.

You can include the tags here:

You can use them here too:

Notice that this filter includes external tag definitions and also has local tag definitions. If you need to do this, include the external tags before the local tag definitions.

 

Including multiple sets of tag definitions

You can include more than one set of tag definitions; for example, if you define these tags:

And these tags:

You can include both sets of definitions in one filter:


Tokenization and the CSDL Engine

To make DataSift's filtering engine as accurate as possible, we perform a number of pre-processing steps such as chunking, pre-indexing, and detection of hashtags and cashtags in Tweets. You can customize some of those processes to tune DataSift to match your needs.

Here's what you need to know about:


Chunking


The engine also separates the characters it sees into "chunks". Characters can be:

  • alphanumeric
  • whitespace
  • punctuation

Let's look at alphanumeric and whitespace characters first. Every time DataSift finds one or more whitespace characters between alphanumeric strings, it starts a new chunk. For example, it would split this string into three chunks:

    The Polar Express

The chunks are:

  • The
  • Polar
  • Express

It also treats punctuation as a chunk so, if we add a period at the end of our text, DataSift would see four chunks:

    The Polar Express.

Now the chunks are:

  • The
  • Polar
  • Express
  • .

DataSift groups alphanumeric characters together but treats punctuation characters individually, so this Tweet:

    The Polar Express...

comprises six chunks; the final three of them are simply periods:

  • The
  • Polar
  • Express
  • .
  • .
  • .

Punctuation is:

  Name Symbol
  Exclamation mark !
  Double quotes "
  Hash #
  Percent %
  Ampersand &
  Single quote '
  Open parenthesis (
  Close parenthesis )
  Asterisk *
  Plus +
  Comma ,
  Dash -
  Period .
  Forward slash /
  Colon :
  Semicolon ;
  Less than <
  Equals =
  Greater than >
  Question mark ?
  At @
  Open square bracket [
  Backslash \
  Close square bracket ]
  Caret ^
  Underscore _
  Backtick or grave accent `
  Open curly bracket {
  Pipe |
  Close curly bracket }
  Tilde ~
  Left double quote
  Reversed double prime quote ‵‵
  CJK reversed double prime quote
  Right double quote
  Double prime quote
  CJK double prime quote
  Low double prime quote
  CJK ditto mark
  CJK left corner bracket
  Left ceiling
  CJK right corner bracket
  Right floor
  CJK left white corner bracket
  CJK right white corner bracket

 

How does chunking affect my filters?

Suppose that you want to search for Tweets that contain "50%"; that is, the number 50 followed immediately by a percent sign. The string would be represented by two chunks:

  • 50
  • %

Thus, a filter such as this one:

would match "50%" but it would also match "50 %" (note the inclusion of a space) because DataSift recognizes chunks even if they have whitespace between them.

In other words, the filter would match any of these Tweets:

 

    I scored 50%

    I bought 50 %AWESOME%

    50 %25

 

So the filter would return false positives. How can we get around this problem? How can we construct a better filter? Using contains and substr together we can build a robust, reliable filter like this:

Why can't we use substr alone? Because it would return any Tweet that contained 50%, 150%, 250%, and so on.
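
The reasoning above can be tested with a Python model of the two operators. This is a sketch under stated assumptions, not DataSift's engine: the tokenize, contains, substr, and fifty_percent helpers are hypothetical, ASCII punctuation stands in for the punctuation table, and the two conditions are assumed to be combined with AND.

```python
import string

def tokenize(text):
    """Split on whitespace and emit each punctuation character as its own chunk."""
    chunks, word = [], ""
    for ch in text:
        if ch.isspace() or ch in string.punctuation:
            if word:
                chunks.append(word)
                word = ""
            if not ch.isspace():
                chunks.append(ch)
        else:
            word += ch
    if word:
        chunks.append(word)
    return chunks

def contains(text, phrase):
    """True if the chunks of `phrase` appear consecutively among the chunks of `text`."""
    hay, needle = tokenize(text), tokenize(phrase)
    return any(hay[i:i + len(needle)] == needle
               for i in range(len(hay) - len(needle) + 1))

def substr(text, fragment):
    """Plain substring test, with no chunking."""
    return fragment in text

def fifty_percent(text):
    # Chunk match rules out "150%"; substring match rules out "50 %".
    return contains(text, "50%") and substr(text, "50%")

for tweet in ["I scored 50%", "I bought 50 %AWESOME%", "50 %25", "I scored 150%"]:
    print(tweet, contains(tweet, "50%"), substr(tweet, "50%"), fifty_percent(tweet))
```

Only the first Tweet passes both tests. The second and third pass the chunk-based match but fail the substring match, and the fourth does the opposite.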


Controlling Case Sensitivity

There are two ways to control case sensitivity in CSDL:

  • The cs modifier
  • The case switch

 

Both of these work at the operator level, giving you very fine control over the way your filters operate. The first example on this page uses a combination of a filter that is case sensitive and one that is not.

If you turn case sensitivity on, the CSDL engine will pay attention to the case of your argument. For example, if you filter for "Polish" with case sensitivity turned on, the engine will ignore "polish". If you turn case sensitivity off, the engine ignores the case of your argument.

cs

Here's an example with the cs modifier:

Case

Alternatively, you can use the case switch. For example:

Summary

There are a few key points to keep in mind:

  • Set case to "true" if you need your filter to be case sensitive.
  • If you omit the case switch or if you set it to "false", the operator will treat uppercase and lowercase identically.
  • If cs is present, the operator will use case-sensitive mode regardless of the setting of the case switch.
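
The way the two settings interact can be summarized in a few lines of Python. This is a model of the rules listed above, not CSDL syntax: the matches function and its cs and case parameters are hypothetical stand-ins for the cs modifier and the case switch, and a plain substring test stands in for the operator.

```python
def matches(content, argument, cs=False, case=False):
    """Substring match; case-sensitive when either `cs` or `case` is set."""
    if cs or case:
        return argument in content
    return argument.lower() in content.lower()

text = "I need some shoe polish"
print(matches(text, "Polish"))                       # True: case is ignored by default
print(matches(text, "Polish", case=True))            # False: case switch turned on
print(matches(text, "Polish", cs=True, case=False))  # False: cs overrides the case switch
```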

Controlling Punctuation Characters

The contains, contains_any, and contains_near operators can take an optional switch called keep.

The keep switch allows you to choose which characters are treated as punctuation. We provide the following pre-configured punctuation sets:

If DataSift treats a character as punctuation, the tokenizer will see it as a separate word. For example, a character called the Arabic full stop is a member of the extended punctuation character set. So, if your CSDL includes [keep(extended)], and you are filtering for abcd followed by an Arabic full stop, the tokenizer will see that as two words: abcd is the first word, and the Arabic full stop is the second word.

On the other hand, if you select [keep(classic)] in your code, DataSift will not treat the Arabic full stop as punctuation because it is not in the classic punctuation character set. In that case, if you filter for abcd followed by an Arabic full stop, the tokenizer will see that as one word of five characters.
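
The difference between the two sets can be demonstrated with Python's unicodedata module. This is a sketch, not DataSift's tokenizer: the helper names are hypothetical, ASCII punctuation is only an assumed stand-in for the classic set, and the extended set is modelled with the Unicode categories Pc, Pd, Pe, Pf, Pi, Po, Ps, Sk, Sm, and So.

```python
import string
import unicodedata

EXTENDED_CATEGORIES = {"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps", "Sk", "Sm", "So"}

def is_classic(ch):
    # Assumption: ASCII punctuation stands in for the classic set.
    return ch in string.punctuation

def is_extended(ch):
    return unicodedata.category(ch) in EXTENDED_CATEGORIES

def tokenize(text, is_punctuation):
    """Split on whitespace; each character accepted by `is_punctuation` becomes a word."""
    words, word = [], ""
    for ch in text:
        if ch.isspace() or is_punctuation(ch):
            if word:
                words.append(word)
                word = ""
            if not ch.isspace():
                words.append(ch)
        else:
            word += ch
    if word:
        words.append(word)
    return words

ARABIC_FULL_STOP = "\u06d4"
sample = "abcd" + ARABIC_FULL_STOP
print(len(tokenize(sample, is_extended)))  # 2
print(len(tokenize(sample, is_classic)))   # 1
```

With the extended test the Arabic full stop is split off as its own word. With the classic test it stays attached, giving one word of five characters.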

The syntax is:


Hashtags and Cashtags

Hashtags

Hashtags frequently appear in Tweets. They're easy to identify because the first character is always #. For example:

    I found an awesome Big Data platform called #DataSift.

In this example, the hashtag is #DataSift. Because the hash symbol is classed as punctuation, hashtags are treated as separate words, so when you filter for "#apple" you are looking for the word "#" followed by the word "apple", with zero or more whitespace characters in between.

There are several different ways to filter on hashtags. You could write:

This will match any interactions containing "#DataSift", or "# <whitespace> DataSift", as the "#" symbol is tokenized as a separate word.

Alternatively, you could write:

This will match any interactions containing "#DataSift" or "DataSift". As in the above example, the "#" symbol is tokenized as a separate word, so filtering for your hashtag without the "#" symbol will still match this hashtag.

Or you could use our twitter.hashtags target, which filters on hashtags in Tweets and ignores everything else in Tweets. The following piece of CSDL would therefore only match Tweets with the #DataSift hashtag:

 

Cashtags

DataSift uses 'cashtags' to allow you to filter for stock ticker symbols; for example:

This works because we have chosen to treat the $ symbol as an alphabetical symbol, not punctuation.
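
The contrast between hashtags and cashtags comes down to whether the leading symbol is tokenized as punctuation. The Python sketch below models that behavior. It is not DataSift's tokenizer: the helper names are hypothetical, ASCII punctuation with "$" removed is an assumed stand-in for the punctuation set, "$AAPL" is just a sample symbol, and matching is case-insensitive for convenience.

```python
import string

# Assumption: ASCII punctuation, minus "$", which is treated as a letter.
PUNCTUATION = set(string.punctuation) - {"$"}

def tokenize(text):
    """Split on whitespace and emit each punctuation character as its own word."""
    words, word = [], ""
    for ch in text:
        if ch.isspace() or ch in PUNCTUATION:
            if word:
                words.append(word)
                word = ""
            if not ch.isspace():
                words.append(ch)
        else:
            word += ch
    if word:
        words.append(word)
    return words

def contains(text, phrase):
    """True if the words of `phrase` appear consecutively in `text`, ignoring case."""
    hay = [w.lower() for w in tokenize(text)]
    needle = [w.lower() for w in tokenize(phrase)]
    return any(hay[i:i + len(needle)] == needle
               for i in range(len(hay) - len(needle) + 1))

print(tokenize("#DataSift"))  # ['#', 'DataSift']
print(tokenize("$AAPL"))      # ['$AAPL']
print(contains("a platform called # DataSift", "#DataSift"))  # True
print(contains("a platform called #DataSift.", "DataSift"))   # True
print(contains("shares of AAPL rose", "$AAPL"))               # False
```

A hashtag splits into two words, so filtering for the bare word still matches it. A cashtag stays as one word, so the bare ticker and the cashtag are distinct.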

 


Japanese Chunking

DataSift can automatically inject whitespace into Japanese text to divide it into chunks before filtering. This pre-processing step allows our filtering engine to perform word or phrase matching correctly. Without this feature, you would be restricted to simple substring searches.

 

Syntax

You can use Japanese chunking with the contains and contains_any operators by adding an optional switch called "language". For example:

       interaction.content contains [language(ja)] "データシフト"

 

If you want to skip automatic chunking, use:

       interaction.content contains [language(none)] "データシフト"

 

By the way, "データシフト" is the Japanese translation of DataSift.

 

Examples

1.  Filter for a string. Here, we'll use the name of a Japanese TV show:

 

In this example, DataSift adds whitespace between the second and third characters because the title of the show "半沢直樹" is also the name of a person. The first two characters "半沢" are the last name and the remaining characters "直樹" are the first name.

The argument is the name of a new TV show so it is something that DataSift has not encountered before. Nevertheless, the chunking algorithm is powerful enough to handle this new "word" effectively.

 

2.  Filter for comments from the youth demographic by recognizing commonly encountered Internet slang:

 

The string "まじ" means "seriously" in Japanese. This keyword is often used by young people, and it frequently modifies another term. For example, "seriously interesting" is "まじ面白い".

The word "まじない" (which is Japanese for "magic spell") happens to contain the string "まじ" but the chunking algorithm is intelligent enough to leave "まじない" as a single chunk.

 

3.  Combining these filters and adding tags, we have:

 


Mandarin Chunking

DataSift can automatically inject whitespace into Mandarin Chinese text to divide it into chunks before filtering. This pre-processing step allows our filtering engine to perform word or phrase matching correctly. Without this feature, you would be restricted to simple substring searches.

 

Syntax

You can use Mandarin Chinese chunking with the contains and contains_any operators by adding an optional switch called "language". For example:

 

If you want to skip automatic chunking, use:

 

By the way, “微博” is the Mandarin Chinese translation of "weibo" which means microblog.

Examples

1.  Filter for a string. Here, we'll use the name of a Chinese TV show:

In this example, DataSift adds whitespace between the second and third characters because the title of the show "甄嬛传" contains the name of a person (甄嬛).

The argument is the name of a new TV show so it is something that DataSift has not encountered before. Nevertheless, the chunking algorithm is powerful enough to handle this new "word" effectively.

 

2.  Filter for comments from the youth demographic by recognizing commonly encountered Internet slang:

The string "酷" means "cool" in Mandarin Chinese. This keyword is often used by young people, and it frequently modifies another term.

The word “残酷” (which is Mandarin Chinese for "cruel") happens to contain the string “酷” but the chunking algorithm is intelligent enough to leave “残酷” as a single chunk.

 

3.  Combining these filters and adding tags, we have:

 


Twitter Mentions and Links

@Mentions

DataSift's filtering engine ignores all mentions of Twitter users in Tweets. Therefore, it would see this Tweet:

    If you like @DataSift you have to follow @DataSiftNews and @DataSiftDev every day.

as this:

    If you like you have to follow and every day.

Of course, this means that you cannot write a CSDL filter like this to look for mentions of @DataSift:

 

Instead, use the twitter.mentions target:

 

 

Links

URLs within Tweets are also invisible to the filtering engine. DataSift sees this Tweet:

    If you like @News surf to http://nytimes.com.

as this:

    If you like surf to.
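
The effect of hiding mentions and links from the content can be imitated in Python with two regular expressions. This is a rough model only: the patterns and the visible_text function are assumptions, not the rules DataSift actually applies, and punctuation directly after a URL is swallowed along with the URL in this simplified version.

```python
import re

MENTION = re.compile(r"@\w+")      # assumed shape of an @mention
URL = re.compile(r"https?://\S+")  # assumed shape of a link

def visible_text(tweet):
    """Drop @mentions and URLs, then collapse the leftover whitespace."""
    stripped = URL.sub(" ", MENTION.sub(" ", tweet))
    return " ".join(stripped.split())

print(visible_text(
    "If you like @DataSift you have to follow @DataSiftNews and @DataSiftDev every day."
))
# If you like you have to follow and every day.
```

A content filter run against the returned text can no longer see the user names or the link, which is why dedicated targets are needed for both.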

Consequently, you cannot write a CSDL filter like this one for URLs:

 

Instead, use a Links Augmentation target:

 

 
