In DataSift there are three sets of characters:
Characters that belong to the Drop Set indicate to DataSift that one word has finished and another may be coming. A good example of such a character is the space.
Add at least one space between each word
Characters that belong to the Keep Set also indicate to DataSift that a word has finished and another may be coming, but DataSift also treats them as a word in their own right. By default, the comma belongs to the Keep Set so this phrase contains five words; one of them is a comma:
Hi there, I'm home
Characters that do not belong to the Drop Set or the Keep Set are members of the Regular Set. The most common examples are uppercase letters, lowercase letters, and numbers. A message that contains nothing but characters in the Regular Set contains just one word:
Every character belongs to one (and only one) of these sets.
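The three-set model can be sketched in a few lines of Python. This is only an illustrative model of the behaviour described above, not DataSift's implementation: the function name `tokenize` and the tiny example sets (a space in the Drop Set, a comma in the Keep Set) are assumptions chosen to match the examples on this page.

```python
def tokenize(text, drop=frozenset(" "), keep=frozenset(",")):
    """Split text into words using Drop, Keep, and Regular sets.

    Hypothetical model: any character that is in neither `drop`
    nor `keep` is treated as a member of the Regular Set.
    """
    words = []
    current = []
    for ch in text:
        if ch in drop:
            # Drop Set: ends the current word and is discarded.
            if current:
                words.append("".join(current))
                current = []
        elif ch in keep:
            # Keep Set: ends the current word and is a word itself.
            if current:
                words.append("".join(current))
                current = []
            words.append(ch)
        else:
            # Regular Set: becomes part of the current word.
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


print(tokenize("Hi there, I'm home"))
# ['Hi', 'there', ',', "I'm", 'home']
```

Under these assumed sets the phrase yields five words, one of which is the comma, and a string made up only of Regular Set characters comes back as a single word.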
When you write CSDL filters using the contains, contains_any, and contains_near operators, you can specify exactly which characters belong to each of these sets. You do this by adding the drop and keep modifiers to these operators.
Pre-configured Drop Set
There is a pre-configured default Drop Set. It contains these characters:
For example, each time DataSift encounters a space (either in the argument in one of your CSDL filters or in the incoming data) it knows that there is a word boundary and the space can be discarded.
Pre-configured Keep Sets
There are three pre-configured Keep Sets:
- The default keep punctuation character set is currently the classic set.
- The classic keep punctuation character set is the configuration that DataSift has used since we launched.
- The extended keep punctuation character set keeps all of the punctuation characters offered by classic, and adds more.
To choose the classic Keep Set, for example:
twitter.text contains [keep(classic)] "I love Big Data"
If you examine the classic Keep Set you'll see that it contains groups of punctuation symbols such as Pc, Pd, Pe, and Pf.
You can choose one of the individual groups:
twitter.text contains [keep(Pc)] "I love Big Data"
You can choose more than one of them:
twitter.text contains [keep(Pc + Pd + Pe)] "I love Big Data"
You can choose one of the pre-configured Keep Sets and then remove a group:
twitter.text contains [keep(classic - Pc)] "I love Big Data"
In this case, you are removing the members of Pc from your Keep Set so they become members of your Regular Set.
You can create a Keep Set from scratch:
twitter.text contains [keep("," + "." + "!")] "I love Big Data"
In this case, your Keep Set contains just three members.
You can remove individual characters from a pre-configured Keep Set:
twitter.text contains [keep(extended - "?")] "I love Big Data"
The same techniques work with the Drop Set. You can add a new character to your Drop Set:
twitter.text contains [drop(default + ",")] "I love Big Data"
You can create a new Drop Set from scratch:
twitter.text contains [drop("," + "." + ";")] "I love Big Data"
An alternative syntax here is:
twitter.text contains [drop(",.;")] "I love Big Data"
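The `+` and `-` expressions in the keep and drop modifiers read like ordinary set arithmetic, which the following Python sketch mimics. It rests on two assumptions that this page does not confirm: that group names such as Pc, Pd, and Pe correspond to the Unicode general categories of the same names, and that looking only at ASCII characters is enough for illustration. The helper `group` is hypothetical and is not part of CSDL.

```python
import unicodedata


def group(category):
    """ASCII characters whose Unicode general category is `category`.

    Assumption: a group such as Pc is modelled here by the Unicode
    general category of the same name, restricted to ASCII.
    """
    return {chr(cp) for cp in range(128)
            if unicodedata.category(chr(cp)) == category}


# keep(Pc + Pd + Pe): '+' is modelled as set union.
keep = group("Pc") | group("Pd") | group("Pe")

# Removing a group: '-' is modelled as set difference. The removed
# characters are no longer kept, so they fall into the Regular Set.
keep_without_pc = keep - group("Pc")

# drop("," + "." + ";") and the shorthand drop(",.;") name the same set.
from_scratch = {","} | {"."} | {";"}
shorthand = set(",.;")

print(sorted(keep_without_pc))
print(from_scratch == shorthand)
```

In this model, removing Pc takes the underscore out of the Keep Set, while the hyphen and the closing brackets remain.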
The Regular Set
Sometimes it is convenient to place a character in the Regular Set deliberately, to ensure that it receives no special treatment. For example, the Turkish word for cola is cola'ya with an apostrophe. The following CSDL ensures that the apostrophe belongs to neither the Drop Set nor the Keep Set so DataSift treats cola'ya as a single word of seven characters.
interaction.content contains [keep(default - "'"), drop(default - "'")] "cola'ya"
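To see why the placement of the apostrophe matters, the sketch below splits cola'ya three ways. It is a simplified model, not DataSift's tokenizer: `split_words` is a hypothetical helper, and the sets passed to it are small stand-ins rather than the real default sets.

```python
def split_words(text, drop, keep):
    """Hypothetical splitter: `drop` characters separate words and vanish,
    `keep` characters separate words and are emitted as words themselves,
    and everything else (the Regular Set) stays inside a word."""
    words, current = [], ""
    for ch in text:
        if ch in drop or ch in keep:
            if current:
                words.append(current)
            current = ""
            if ch in keep:
                words.append(ch)
        else:
            current += ch
    if current:
        words.append(current)
    return words


text = "cola'ya"
print(split_words(text, drop=" ", keep=",'"))  # apostrophe in the Keep Set
print(split_words(text, drop=" '", keep=","))  # apostrophe in the Drop Set
print(split_words(text, drop=" ", keep=","))   # apostrophe in the Regular Set
```

Only the third call, where the apostrophe is in neither set, returns cola'ya as one word of seven characters.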