Controlling Punctuation Characters

In DataSift there are three sets of characters:

Drop Set

Characters that belong to the Drop Set indicate to DataSift that one word has finished and another may be coming. A good example of such a character is the space.

Add at least one space between each word

Keep Set

Characters that belong to the Keep Set also indicate to DataSift that a word has finished and another may be coming, but DataSift also treats them as a word in their own right. By default, the comma belongs to the Keep Set so this phrase contains five words; one of them is a comma:

Hi there, I'm home

Regular Set

Characters that do not belong to the Drop Set or the Keep Set are members of the Regular Set. The most common examples are uppercase letters, lowercase letters, and numbers. A message that contains nothing but characters in the Regular Set contains just one word:

ThisIsAVeryVeryVeryVeryLongWord12345

Every character belongs to one (and only one) of these sets.
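The three sets can be modelled with a short tokenizer. The following is a minimal Python sketch of the behaviour described above, not DataSift's implementation; the function name tokenize and the one-character Drop and Keep Sets are assumptions made for illustration.

```python
# Illustrative model only; tokenize() is a hypothetical helper, not a DataSift API.
DROP = frozenset(" ")   # assumed Drop Set: just the space
KEEP = frozenset(",")   # assumed Keep Set: just the comma


def tokenize(text, drop=DROP, keep=KEEP):
    """Split text into words using Drop, Keep, and Regular Set rules."""
    words, current = [], ""
    for ch in text:
        if ch in drop or ch in keep:
            if current:              # a boundary ends the word in progress
                words.append(current)
                current = ""
            if ch in keep:           # Keep Set characters are words themselves
                words.append(ch)
        else:                        # Regular Set: extend the current word
            current += ch
    if current:
        words.append(current)
    return words


print(tokenize("Add at least one space between each word"))
print(tokenize("Hi there, I'm home"))
print(tokenize("ThisIsAVeryVeryVeryVeryLongWord12345"))
```

In this model the second phrase yields five words, one of which is the comma, and the third yields a single word.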

When you write CSDL filters using the contains, contains_any, and contains_near operators, you can specify exactly which characters belong to each of these sets. To do so, add the drop and keep modifiers to these operators.

Pre-configured Drop Set

There is a pre-configured default Drop Set. It contains these characters:

  • space
  • tab
  • new-line
  • carriage-return
  • form-feed

For example, each time DataSift encounters a space (either in the argument in one of your CSDL filters or in the incoming data) it knows that there is a word boundary and the space can be discarded.
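Because the same Drop Set applies to both the filter argument and the incoming data, runs of different whitespace characters produce the same words. The Python sketch below illustrates this with the five default characters listed above; split_words is a hypothetical helper, and Keep Set handling is left out to keep the example short.

```python
# Illustrative model only; split_words() is a hypothetical helper, not a DataSift API.
# space, tab, new-line, carriage-return, form-feed
DEFAULT_DROP = frozenset(" \t\n\r\f")


def split_words(text, drop=DEFAULT_DROP):
    """Split text at Drop Set characters and discard those characters."""
    words, current = [], ""
    for ch in text:
        if ch in drop:
            if current:
                words.append(current)
            current = ""
        else:
            current += ch
    if current:
        words.append(current)
    return words


filter_argument = "I love Big Data"
incoming_data = "I  love\tBig\r\nData\f"

print(split_words(filter_argument))
print(split_words(incoming_data))
```

Both calls print the same four words, because every character in the default Drop Set marks a boundary and is then discarded.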

Pre-configured Keep Sets

There are three pre-configured Keep Sets. The examples on this page refer to them by the names classic, extended, and default.

For example, to choose the classic Keep Set:

twitter.text contains [keep(classic)] "I love Big Data"

Low-level control

If you examine the classic Keep Set you'll see that it contains groups of punctuation symbols such as Pc, Pd, Pe, and Pf.
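The group names match the two-letter abbreviations of Unicode general categories: Pc is connector punctuation, Pd is dash punctuation, Pe is close punctuation, and Pf is final quote punctuation. Assuming DataSift's groups follow those categories (this page does not say so explicitly), Python's standard unicodedata module can show which group a character falls into. The helper name group_of is hypothetical.

```python
import unicodedata


def group_of(ch):
    """Return the Unicode general category of a single character."""
    return unicodedata.category(ch)


# One sample character per group; "\u00bb" is the right-pointing double angle quote.
for ch in ["_", "-", ")", "\u00bb"]:
    print(repr(ch), group_of(ch), unicodedata.name(ch))
```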

You can choose one of the individual groups:

twitter.text contains [keep(Pc)] "I love Big Data"

You can choose more than one of them:

twitter.text contains [keep(Pc + Pd + Pe)] "I love Big Data"

You can choose one of the pre-configured Keep Sets and then remove a group:

twitter.text contains [keep(classic - Pc)] "I love Big Data"

In this case, you are removing the members of Pc from your Keep Set so they become members of your Regular Set.

You can create a Keep Set from scratch:

twitter.text contains [keep("," + "." + "!")] "I love Big Data"

In this case, your Keep Set contains just three members.

You can remove individual characters from a pre-defined Keep Set:

twitter.text contains [keep(extended - "?")] "I love Big Data"
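The + and - operators in the keep modifier behave like set union and set difference. The Python sketch below mirrors the keep examples above with ordinary sets. It assumes the groups correspond to Unicode general categories, and classic_like and extended_like are hypothetical stand-ins, since the real contents of the pre-configured Keep Sets are not listed on this page.

```python
import unicodedata


def group(name, limit=0x100):
    """Characters below `limit` whose Unicode general category is `name`."""
    return frozenset(
        chr(cp) for cp in range(limit)
        if unicodedata.category(chr(cp)) == name
    )


# Hypothetical stand-ins for the pre-configured Keep Sets.
classic_like = group("Pc") | group("Pd") | group("Pe") | group("Pf")
extended_like = classic_like | frozenset("?!")

keep_one_group = group("Pc")                                # keep(Pc)
keep_several = group("Pc") | group("Pd") | group("Pe")      # keep(Pc + Pd + Pe)
keep_minus_group = classic_like - group("Pc")               # keep(classic - Pc)
keep_from_scratch = frozenset(",") | frozenset(".") | frozenset("!")
keep_minus_char = extended_like - {"?"}                     # keep(extended - "?")

print("_" in classic_like, "_" in keep_minus_group)
print(sorted(keep_from_scratch))
```

A character removed from the Keep Set this way, such as the underscore in keep_minus_group, is in neither special set and so behaves as a Regular Set character.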

The same techniques work with the Drop Set. You can add a new character to your Drop Set:

twitter.text contains [drop(default + ",")] "I love Big Data"

You can create a new Drop Set from scratch:

twitter.text contains [drop("," + "." + ";")] "I love Big Data"

An alternative syntax for the same Drop Set is:

twitter.text contains [drop(",.;")] "I love Big Data"
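The drop examples can be modelled the same way. In this Python sketch, which is an illustration rather than DataSift's implementation, joining single characters with + and writing them as one string produce the same set, and adding the comma to the default Drop Set makes it a discarded word boundary. The helper split_words is hypothetical.

```python
# Illustrative model only; split_words() is a hypothetical helper, not a DataSift API.
DEFAULT_DROP = frozenset(" \t\n\r\f")

drop_plus_comma = DEFAULT_DROP | {","}                         # drop(default + ",")
drop_joined = frozenset(",") | frozenset(".") | frozenset(";") # drop("," + "." + ";")
drop_string = frozenset(",.;")                                 # drop(",.;")


def split_words(text, drop):
    """Split text at Drop Set characters and discard those characters."""
    words, current = [], ""
    for ch in text:
        if ch in drop:
            if current:
                words.append(current)
            current = ""
        else:
            current += ch
    if current:
        words.append(current)
    return words


print(drop_joined == drop_string)
print(split_words("I love, Big Data", drop_plus_comma))
```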

The Regular Set

Sometimes it is convenient to place a character in the Regular Set deliberately, to ensure that it receives no special treatment. For example, the Turkish word cola'ya contains an apostrophe. The following CSDL ensures that the apostrophe belongs to neither the Drop Set nor the Keep Set, so DataSift treats cola'ya as a single word of seven characters.

interaction.content contains [keep(default - "'"), drop(default - "'")] "cola'ya"
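The effect of removing the apostrophe from both sets can be checked with a small model. This Python sketch is illustrative only: tokenize is a hypothetical helper, and the starting Keep Set is an assumed stand-in that deliberately includes the apostrophe so the difference is visible.

```python
# Illustrative model only; tokenize() is a hypothetical helper, not a DataSift API.
DEFAULT_DROP = frozenset(" \t\n\r\f")
SOME_KEEP = frozenset(",'")   # assumed Keep Set that happens to include "'"


def tokenize(text, drop, keep):
    """Split text into words using Drop, Keep, and Regular Set rules."""
    words, current = [], ""
    for ch in text:
        if ch in drop or ch in keep:
            if current:
                words.append(current)
                current = ""
            if ch in keep:
                words.append(ch)
        else:
            current += ch
    if current:
        words.append(current)
    return words


# Apostrophe still in the Keep Set: the word is split into three.
print(tokenize("cola'ya", DEFAULT_DROP, SOME_KEEP))

# Apostrophe removed from both sets, so it falls into the Regular Set.
keep = SOME_KEEP - {"'"}
drop = DEFAULT_DROP - {"'"}
print(tokenize("cola'ya", drop, keep))
```

Subtracting a character that is not in a set leaves the set unchanged, so removing the apostrophe from both sets is safe whichever set it started in.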