Tokenization

The engine also separates the characters it sees into "tokens". Characters can be:

  • alphanumeric
  • whitespace
  • punctuation

Let's look at alphanumeric and whitespace characters first. Every time DataSift finds one or more whitespace characters between alphanumeric strings, it starts a new token. For example, it would split this string into three tokens:

The Polar Express

The tokens are:

  • The
  • Polar
  • Express

DataSift also treats each punctuation character as a token, so if we add a period at the end of our text, it sees four tokens:

The Polar Express.

Now the tokens are:

  • The
  • Polar
  • Express
  • .

DataSift groups alphanumeric characters together but it treats punctuation characters individually, so this string:

The Polar Express...

comprises six tokens; the final three of them are simply periods:

  • The
  • Polar
  • Express
  • .
  • .
  • .
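
The splitting rules above can be sketched in a few lines of Python. This is only an illustration, not DataSift's actual implementation: tokenize and TOKEN_PATTERN are hypothetical names, and the sketch assumes ASCII letters and digits and treats every other non-whitespace character as punctuation, which may be broader than the set DataSift really uses.

```python
import re

# Hypothetical pattern (an assumption, not DataSift's own rule set):
#   [A-Za-z0-9]+      a run of alphanumeric characters forms one token
#   [^A-Za-z0-9\s]    any other non-whitespace character is a token by itself
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")


def tokenize(text):
    """Split text into tokens; whitespace only separates tokens and is dropped."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("The Polar Express"))     # ['The', 'Polar', 'Express']
print(tokenize("The Polar Express."))    # ['The', 'Polar', 'Express', '.']
print(tokenize("The Polar Express..."))  # ['The', 'Polar', 'Express', '.', '.', '.']
```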

DataSift treats the following characters as punctuation:

Name                               Symbol
Exclamation mark                   !
Double quotes                      "
Hash                               #
Percent                            %
Ampersand                          &
Single quote                       '
Open parenthesis                   (
Close parenthesis                  )
Asterisk                           *
Plus                               +
Comma                              ,
Dash                               -
Period                             .
Forward slash                      /
Colon                              :
Semicolon                          ;
Less than                          <
Equals                             =
Greater than                       >
Question mark                      ?
At                                 @
Open square bracket                [
Backslash                          \
Close square bracket               ]
Caret                              ^
Underscore                         _
Backtick or grave accent           `
Open curly bracket                 {
Pipe                               |
Close curly bracket                }
Tilde                              ~
Left double quote                  “
Reversed double prime quote        ‶
CJK reversed double prime quote    〝
Right double quote                 ”
Double prime quote                 ″
CJK double prime quote             〞
Low double prime quote             〟
CJK ditto mark                     〃
CJK left corner bracket            「
Left ceiling                       ⌈
CJK right corner bracket           」
Right floor                        ⌋
CJK left white corner bracket      『
CJK right white corner bracket     』

How does tokenization affect my filters?

Suppose that you want to search for content that contains "50%"; that is, the number 50 followed immediately by a percent sign. The string would be represented by two tokens:

  • 50
  • %

Thus, a filter such as this one:

interaction.content contains "50%"

would match "50%" but it would also match "50 %" (note the space), because contains compares tokens rather than raw characters, so whitespace between the two tokens does not affect the match.

In other words, the filter would match any of these strings:

I scored 50%

I bought 50 %AWESOME%

50 %25
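
To see why all three strings match, here is a rough Python sketch of a token-based contains check. It is written under assumptions rather than taken from DataSift: tokenize and contains_tokens are hypothetical helpers, only ASCII input is handled, and case handling is left out.

```python
import re

# Hypothetical tokenizer: alphanumeric runs are grouped, every other
# non-whitespace character becomes a token by itself.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")


def tokenize(text):
    return TOKEN_PATTERN.findall(text)


def contains_tokens(content, phrase):
    """Return True if the phrase's tokens appear consecutively in the content's tokens."""
    haystack = tokenize(content)
    needle = tokenize(phrase)
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


# "50%" becomes the tokens ['50', '%'], so whitespace between them is irrelevant.
for sample in ["I scored 50%", "I bought 50 %AWESOME%", "50 %25"]:
    print(sample, "->", contains_tokens(sample, "50%"))
```

Each sample contains the token 50 followed directly by the token %, so the sketch reports a match for all three.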

So the filter would return false positives. To get around this problem, use contains and substr together: contains checks that 50 and % appear as adjacent tokens, and substr checks that the exact characters "50%" appear with no whitespace between them:

interaction.content contains "50%"
and interaction.content substr "50%"

Why can't we use substr alone? Because substr matches raw characters and ignores token boundaries, it would also return content that contained 150%, 250%, and so on.
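
The interplay of the two operators can be checked with a small Python sketch. As before, this is an approximation under stated assumptions, not DataSift's engine: tokenize, contains_tokens, substr, and combined are hypothetical helpers, and the tokenizer handles ASCII input only.

```python
import re

# Hypothetical tokenizer (assumption): alphanumeric runs are grouped,
# every other non-whitespace character is its own token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")


def tokenize(text):
    return TOKEN_PATTERN.findall(text)


def contains_tokens(content, phrase):
    """Token-based match: the phrase's tokens must appear consecutively."""
    haystack, needle = tokenize(content), tokenize(phrase)
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def substr(content, fragment):
    """Character-based match: plain substring search, blind to token boundaries."""
    return fragment in content


def combined(content):
    """Mimics: contains "50%" and substr "50%"."""
    return contains_tokens(content, "50%") and substr(content, "50%")


samples = ["I scored 50%", "I bought 50 %AWESOME%", "50 %25", "I scored 150%"]
for s in samples:
    print(f"{s!r}: contains={contains_tokens(s, '50%')} "
          f"substr={substr(s, '50%')} both={combined(s)}")
```

Of the four samples, only "I scored 50%" passes both checks: the two strings with a space fail substr, and "I scored 150%" fails contains because its numeric token is 150, not 50.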