Tokenization
The engine also separates the characters it sees into "tokens". Each character is one of:
- alphanumeric
- whitespace
- punctuation
Let's look at alphanumeric and whitespace characters first. Every time DataSift finds one or more whitespace characters between alphanumeric strings, it starts a new token. For example, it would split this string into three tokens:
The Polar Express
The tokens are:
- The
- Polar
- Express
It also treats punctuation as a token, so if we add a period at the end of our text, DataSift would see four tokens:
The Polar Express.
Now the tokens are:
- The
- Polar
- Express
- .
DataSift groups alphanumeric characters together, but it treats each punctuation character individually, so this string:
The Polar Express...
comprises six tokens; the final three of them are simply periods:
- The
- Polar
- Express
- .
- .
- .
The punctuation characters are:

| Name | Symbol |
| --- | --- |
| Exclamation mark | ! |
| Double quotes | " |
| Hash | # |
| Percent | % |
| Ampersand | & |
| Single quote | ' |
| Open parenthesis | ( |
| Close parenthesis | ) |
| Asterisk | * |
| Plus | + |
| Comma | , |
| Dash | - |
| Period | . |
| Forward slash | / |
| Colon | : |
| Semicolon | ; |
| Less than | < |
| Equals | = |
| Greater than | > |
| Question mark | ? |
| At | @ |
| Open square bracket | [ |
| Backslash | \ |
| Close square bracket | ] |
| Caret | ^ |
| Underscore | _ |
| Backtick or grave accent | ` |
| Open curly bracket | { |
| Pipe | \| |
| Close curly bracket | } |
| Tilde | ~ |
| Left double quote | “ |
| Reversed double prime quote | ‶ |
| CJK reversed double prime quote | 〝 |
| Right double quote | ” |
| Double prime quote | ″ |
| CJK double prime quote | 〞 |
| Low double prime quote | 〟 |
| CJK ditto mark | 〃 |
| CJK left corner bracket | 「 |
| Left ceiling | ⌈ |
| CJK right corner bracket | 」 |
| Right floor | ⌋ |
| CJK left white corner bracket | 『 |
| CJK right white corner bracket | 』 |
How does tokenization affect my filters?
Suppose that you want to search for content that contains "50%"; that is, the number 50 followed immediately by a percent sign. The string would be represented by two tokens:
- 50
- %
Thus, a filter such as this one:
interaction.content contains "50%"
would match "50%" but it would also match "50 %" (note the added space), because contains matches on tokens even when whitespace separates them.
In other words, the filter would match any of these strings:
I scored 50%
I bought 50 %AWESOME%
50 %25
So the filter would return false positives. How can we get around this problem? By using contains and substr together, we can build a robust, reliable filter:
interaction.content contains "50%"
and interaction.content substr "50%"
Why can't we use substr alone? Because substr performs a literal substring match, so it would also return content containing 150%, 250%, and so on, since "50%" is a substring of those strings.
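The interplay of the two operators can be sketched in Python. These are hypothetical helper functions for illustration, not DataSift's implementation: contains matches the phrase's token sequence anywhere in the content's token stream, while substr is a plain literal substring test, and only content passing both is a true match.

```python
import re

def tokenize(text):
    # Same tokenization rule as described earlier: alphanumeric runs are
    # one token, each punctuation character is its own token.
    return re.findall(r"[0-9A-Za-z]+|[^0-9A-Za-z\s]", text)

def contains(content, phrase):
    # Sketch of contains: the phrase's tokens must appear consecutively
    # in the content's token stream, whatever whitespace lies between them.
    ct, pt = tokenize(content), tokenize(phrase)
    return any(ct[i:i + len(pt)] == pt for i in range(len(ct) - len(pt) + 1))

def substr(content, s):
    # Sketch of substr: a plain, literal substring match.
    return s in content

for text in ["I scored 50%", "I bought 50 %AWESOME%", "50 %25", "I gave 150%"]:
    print(text, "->", contains(text, "50%") and substr(text, "50%"))
# Only "I scored 50%" prints True: the false positives fail substr,
# and "150%" fails contains because its first token is "150", not "50".
```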