Regular Expressions: Examples

On this page we provide some examples of regular expressions in DataSift. The aim is to demonstrate how to use the regex_partial and regex_exact operators in CSDL and to show when and how to escape special characters. If you are new to regular expressions, we've provided links to tutorials, tools, and resources on our main regex page.

DataSift uses the RE2 regex engine. Be sure to take a look at the RE2 syntax.

 

regex_partial

1. Filter for "heey" or "heeey" or "heeeey" anywhere within a post. Note that no characters are escaped:

    interaction.content regex_partial "he{2,4}y"

 

2. Filter for the same terms, but only when they appear at the start of a post. Note that no characters are escaped:

    interaction.content regex_partial "^he{2,4}y"
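The difference between examples 1 and 2 can be checked offline before deploying a filter. The sketch below uses Python's `re` module as a stand-in for RE2. That is an assumption: the two engines agree on this simple syntax, but DataSift itself runs RE2. The helper name `partial_match` is hypothetical and only approximates regex_partial.

```python
import re


def partial_match(pattern: str, text: str) -> bool:
    """Approximate regex_partial: True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None


# Unanchored: two to four "e" characters between "h" and "y", anywhere in the post.
unanchored = r"he{2,4}y"
# Anchored: the same terms, but only at the very start of the post.
anchored = r"^he{2,4}y"

samples = ["heey there", "well heeeey", "hey you", "heeeeey"]
for text in samples:
    print(text, partial_match(unanchored, text), partial_match(anchored, text))
```

Only "heey there" satisfies both patterns. "well heeeey" satisfies the unanchored pattern alone, and the last two samples match neither because they contain one and five "e" characters respectively.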

 

3. Filter for this movie title, using either the British or American spelling of color:

    interaction.content regex_partial "The Colou?r Purple"

 

4. To filter for a string that includes a question mark, we must escape the question mark with a backslash (\) in the regular expression (What\?). Because the backslash character also has a special meaning in CSDL, we must escape the backslash too:

    interaction.content regex_partial "What\\?"

 

5. Filter for any post that ends with a lowercase alphabetic character followed by a closing parenthesis. Note that we double-escaped the parenthesis:

    interaction.content regex_partial "[a-z]\\)$"

 

6. Filter for the character string "abc" followed by any character:

    interaction.content regex_partial "abc."

 

7. Filter for the character string "abc" followed by a period. Note that we double-escaped the period:

    interaction.content regex_partial "abc\\."
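Examples 6 and 7 differ only in whether the period is escaped. A small sketch makes the effect visible. It again assumes Python's `re` behaves like RE2 for these patterns, and the patterns are written as raw regular expressions, without the extra CSDL layer of escaping.

```python
import re

any_char = r"abc."      # "." matches any single character (not a newline by default)
literal_dot = r"abc\."  # "\." matches only a literal period

for text in ["abcd", "abc.", "abc"]:
    print(text, bool(re.search(any_char, text)), bool(re.search(literal_dot, text)))
```

"abcd" matches only the first pattern, "abc." matches both, and "abc" matches neither because no character follows the "c".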

 

8. Filter for any Tweet that starts with a five-letter word followed by a whitespace character. The sequence \s represents whitespace, and we double-escaped its backslash:

    interaction.content regex_partial "^[a-zA-Z]{5}\\s"

 

9. Filter for the character string "hello" including the double quotes. Note that we escaped the double quotes for CSDL; the actual regular expression is "hello":

    interaction.content regex_partial "\"hello\""

 

10. Filter for the character string "hello," including the double quotes and the comma. The comma does not need to be escaped:

    interaction.content regex_partial "\"hello,\""
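The two layers of escaping shown in examples 4 to 10 can be applied mechanically: write the regular expression first, then escape it for the CSDL string. The helper below is a hypothetical sketch, not part of any DataSift library. It assumes that the backslash and the double quote are the only characters that need escaping inside a CSDL string, which is what the examples above suggest.

```python
def to_csdl_literal(regex: str) -> str:
    """Wrap a raw regex in a CSDL string literal (hypothetical helper).

    Assumption: inside a CSDL string only the backslash and the double
    quote need a preceding backslash.
    """
    # Escape backslashes first so the backslashes added for quotes are not doubled.
    escaped = regex.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


print(to_csdl_literal(r"What\?"))    # "What\\?"
print(to_csdl_literal(r"[a-z]\)$"))  # "[a-z]\\)$"
print(to_csdl_literal('"hello,"'))   # "\"hello,\""
```

The printed values are the same string arguments used in examples 4, 5, and 10.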

 

11. Note the RE2 syntax for flags. For example, to create a case-insensitive filter for a string such as "zzzzzzzzzzz", we can write:

    interaction.content regex_partial "(?i)z{3,100}"

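The opening parenthesis and question mark, (?, signal the start of the block of flags, and the closing parenthesis ends it. In this example we need just one flag, i, which makes the match case-insensitive.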
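The effect of the flag can be tried out locally. This sketch assumes Python's `re` module, which accepts the same leading `(?i)` syntax as RE2 for this pattern. The sample strings are made up for illustration.

```python
import re

# (?i) turns on case-insensitive matching for the whole pattern.
pattern = r"(?i)z{3,100}"

for text in ["zzzzzzzzzzz", "ZZZ", "zZzZ", "zz"]:
    print(text, bool(re.search(pattern, text)))
```

The first three samples match regardless of case. "zz" does not, because the pattern requires at least three consecutive z characters.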

 

regex_exact

The escaping rules for the regex_exact operator are identical to those for regex_partial. The only difference between the two operators is that regex_partial filters for the regex anywhere within a post whereas regex_exact requires the entire post to match the regex.

1. Filter for Tweets that read "Thank you!":

    interaction.content regex_exact "Thank you!"

Note that if there is any other text in the post, the filter will not normally match that post.

However, in a Tweet, Twitter handles are ignored during filtering so the regex will match a Tweet like this one:

    @datasift Thank you!
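The contrast between the two operators can be approximated with Python's `re.search` and `re.fullmatch`. This is a sketch under assumptions: Python's engine stands in for RE2, the helper names `partial` and `exact` are hypothetical, and the code does not model the special handling of Twitter handles described above.

```python
import re


def partial(pattern: str, text: str) -> bool:
    """Approximate regex_partial: the pattern may match anywhere in the text."""
    return re.search(pattern, text) is not None


def exact(pattern: str, text: str) -> bool:
    """Approximate regex_exact: the pattern must match the entire text."""
    return re.fullmatch(pattern, text) is not None


pattern = r"Thank you!"
for text in ["Thank you!", "Thank you! See you soon"]:
    print(text, partial(pattern, text), exact(pattern, text))
```

Both helpers accept the first sample. Only `partial` accepts the second, because the extra text prevents a whole-string match.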

 
