Regular Expressions

edstenson | 9th December 2011


You've probably written streams that use CSDL's native operators such as contains and any. You might not have tried our embedded regular expression (regex) engine yet. If you already know how to write a regex, just read our regular expression page, take a look at the escaping guidelines, check out our regex_partial and regex_exact keywords, and you'll be ready to write your first regex stream.

If you haven't used a regex before, read on...

Regular expressions can seem complex to newcomers, and it's easy to believe that the learning curve is going to be steep. For example, this stream looks intimidating at first glance:

twitter.text regex_partial "[A-Z][a-z]{1,11}\\, [A-W][A-Z]\\W"

Simple Examples

The good news is that many regexes are easy to understand. Look at these:

Find a "z"                                                       z
Find any lowercase letter                                        [a-z]
Find a period                                                    \\.
Find a comma                                                     \\,
Find any lowercase letter followed by a period                   [a-z]\\.
Find any uppercase letter                                        [A-Z]
Find any lowercase letter followed by an uppercase letter        [a-z][A-Z]
Find any letter, regardless of case                              [a-zA-Z]
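
If you'd like to try these patterns before putting them in a stream, here is a minimal sketch using Python's standard re module. Python is not the engine DataSift runs, so treat it as an approximation only. The helper name found and the sample strings are invented for illustration. The patterns use a single backslash here because Python raw strings don't need the extra escaping that CSDL strings do.

```python
import re

def found(pattern, text):
    """Return True if the pattern occurs anywhere in the text."""
    return re.search(pattern, text) is not None

# Hypothetical sample messages, not real Tweets.
print(found(r"z", "pizza"))             # True: contains a "z"
print(found(r"[a-z]\.", "The end."))    # True: "d" followed by a period
print(found(r"[a-z][A-Z]", "iPhone"))   # True: "i" followed by "P"
print(found(r"[a-z][A-Z]", "HELLO"))    # False: no lowercase letter at all
```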

CSDL Regex Operators

There are two regex operators in DataSift:

  • regex_partial allows you to filter for a pattern anywhere in the body of a message
  • regex_exact allows you to filter for a match against the entire body of the message

Let's use one of our samples with regex_partial. This stream searches for any Tweet that includes a lowercase letter followed by an uppercase letter.

twitter.text regex_partial "[a-z][A-Z]"

This stream filters for any Tweet that includes "hello".

twitter.text regex_partial "hello"

This stream uses CSDL's regex_exact operator instead of regex_partial to filter for any Tweet where the entire text is "hello". This example looks very similar to the preceding one, but don't be fooled: the previous stream accepts Tweets of any length up to 140 characters, as long as they contain "hello", whereas this one rejects any Tweet whose text is anything other than those five characters. It's looking for an exact match on "hello":

twitter.text regex_exact "hello"
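
The difference between the two operators can be sketched outside DataSift with Python's re module, where re.search behaves like a partial match and re.fullmatch like an exact one. The two function names below simply borrow the CSDL operator names; they are hypothetical Python stand-ins, not DataSift code, and the sample text is made up.

```python
import re

def regex_partial(pattern, text):
    # Approximates regex_partial: the pattern may match anywhere in the text.
    return re.search(pattern, text) is not None

def regex_exact(pattern, text):
    # Approximates regex_exact: the pattern must match the entire text.
    return re.fullmatch(pattern, text) is not None

print(regex_partial("hello", "well, hello there"))  # True
print(regex_exact("hello", "well, hello there"))    # False
print(regex_exact("hello", "hello"))                # True
```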

More Examples

The ? metacharacter is useful in a regex. It indicates that the preceding element must appear exactly 0 or 1 times. For example, this filter searches for Tweets that include the word color or colour:

twitter.text regex_partial "colou?r"
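
To see the ? on its own, here is a small Python sketch that searches a few invented strings for the same pattern. It uses Python's re module as a stand-in, so it only approximates what a stream would do.

```python
import re

pattern = re.compile(r"colou?r")

# Hypothetical messages: two spellings that should match, two that should not.
for text in ["my favourite colour", "color me surprised", "colouur", "colr"]:
    print(text, "->", bool(pattern.search(text)))
```

The first two strings match; the last two don't, because "u?" allows at most one "u" and the second "o" is not optional.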

The ? appears immediately after the thing that it applies to. Curly braces work the same way: {1,3} means that the preceding element must appear at least once and at most three times. This example filters for any run of 1, 2, or 3 lowercase letters:

twitter.text regex_partial "[a-z]{1,3}"

One final trick to remember is to use \\W to find any character that is neither a number nor a letter (nor an underscore), such as a space or a punctuation mark:

twitter.text regex_partial "\\W"
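
A quick way to get a feel for {1,3} and \W is to list what they match in a short string. This sketch uses Python's re.findall on invented text; it illustrates the metacharacters themselves rather than DataSift's engine.

```python
import re

# {1,3} is greedy: it takes up to three lowercase letters at a time.
print(re.findall(r"[a-z]{1,3}", "abcdefg hi"))   # ['abc', 'def', 'g', 'hi']

# \W matches anything that is not a letter, digit, or underscore.
print(re.findall(r"\W", "Hi, there_2!"))         # [',', ' ', '!']
```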


You now have everything you need to decode that first example we showed:

twitter.text regex_partial "[A-Z][a-z]{1,11}\\, [A-W][A-Z]\\W"

It looks for an uppercase letter followed by between 1 and 11 lowercase letters, followed by a comma and a space, followed by two uppercase letters, the first of which must not be X, Y, or Z. Finally, it checks that the entire sequence is followed by a character that is neither a letter nor a number.

Here are some examples of the content it might give you:

  • Chicago, IL
  • Seattle, WA
  • Los Angeles, CA
  • Westchester Firehouse, NY
  • Wright Patterson Air Force Base, OH

An alphabetical list of the abbreviations for US states ends with WY (Wyoming) so, by excluding X, Y, and Z as the first letter, we help to make our filter as focused as possible. It isn't perfect - it could certainly be refined further. But it does demonstrate what a 33-character regex can do.
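
Here is a rough Python sketch of the same pattern run against a handful of made-up messages. It is an approximation using Python's re module, with single backslashes because Python raw strings don't need CSDL's escaping. The sample texts are hypothetical and chosen to show both the hits and the imperfections.

```python
import re

# The pattern from the stream above, written as a Python raw string.
place = re.compile(r"[A-Z][a-z]{1,11}\, [A-W][A-Z]\W")

# Hypothetical messages, not real Tweets.
samples = [
    "Greetings from Chicago, IL!",     # matches "Chicago, IL!"
    "Landed in Los Angeles, CA.",      # matches "Angeles, CA."
    "Lunch in Springfield, XX today",  # no match: X is outside A-W
    "Moving to Seattle, WA",           # no match: nothing after "WA" for \W
    "Hello, OK then",                  # false positive: matches "Hello, OK "
]
for text in samples:
    m = place.search(text)
    print(text, "->", m.group(0) if m else None)
```

The last two samples show where the filter could be refined: it misses a place name at the very end of a message, and it accepts any capitalized word followed by two capitals.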

Want to learn more? Our regular expression page includes links to tutorials and resources.
