Richard Caudle | 5th February 2014
At DataSift we pride ourselves on the power and flexibility of our filtering engine. One feature customers have asked for, though, is the ability to use wildcards in text-based filter conditions. Wildcards are now available on the platform, making it even simpler to write powerful queries.
Firstly, it's probably worth clarifying what a wildcard is in this context. The wildcard operator on DataSift allows you to match text values against a pattern that covers a range of possibilities, both when filtering and when tagging.
This can be extremely useful when you need to cover a collection of similar terms, misspellings, and incorrect or absent letter accents.
Think of the wildcard operator as another weapon in your arsenal. Although not as powerful as a regular expression, a wildcard is more easily understood and created, and incurs a lower cost. Of course, in some situations the precision of a regular expression may still be the best choice.
Imagine you have a situation where you'd like to write a filter for a term, but there are multiple variations of that term. This is common in many languages. For instance, imagine you'd like to filter for anything relating to printers and printing. Keywords would include:
print, prints, printer, printers, printing, printable
You could write a regular expression to cover these possibilities but, let's be honest, regular expressions, though powerful, can be a real headache.
Instead, using wildcards, you can now simply write:
interaction.content wildcard "print*"
Here the * matches any sequence of zero or more characters, so the pattern matches every word in our list.
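To make the trade-off concrete, here is a rough Python sketch comparing a hand-written regular expression with a regex stand-in for the wildcard. Python's re module is used purely for illustration, and the sketch assumes the pattern is tested against individual words; neither is a statement about how the platform evaluates the operator.

```python
import re

keywords = ["print", "prints", "printer", "printers", "printing", "printable"]

# Hand-written regular expression listing each suffix explicitly.
explicit = re.compile(r"print(s|ers?|ing|able)?")

# Rough regex stand-in for the wildcard "print*": "print", then anything.
wildcard_like = re.compile(r"print.*")

assert all(explicit.fullmatch(w) for w in keywords)
assert all(wildcard_like.fullmatch(w) for w in keywords)

# The wildcard form is broader: it also accepts words outside the list.
print(bool(explicit.fullmatch("printout")))       # False
print(bool(wildcard_like.fullmatch("printout")))  # True
```

The wildcard is far quicker to write, but it will also pick up words you did not list, which is where the extra precision of a regular expression can still earn its keep.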
In fact there are two wildcard characters you can make use of:
- * - Matches 0 or more characters
- ? - Matches exactly one character
The ? character is useful when looking for strings of a known pattern. For instance, imagine filtering for a word that is commonly misspelt, such as relevance:
interaction.content wildcard "relev?nce"
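If it helps to see the two wildcard characters spelled out, the following is a minimal Python sketch that translates a pattern into a regular expression. The function name wildcard_to_regex is hypothetical, and matching whole words case-insensitively is an assumption made for the illustration rather than a description of the platform's internals.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate '*' and '?' into a compiled regex; everything else is literal."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # zero or more characters
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    # Assumption: ignore case, mirroring the operator's stated default.
    return re.compile("".join(parts), re.IGNORECASE)

rx = wildcard_to_regex("relev?nce")
for word in ["relevance", "relevence", "relevnce", "relevaance"]:
    print(word, bool(rx.fullmatch(word)))
```

The first two words match. The last two do not, because ? has to stand in for exactly one character: no more, no fewer.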
You can also give the wildcard operator a list of terms to match, such as:
interaction.content wildcard "relev?nce, relev?nt"
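As a sketch of how a list of terms behaves, the Python below treats the argument as comma-separated patterns and reports a match if any one of them fits. The helper matches_any is hypothetical, trimming the whitespace around each term is an assumption, and Python's fnmatchcase is only a convenient stand-in for the * and ? rules.

```python
from fnmatch import fnmatchcase

def matches_any(word: str, pattern_list: str) -> bool:
    """Return True if word matches any comma-separated wildcard pattern."""
    # Assumption: whitespace around each term is ignored, and case is ignored.
    patterns = [p.strip().lower() for p in pattern_list.split(",")]
    return any(fnmatchcase(word.lower(), p) for p in patterns if p)

for word in ["relevance", "relevent", "relevancy"]:
    print(word, matches_any(word, "relev?nce, relev?nt"))
```

Here relevance and relevent each satisfy one of the two patterns, while relevancy satisfies neither.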
The wildcard operator is described in full in the DataSift documentation.
Query Builder Support
You can use the wildcard operator in CSDL or the Query Builder tool. In Query Builder the option is listed alongside "contain any" and the other text operators.
By default the wildcard operator is not case sensitive. You can, though, use the cs keyword to apply case sensitivity. For example, to match common misspellings of Massachusetts:
interaction.content cs wildcard "Massachus*ts"
Often, though, authors are just as likely not to capitalise proper nouns, so I'd only recommend using this option when you are sure it is appropriate.
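The effect of the cs keyword can be sketched in Python as follows. The function wildcard_match and its case_sensitive flag are hypothetical names for the illustration, and fnmatchcase again stands in for the * and ? rules; this is not the platform's implementation.

```python
from fnmatch import fnmatchcase

def wildcard_match(word: str, pattern: str, case_sensitive: bool = False) -> bool:
    """Match word against a wildcard pattern, optionally respecting case."""
    if not case_sensitive:
        # Default behaviour: compare lowercased copies of both strings.
        word, pattern = word.lower(), pattern.lower()
    return fnmatchcase(word, pattern)

pattern = "Massachus*ts"
for word in ["Massachusetts", "Massachusets", "massachusetts"]:
    default = wildcard_match(word, pattern)
    strict = wildcard_match(word, pattern, case_sensitive=True)
    print(word, default, strict)
```

All three words match by default, but the lowercase massachusetts is missed once case sensitivity is switched on, which is exactly the risk when authors skip the capital letter.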
That's All For Now...
Hopefully you'll find the new operator helps you write your filters more easily than using a regular expression. Remember, it's just one of a growing list of text operators that make text matching precise and powerful!
To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed