How do accented characters and punctuation work in DataSift?

Posted by Jason

Accented characters in your CSDL filter are matched exactly as written. For example, the following CSDL:

  interaction.content contains "café"

will match the string "café", but will not match "cafe". This is because accented characters are distinct from their non-accented counterparts: the "é" and "e" characters have the Unicode code points U+00E9 and U+0065 respectively.
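To check the code points for yourself, here is a minimal Python sketch. It is independent of DataSift, and the helper name code_point is just an illustration:

```python
import unicodedata


def code_point(ch: str) -> str:
    """Return the Unicode code point of a single character in U+XXXX form."""
    return f"U+{ord(ch):04X}"


# "\u00e9" is the precomposed "é"; the escape avoids any ambiguity about
# how the accented character is stored in the source file.
for ch in ("\u00e9", "e"):
    print(code_point(ch), unicodedata.name(ch))
```

The two characters print different code points and different Unicode names, which is why a filter on one does not match the other.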
 
As a practical CSDL example, if you are filtering on the name "Joséphine", you may also want to consider filtering on the non-accented version of the name "Josephine" for cases where users do not add the accent to the name:

  interaction.content contains_any "Joséphine, Josephine"
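If you have many names to cover, you could generate the non-accented variants rather than typing them by hand. The following is a rough Python sketch, not part of any DataSift library: strip_accents and contains_any_clause are hypothetical helpers, and the sketch assumes that simply dropping combining marks gives the spelling users are likely to type.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def contains_any_clause(target: str, terms) -> str:
    """Build a contains_any clause listing each term and its unaccented form."""
    variants = []
    for term in terms:
        for variant in (term, strip_accents(term)):
            if variant not in variants:  # skip duplicates, keep order
                variants.append(variant)
    joined = ", ".join(variants)
    return f'{target} contains_any "{joined}"'


print(contains_any_clause("interaction.content", ["Jos\u00e9phine"]))
```

Terms that contain commas or double quotes would need escaping before being placed in the clause; this sketch does not handle that.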

 
As an example of how DataSift treats punctuation, let's consider the hyphen. If we want to filter for "air-conditioning", we could use the following CSDL:

  interaction.content contains "air-conditioning"

This would match the following two strings:
  • air-conditioning
  • air - conditioning
But not these two:
  • air conditioning
  • airconditioning
 
This is due to the way DataSift tokenizes strings. The string in question, "air-conditioning", is split into three 'chunks':
  • air
  • -
  • conditioning
This means DataSift is looking for these three chunks in a row, either with or without the whitespace around the hyphen.
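The chunking behaviour described above can be approximated with a short Python sketch. This is only an illustration of the idea, not DataSift's actual tokenizer: it assumes that each run of word characters is one chunk, each punctuation character is its own chunk, and whitespace is discarded. The function names are made up for this example.

```python
import re


def tokenize(text: str) -> list:
    """Split text into chunks: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def contains(content: str, phrase: str) -> bool:
    """Return True if the phrase's chunks appear consecutively in the content."""
    tokens = tokenize(content)
    needle = tokenize(phrase)
    if not needle:
        return False
    n = len(needle)
    return any(tokens[i:i + n] == needle for i in range(len(tokens) - n + 1))


for candidate in ("air-conditioning", "air - conditioning",
                  "air conditioning", "airconditioning"):
    print(repr(candidate), contains(candidate, "air-conditioning"))
```

Under these assumptions the first two strings match and the last two do not, because only the first two produce the chunks "air", "-", "conditioning" in a row.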
This topic is covered in more detail on our documentation page, "The CSDL Engine: How it Works".

1 year 2 months ago