Blog posts in Communication

Gerrit Schultz
Updated on Tuesday, 12 February, 2013 - 12:03

Gerrit Schultz describes the time he spent, from August to November, as an intern in the Development group at DataSift.

 

I'm very happy that, as part of my university studies, I have the chance to work as an intern with DataSift. It's certainly been a brilliant experience.

From the first day I've been involved in the regular development process. After only a few days I could see my first work results live in production. I had chosen to join the front-end team. The revamp of a central part of the UI including the Stream Preview and CSDL Code Editor, as well as the integration of the Query Builder as a new feature were planned for the next sprint. This meant that during the following month I could contribute to another big release and, at the same time, follow my goal of getting deeper into different JavaScript technologies. Being an active part of the team and, in a real-world scenario, working on a big code base allowed me to gather lots of hands-on experience.

An internship at DataSift is far from making coffee - unless you want some for yourself. The only time you spend in the kitchen is when you want to grab your favourite chocolate from the fridge. As an intern with DataSift you're given the chance to contribute to the development of serious software and broaden your knowledge in your own field of interest. I had the free choice of what I wanted the focus of my internship to be, and I've been given amazing support from everyone around me.

During the last few days I've been looking for some fresh input and trying out a bit of Scala. When I mentioned that I would be interested, I was immediately assigned a few tasks on a project that filters Facebook posts, got an introduction to the existing code, and found a Scala book on my desk. Everything is possible as long as you are keen on trying out new stuff.

Besides that, the working atmosphere is fantastic. Sometimes you almost forget that you're in an office. Occasional foam bullet gun fights are just as much a part of work life as helicopters being manoeuvred through the room and people playing card games after enjoying some delicious catering provided by DataSift for lunch. Even events like going to the cinema are arranged from time to time.

Overall a great place to work and a very good choice for a challenging, enriching internship.

Gerrit Schultz 

 

We're always looking for good people. If you have what it takes, if you're looking for an internship or a placement year, or if you're a recent graduate, you can reach us at careers@datasift.com. Please ensure that you are eligible to work in the United Kingdom and that you approach us directly, not through a recruitment agency.

 
edstenson
Updated on Wednesday, 4 January, 2012 - 11:59

Introduction

You've probably written streams that use CSDL's native operators such as contains and any. You might not have tried our embedded regular expression (regex) engine yet. If you already know how to write a regex, just read our regular expression page, take a look at the escaping guidelines, check out our regex_partial and regex_exact keywords, and you'll be ready to write your first regex stream.

If you haven't used a regex before, read on...

 

Regular expressions can seem complex to newcomers. It's easy to believe that the learning curve is going to be steep. For example:

    twitter.text regex_partial "[A-Z][a-z]{1,11}\\, [A-W][A-Z]\\W"

 

Simple Examples

The good news is that many regexes are easy to understand. Look at these:

  Find a "z": z
  Find any lowercase letter: [a-z]
  Find a period: \\.
  Find a comma: \\,
  Find any lowercase letter followed by a period: [a-z]\\.
  Find any uppercase letter: [A-Z]
  Find any lowercase letter followed by an uppercase letter: [a-z][A-Z]
  Find any letter, regardless of case: [a-zA-Z]
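
If you'd like to try these patterns before putting them in a stream, here is a minimal sketch using Python's re module. Python is an assumption made purely for illustration; it is not part of CSDL, and the helper name found and the labels in the dictionary are made up. Note that Python raw strings take a single backslash where the CSDL examples above use a doubled one.

```python
import re

# Each pattern from the list above, written for Python's re module.
# CSDL strings need "\\." where a Python raw string needs r"\.".
patterns = {
    "a z": r"z",
    "lowercase letter": r"[a-z]",
    "period": r"\.",
    "comma": r"\,",
    "lowercase then period": r"[a-z]\.",
    "uppercase letter": r"[A-Z]",
    "lowercase then uppercase": r"[a-z][A-Z]",
    "any letter": r"[a-zA-Z]",
}


def found(name, text):
    """Return True if the named pattern occurs anywhere in text."""
    return re.search(patterns[name], text) is not None
```

For instance, found("lowercase then uppercase", "iPhone") is True, while the same pattern finds nothing in "Hello world" because no lowercase letter there is directly followed by an uppercase one.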

 

CSDL Regex Operators

There are two regex operators in DataSift:

  • regex_partial allows you to filter for a pattern anywhere in the body of a message
  • regex_exact allows you to filter for a match against the entire body of the message

 

Let's use one of our samples with regex_partial. This stream searches for any Tweet that includes a lowercase letter followed by an uppercase letter.

    twitter.text regex_partial "[a-z][A-Z]"

 

This stream filters for any Tweet that includes "hello".

    twitter.text regex_partial "hello"

 

This stream uses CSDL's regex_exact operator instead of regex_partial, to filter for any Tweet where the entire text is "hello". This example looks very similar to the preceding one but don't be fooled; the previous stream accepts Tweets that are up to 140 characters long but this one rejects any Tweet longer than five characters. It's looking for an exact match on "hello":

    twitter.text regex_exact "hello"
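
The difference between the two operators can be sketched in Python, assuming that re.search and re.fullmatch are reasonable stand-ins for regex_partial and regex_exact. The function names below simply borrow the CSDL keywords for readability; they are not a DataSift API, and DataSift's own engine may differ in details.

```python
import re


def regex_partial(text, pattern):
    """Rough stand-in for CSDL regex_partial: the pattern may match anywhere."""
    return re.search(pattern, text) is not None


def regex_exact(text, pattern):
    """Rough stand-in for CSDL regex_exact: the pattern must match the whole text."""
    return re.fullmatch(pattern, text) is not None
```

With these helpers, "well hello there" passes regex_partial for "hello" but fails regex_exact, whereas the five-character text "hello" passes both.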

 

More examples

The ? metacharacter is useful in a regex. It indicates that the preceding element is optional: it may appear 0 or 1 times. For example, this filter searches for Tweets that include color or colour:

    twitter.text regex_partial "colou?r"

 

The ? appears immediately after the element that it applies to, and so does a counted quantifier such as {1,3}. Here's another example; this one filters for any run of 1, 2, or 3 lowercase letters:

    twitter.text regex_partial "[a-z]{1,3}"
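
To see what the ? and {1,3} quantifiers actually pick out of a piece of text, here is a small Python sketch. As before, Python's re module is an assumed stand-in for the DataSift engine, and the names optional_u, short_run, and first_match are hypothetical.

```python
import re

# "u?" makes the u optional, so both spellings match.
optional_u = re.compile(r"colou?r")

# "{1,3}" matches a run of one to three lowercase letters.
short_run = re.compile(r"[a-z]{1,3}")


def first_match(regex, text):
    """Return the first matching substring, or None when nothing matches."""
    m = regex.search(text)
    return m.group(0) if m else None
```

Here first_match(optional_u, "my favourite colour") returns "colour" and first_match(short_run, "abcdef") returns "abc", because the quantifier takes as many letters as it is allowed, up to three.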

 

And one final trick to remember is to use \\W to find any character that is not a letter, a number, or an underscore:

    twitter.text regex_partial "\\W"

Summary

You now have everything you need to decode that first example we showed:

    twitter.text regex_partial "[A-Z][a-z]{1,11}\\, [A-W][A-Z]\\W"

 

It looks for an uppercase letter followed by between 1 and 11 lowercase letters, followed by a comma and a space, followed by two uppercase letters, the first of which must not be X, Y, or Z. Finally it checks that the entire sequence is followed by a character that is not a letter, a number, or an underscore.

Here are some examples of the content it might give you:

  • Chicago, IL
  • Seattle, WA
  • Los Angeles, CA
  • Westchester Firehouse, NY
  • Wright Patterson Air Force Base, OH

An alphabetical list of the abbreviations for US states ends with Wyoming (WY) so, by excluding X, Y, and Z as the first letter, we help to make our filter as focused as possible. It isn't perfect - it could certainly be refined further. But it does demonstrate what a 33-character regex can do.
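
One way to explore where the filter could be refined is to run the same pattern over a few made-up strings. This is a sketch only: it assumes Python's re module behaves like the DataSift engine for this pattern, and the names CITY_STATE and city_state are hypothetical.

```python
import re

# The CSDL pattern "[A-Z][a-z]{1,11}\\, [A-W][A-Z]\\W",
# written with single backslashes for a Python raw string.
CITY_STATE = re.compile(r"[A-Z][a-z]{1,11}\, [A-W][A-Z]\W")


def city_state(text):
    """Return the first substring that looks like 'City, ST' plus the
    non-word character that follows it, or None when nothing matches."""
    m = CITY_STATE.search(text)
    return m.group(0) if m else None
```

In this sketch, city_state("Landed in Chicago, IL!") returns "Chicago, IL!", but city_state("Chicago, IL") returns None because \W needs a character to follow the state code. Multi-word names are only matched from their last word, so "Los Angeles, CA " yields "Angeles, CA ".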

Want to learn more? Our regular expression page includes links to tutorials and resources.

Agustin
Updated on Wednesday, 30 November, 2011 - 14:17

DataSift is the subject of the latest post on the High Scalability blog, which includes a detailed overview of the platform architecture and the problems involved in meaningfully filtering unstructured data from the Twitter API in real time.
 

‘You have to be able to reliably consume it, normalize it, merge it with other data, apply functions on it, store it, query it, distribute it, and oh yah, monetize it. Most of that in realish-time. And if you are trying to create a platform for allowing the entire Internet do to the same thing to the firehose, the challenge is exponentially harder.’

‘DataSift is in the exciting position of creating just such a firehose eating, data chomping machine. You see, DataSift has bought multi-year re-syndication rights from Twitter, which grants them access to the full Twitter firehose with the ability resell subsets of it to other parties, which could be anyone, but the primary target is of course businesses.’

‘DataSift's real innovation is in creating an Internet scale filtering system that can quickly evaluate very large filters (think Lady Gaga follower size) combined with the virtuous economics of virtualization, where the more customers you have the more money you make because they are sharing resources.’
 
 
Architecture In A Picture
 
‘DataSift has created an awesome picture of their overall architecture. Here's a small version, for the full sized version please go here.’
 
 
‘The diagram has two halves: everything on the left is data processing and everything on the right is data delivery. 40+ services run in the system, these include: license service, monitoring service, limit service, etc.’

‘The system as a whole has a number of different scaling challenges that must be solved nearly simultaneously: handling the firehose, low latency natural language processing and entity extraction on tweets, low latency in-line augmentation of tweets, low latency handling very large individual filters, low latency evaluation of a large number of complex filters from a large population of customers, link resolution and caching, keeping a history of the firehose by persisting the 1TB of data it sends each day, allowing analytics to be run on the history of the firehose, real-time billing, real-time authentication and authorization, a dashboard to let customers know the status of their streams, streaming filter results to 1000s of clients, monitoring every machine every filter and every subsystem, load and performance testing, handling high network traffic, and messaging between services in a low-latency fault tolerant manner.’
 
For more detail on DataSift’s architecture read the full article here.