Lukas Klein describes the time he recently spent in August and September as an intern in the Development group at DataSift.
The last month has been really exciting for me. Me, that is 19-year-old student Lukas from Germany. I decided to work for DataSift as an intern before university, and it really paid off. Before, I had worked for either small one-man businesses or big players like SAP, but never for such a fast-growing startup. Even though I've never been to the Valley, I think working for DataSift is much like working for a San Francisco-based startup, except it's in the center of Europe.
The company is very engineering-driven, so the decisions are taken by people who really know how the technology works, which is a huge benefit if you're a developer. When I came into the office, which is located in the Enterprise Center of the University of Reading, for the first time, everyone welcomed me warmly (until the first Nerf gun battle started, but that's another story) and showed me around.
When you work at DataSift, nobody tells you what you have to use. You can choose whatever tools you want, whether it be a Mac or a PC running Linux (if you're using Windows, DataSift is not the right place for you, I guess), vim or Sublime Text, Coke or Pepsi (the fridge is always full!). All the people in the office are highly skilled and there's an expert for everything, whatever question you have. (I'm sure there's even someone who can help you with building a nuclear Nerf gun that can shoot Curiosity off Mars.)
I used my time at DataSift to dive into new technology I've had little time to use before, such as node.js, Backbone or redis. Working with the DataSift API is really straightforward and at the end of my first day I had a visualization up and running that showed the current trending articles on bbc.co.uk in growing bubbles.
The great thing at DataSift is that you can turn almost any idea into a real product. One morning when I was working on my CSDL (DataSift's own curation language) I thought it would be nice if I could do this in my favorite editor, Sublime Text. So I simply wrote a plugin for Sublime Text, put it on GitHub, and minutes later Stuart forked it and helped me to extend it. Here in the office everyone helps each other, and if you're stuck on a problem, it's only a matter of minutes until someone comes up with a great solution.
I've really enjoyed my time in the UK and I can only advise every student who's into cutting edge technology to check them out. Conclusion: If you don't mind getting hit by several foam bullets a day and know how to assemble IKEA furniture (you have to build your own desk), you should definitely come to DataSift!
We're always looking for good people. If you have what it takes, if you're looking for an internship or a placement year, or if you're a recent graduate, you can reach us at email@example.com. Please ensure that you are eligible to work in the United Kingdom and that you approach us directly, not through a recruitment agency.
DataSift offers organizations a cloud-based platform to filter for real-time social media data. Every second, social media sites generate massive amounts of data. This data can provide valuable insight to your organization. DataSift filters for content as it is posted. For instance, you could filter for the mention of an individual, a message posted on a social media site, or all messages posted within a specified location. DataSift offers you an integrated solution that filters, aggregates, and delivers the exact content that you need. This blog post aims to help you understand the various features that the DataSift platform offers.
Now that you are familiar with how DataSift works, let's look at the DataSift UI and learn how to get started.
The DataSift platform is easy to navigate and the first step is to create a DataSift account. You can register with your email address or you can use your Twitter, Facebook, LinkedIn, Google, Foursquare, or Yahoo account.
After signing up with DataSift, log in to your DataSift account to access your Dashboard. The Dashboard is the control panel for your account. You can manage your account from here and access many of DataSift's features. The Dashboard displays your API details which are required for authentication when you use the DataSift API.
You can also access Settings from the Dashboard, where you can manage your account settings, such as account details, billing details, data licenses, identities and password.
The Dashboard provides six tabs that navigate you to the different features that make up DataSift. Let's look at these features in brief.
You can create new streams or access existing streams by clicking on the Streams tab. You can create streams in the CSDL language using the Visual Query Builder or by writing CSDL code manually using the Code Editor.
Visual Query Builder
You don't have to be a developer to create filters for social media data streams. The Visual Query Builder allows you to construct filters for complex social media data streams without using the CSDL programming language. Simply choose a data source such as Twitter, then the relevant target field from a list of available target fields and, lastly, select or enter an argument describing what you want to filter.
You can customize the Visual Query Builder to allow users to build queries for a limited set of targets. It can also be integrated to match your organization's graphical identity scheme.
CSDL Code Editor
More advanced users, such as developers, prefer to work directly in our CSDL Code Editor. To create a stream in CSDL in the Code Editor, simply enter the CSDL commands that define the content you want to filter for. When you click Save & Close, the editor validates your code and notifies you if it finds an error.
Once you have created a stream, you can:
- preview the output data from your stream.
- consume the stream via the API.
- record the stream and export the output data.
- create a Historics query for your stream.
- share the CSDL code of your stream.
Once you have created your first streams, you can perform tasks on them. To access or monitor these tasks, click the Tasks tab. All your existing tasks are displayed on this page. You can also delete your tasks or export data from your tasks. You can perform two main tasks on your streams:
- Create a recording of your stream by clicking the Start a Recording button.
- Create a Historics query of your stream by clicking the New Historics query button.
You can use the DataSift platform to filter for content from a range of data sources such as Twitter, Facebook, and Amazon. The Data Sources tab displays all the websites from which we acquire data for your streams. Our sources include a range of blogs, boards, and media sharing websites, as well as some of the most widely used social media sites. However, keep in mind that you must activate and sign a license for a data source if you want to receive its data in your stream output.
DataSift also offers you options to export your output data to a range of data destinations such as FTP, HTTP, SFTP, Amazon S3, Amazon DynamoDB, ElasticSearch, Splunk Storm, and so on. You can view or access these by clicking the Data Destinations tab. You can add or edit settings for individual destinations from here. You must also ensure that they are correctly configured and set up with their own unique settings, including authentication details. DataSift allows you to test the connection from the platform to your data destinations.
You can monitor your usage statistics and the costs of streams that are currently running from the Billings tab. You can also view the total costs, usage, data volume, connected hours, and historic hours from the last seven days.
DataSift offers state-of-the-art technology to filter real-time data relevant to your organization. DataSift offers this service through a feature-packed user interface that is intuitive and easy to use. The DataSift GUI can be used by non-developers as well as advanced users. You can create streams to filter for content, record the streams, export output data from the streams, and create Historics queries to retrieve data from the past. You can also view the data sources through which we run your streams to filter for content. To export the output data from your streams to external data storage, you can configure your own data destination. Any activities you perform through our UI or the API are logged in your usage statistics. You can also view your billing details and DPU usage.
To try out and preview the DataSift platform, sign up today for a free trial.
First off, there are a number of targets you should be aware of when filtering for @mentions of Twitter usernames:
twitter.user.id - The ID of the Twitter user that sent this Tweet
twitter.user.screen_name - The @username of the Twitter user that sent this Tweet
twitter.mentions - An array of Twitter @usernames that are mentioned in this Tweet
twitter.mention_ids - An array of Twitter user IDs that are mentioned in this Tweet
twitter.in_reply_to_screen_name - The @username of the Twitter user this Tweet is replying to. Note: This @username will also appear in the twitter.mentions and twitter.mention_ids arrays
twitter.retweet.user.id - The ID of the Twitter user that sent this retweet
twitter.retweet.user.screen_name - The @username of the Twitter user that sent this retweet
twitter.retweet.mentions - An array of Twitter @usernames that are mentioned in this retweet
Both Tweets and retweets
interaction.author.id - The author's ID on the service from which they generated a post. For example, their Twitter user ID
Secondly, there are two important syntax rules you should be familiar with when writing your CSDL. These rules are useful to know both when filtering for @mentions, and filtering for other keywords:
Use of ‘@’ symbols when filtering for Twitter @usernames
You should not use the ‘@’ symbol when filtering for usernames. Twitter usernames are passed on to us from Twitter as the bare username, without the leading ‘@’ symbol. Further details of how our CSDL filtering engine works with regard to @Mentions, URLs and punctuation can be found on the documentation page - The CSDL Engine : How it Works.
Use of the IN and CONTAINS_ANY operators
contains_any - Matches if any one of your comma-separated keywords or phrases is contained as a word or phrase in the target field. For example, twitter.user.location contains_any “New, Old” will match locations such as “New York”, but not “Oldfield”.
in - Matches if one of your comma-separated keywords or phrases is an exact match of the full content of the target field. For example, twitter.user.location in “New York” would match the location “New York”, but not “New York, NY”.
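The difference between the two operators can be illustrated with a small Python sketch. This is not DataSift's implementation, only a model of the behaviour described above; the function names are made up for illustration, and the case-insensitive comparison is an assumption.

```python
import re
from typing import List


def _words(text: str) -> List[str]:
    # Break text into lowercase word tokens (case-insensitivity is an assumption).
    return re.findall(r"\w+", text.casefold())


def _split_args(argument: str) -> List[str]:
    # Split a comma-separated argument such as "New, Old" into its items.
    return [item.strip() for item in argument.split(",") if item.strip()]


def contains_any(target_value: str, argument: str) -> bool:
    """True if any keyword or phrase appears as whole words in the target value."""
    haystack = _words(target_value)
    for item in _split_args(argument):
        needle = _words(item)
        n = len(needle)
        # Look for the phrase as a contiguous run of whole words.
        if n and any(haystack[i:i + n] == needle
                     for i in range(len(haystack) - n + 1)):
            return True
    return False


def is_in(target_value: str, argument: str) -> bool:
    """True if the full target value exactly equals one of the items."""
    value = target_value.strip().casefold()
    return any(value == item.casefold() for item in _split_args(argument))
```

With these definitions, contains_any("New York", "New, Old") matches because "New" is a whole word in the location, while contains_any("Oldfield", "New, Old") does not, because "Old" is only part of a word. Likewise is_in("New York", "New York") matches, but is_in("New York, NY", "New York") does not.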
How to filter for users sending Tweets
The best targets to use if you want to filter on a list of users who are sending Tweets are twitter.user.id or twitter.user.screen_name. If you are only interested in people sending retweets, you would want to use the twitter.retweet.user.id or twitter.retweet.user.screen_name targets. If you would like to receive both Tweets and retweets, you will be better off using interaction.author.id in conjunction with ‘interaction.type == “twitter”’.
How to filter for Twitter users @mentioned in Tweets
You should use the twitter.mentions or twitter.mention_ids targets. People often try to filter incorrectly for Tweets containing mentions of @usernames using the following CSDL:
The DataSift filtering engine filters keywords by first stripping out any @mentions or links from the main body of text, and filtering them separately using the twitter.mentions targets and links augmentation respectively, so you should never be able to find a @mention by filtering on twitter.text or interaction.content.
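To see why a keyword filter on the text never finds an @mention, here is a rough Python model of the stripping step just described. It is only a sketch: the regular expressions are deliberately simplistic and the function name is hypothetical, not part of the DataSift platform.

```python
import re
from typing import List, Tuple

# Simplified patterns, assumed for illustration only.
MENTION = re.compile(r"@(\w+)")
LINK = re.compile(r"https?://\S+")


def split_tweet(text: str) -> Tuple[str, List[str], List[str]]:
    """Separate a Tweet into keyword text, mentioned usernames, and links."""
    links = LINK.findall(text)
    without_links = LINK.sub(" ", text)
    # Usernames are kept bare, without the '@' symbol.
    mentions = MENTION.findall(without_links)
    body = MENTION.sub(" ", without_links)
    # Collapse the whitespace left behind by the removed parts.
    return " ".join(body.split()), mentions, links


body, mentions, links = split_tweet(
    "Thanks @datasift for the demo http://example.com/x"
)
```

After the split, the keyword text no longer contains the username at all, so a keyword match against it cannot succeed; the username is only available in the separate list of mentions.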
Below is an example of a correct way to filter for @mentions of Twitter usernames within Tweets:
You could also use twitter.mention_ids to filter on the Twitter user ID, rather than the @username:
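If you assemble such filters from a list of usernames held elsewhere, a small helper can strip any stray ‘@’ symbols before building the CSDL. The Python below is a minimal sketch; the helper names and sample usernames are hypothetical, and the generated string simply follows the comma-separated form described earlier for the in operator.

```python
from typing import Iterable, List


def strip_at(username: str) -> str:
    # Usernames reach the filtering engine bare, so drop any leading '@'.
    return username.strip().lstrip("@")


def mentions_filter(usernames: Iterable[str]) -> str:
    """Build a twitter.mentions filter from a list of usernames."""
    cleaned: List[str] = [strip_at(u) for u in usernames]
    cleaned = [u for u in cleaned if u]  # ignore empty entries
    return 'twitter.mentions in "{}"'.format(", ".join(cleaned))


csdl = mentions_filter(["@datasift", " twitter "])
```

Here csdl holds the string twitter.mentions in "datasift, twitter", which you can paste into the Code Editor and validate there.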
Today, DataSift announces two new link resolution services. First, we're delighted to be partnering with bitly, the #1 link sharing platform, which powers 75 percent of the world’s largest media companies and half of the Fortune 500 companies. With over 20,000 white-labeled domains, bitly generates 200M clicks/day.
In addition, DataSift's own Links augmentation is now live, too. This is a massive update to our old TweetMeme links aggregator. Until now, we used TweetMeme to fully resolve links embedded in Tweets and other data sources but, with the increased volume of links traffic coming from Twitter, we were close to hitting TweetMeme’s maximum capacity of 40M links/day.
What's the big deal?
These two services are complementary. By definition, 100 percent of bitly interactions relate to links and clicks on links. In fact, the data volume can be so high that you might want to add an extra line of CSDL to throttle back the volume; here's how you do it:
But adding the Links augmentation provides even more opportunities for filtering. For example, here's some CSDL that filters in real time for clicks made within the UK on bitly links that lead to content with Apple in the title.
On top of that, DataSift's Links augmentation adds metadata to the interactions in your filters. I discuss the significance of metadata in another blog, Open Graph and Twitter Cards.
In DataSift, there are currently 15 targets that you can filter against. For example:
- bitly.cname allows you to filter against custom names in bitly such as es.pn (for the ESPN sports network) or nyti.ms (for the New York Times).
- bitly.referring_domain allows you to filter for clicks on links from particular domains; that is, links embedded on pages at domains you specify.
- bitly.country_code allows you to filter for clicks on bitly links from countries you specify.
Meanwhile, our Links augmentation offers 79 targets that you can filter against. For example:
- links.meta.newskeywords allows you to filter against Google news keywords.
- links.meta.twitter.card allows you to filter for Twitter Card content by type: summary, photo, or player.
- links.meta.opengraph.type allows you to filter for content that matches one or more of the multitude of Facebook Open Graph types.
A very compelling use case for bitly is trend analysis. We can already track Likes on Facebook or Retweets on Twitter and expect to see when something goes viral. But what if a story receives relatively little of this kind of attention but large numbers of clicks? To monitor clicks activity rather than sharing activity, bitly and the Links augmentation are perfect. For an all-round perspective, you could monitor bitly, the Links augmentation, Twitter retweets, and Facebook likes simultaneously.
The bitly.user.agent target is useful when you want to measure the popularity of a particular web client. Find out which of your content is most popular on mobile devices.
Find out what percentage of people publishing or sharing information about a particular subject are located in a specified area. Take it one stage further and approximate the size of the area that people are Tweeting from. Find out if people enjoyed a rock concert, or determine how quickly a wildfire is spreading.
The timezone for a click helps you find out when content is published or shared in different time zones. Use it to compare the kind of content popular in the morning in the US and mainland Europe.
Looking at the combined stats for bitly and the Links augmentation, DataSift resolves an average of 3,500 links per second, collecting the metadata at the same time and caching the results.
To learn more about these services and the engineering that makes them work, take a look at today's blog by Lorenzo Alberton, DataSift's Chief Technical Architect.
Bitly, the #1 link sharing platform, powering 75 percent of the world’s largest media companies and half of the Fortune 500 companies, with over 20,000 white-labeled domains, generates a lot of traffic: 80M new links a day and 200M clicks/day. In other words, on average, over 2,000 people click on a bitly link every second.
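The per-second figure follows directly from the daily totals. As a quick back-of-the-envelope check in Python (a sketch that assumes traffic is spread evenly over the day, which real traffic is not):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(daily_total: int) -> float:
    """Average rate per second for a daily total, assuming an even spread."""
    return daily_total / SECONDS_PER_DAY


clicks_per_second = per_second(200_000_000)  # about 2,315 clicks per second
links_per_second = per_second(80_000_000)    # about 926 new links per second
```

Peak rates will be higher than these averages, which is why capacity planning needs headroom beyond the mean.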
When we started exploring a partnership with bitly, we had to do some capacity planning to support their stream in our DataSift platform. We also took the opportunity to revise how we ingest high-volume streams and how we resolve links.
Firehose? More like raging rivers!
Luckily, we already had custom-built libraries to handle huge HTTP streams, which are split from one pipe into many more-manageable sub-streams, sliced and diced into single messages and fed into high-performance, persistent queues, ready to be processed by the rest of the platform. We learnt a lot from our experience with handling the Twitter Firehose, so it was just a matter of provisioning a few more machines and configuring the new “pipes”.
Links Resolution Architecture
Given that the bitly stream consists entirely of links data, we also got very excited by how we could use it. Until now, DataSift has been using the API of its own predecessor, TweetMeme, to fully resolve links embedded in Tweets and other data sources, but with the increased volume of links traffic coming from Twitter, we were close to hitting TweetMeme’s maximum capacity of 40M links/day.
What better opportunity to rebuild our links resolution service from the ground up? Now, resolving hundreds of links every second, with predictable latency and accuracy, is no easy task: 5 years of experience with TweetMeme proved to be an invaluable starting point. Taking it to the next level was a matter of redesigning the whole pipeline, keeping all the good parts and recoding where necessary.
The resulting architecture is summarized in the following diagram:
Every new link is cleaned and normalized, looked up in our cache, and immediately returned if found. If not, we try to follow a single hop, and we repeat the procedure until we reach the target page. Every hop is cleaned, normalized and cached, so we can resolve links sharing part of the chain much faster. At every hop, we evaluate the domain and we select specialized URL resolvers, which allow us to rewrite URLs on the fly and/or improve our cache hit ratio. We have stampede-protection systems in place, and watchdogs whose goal is to guarantee the lowest possible latency. Thanks to the new architecture, tens of workers can now resolve links in parallel to spread out the workload across different processes and different servers, so we can keep scaling out as the number of inbound links increases.
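The hop-by-hop loop with per-hop caching can be sketched in a few lines of Python. This is a simplified illustration, not our production code: the cache is a plain in-process dictionary, the next_hop callable stands in for the real HTTP fetch, and the specialized resolvers, stampede protection, and watchdogs are left out. All names here are hypothetical.

```python
from typing import Callable, Dict, List, Optional
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Clean a URL: lowercase scheme and host, drop the fragment."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


def resolve(url: str,
            next_hop: Callable[[str], Optional[str]],
            cache: Dict[str, str],
            max_hops: int = 10) -> str:
    """Follow redirects one hop at a time, caching every hop on the way."""
    chain: List[str] = []
    current = normalize(url)
    for _ in range(max_hops):
        if current in cache:
            # A previous resolution already passed through this hop.
            current = cache[current]
            break
        chain.append(current)
        target = next_hop(current)  # None means we reached the target page
        if target is None:
            break
        current = normalize(target)
    # Every hop seen on the way now points straight at the final target.
    for hop in chain:
        cache[hop] = current
    return current


# A fake redirect table standing in for real HTTP requests.
redirects = {
    "http://bit.ly/a": "http://t.co/b",
    "http://t.co/b": "http://example.com/story",
}
fetches: List[str] = []


def fake_next_hop(u: str) -> Optional[str]:
    fetches.append(u)
    return redirects.get(u)


cache: Dict[str, str] = {}
first = resolve("HTTP://bit.ly/a", fake_next_hop, cache)
fetches_after_first = len(fetches)
second = resolve("http://t.co/b", fake_next_hop, cache)
```

The second call starts in the middle of a chain that has already been resolved, so it is answered from the cache without any further fetches. This is the effect of caching every hop rather than only the original link.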
With these changes, we managed to reduce the (already low) failure rate, and vastly increase the number of successfully resolved links, instantly quadrupling the throughput on the same number of nodes. At the time of writing, we’re successfully resolving over 200M links/day.
Metadata > Data
The real power of DataSift is to tame huge volumes of traffic and to go well beyond what we receive from our data source, by augmenting each and every message with extra metadata, to add context, insights, and real quality to our inputs. So it was a natural evolution to extend the new and more powerful link resolution engine with extra features. We started with extracting the title of the target page, the language, the content type (we’re now able to resolve links to images and videos!), description, and keywords. These metadata fields are already adding a lot of value to the original message, which might not have any context otherwise. However, we didn’t stop there.
We decided to extract OpenGraph data, Twitter Cards data, and Google News keywords from every page too: almost 100 new properties about the page, the publication, the author, images and videos associated with a post, classification information, description, Twitter handles, information about entities (companies, people, places) mentioned, and so on. Enough information to make any data scientist drool. When we showed a preview to our internal data science team, they jumped around, happy for the early Christmas gifts.
Clicks + page metadata = Information heaven
By combining information about each click (user agent, geo location, domains, for example) and rich metadata about the target page, the number and quality of insights about what people visit and talk about has reached an entirely new level, and can really transform the meaning of social data mining. And with DataSift, accessing all this information has never been easier.