Open Source Software at DataSift
DataSift is built on open source software. Here are some of the comments our developers have made on the subject:
"It's like having a bigger team."
"We learn from the best by reading and using their code."
"Without open source, we wouldn't have PHP, we wouldn't have Python, we wouldn't have Perl."
"At DataSift, we're building a world-class platform, and we need to use the very best tools for the job."
From PHP to Hadoop, everything we do to filter over one billion items every day is built with components that the international community of developers has shared. Even our favorite data delivery format, JSON, is an open standard. It's obvious that the future lies in open source.
DataSift engineers contribute to and release a great deal of open source software. Some of the most important projects we use and contribute to include:
- Apache Hadoop - distributed computing framework, including HDFS and MapReduce
- Chef - configuration management tool
- Redis - advanced key-value store
- ZeroMQ - advanced socket library
Today, we're releasing a new data tool, the visual Query Builder. It's the latest in a series of open source projects, all of which are available from DataSift's GitHub account. Here's a summary of our recent work:
The Query Builder is a browser-based graphical tool that allows users to create and edit filters without needing to learn the DataSift Curated Stream Definition Language (CSDL). It started life as an internal project at DataSift where our staff quickly recognized its potential. The Query Builder is a serious tool that can be used to build complex CSDL filters without using DataSift's Code Editor.
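To give a feel for what a visual builder does under the hood, here is a minimal sketch in Python of turning structured rules into a CSDL filter string. It is not the Query Builder's actual code: the helper names `condition` and `all_of` are hypothetical, and the targets and operators in the example are assumptions chosen purely for illustration.

```python
import json


def condition(target, operator, value):
    """Render one rule as 'target operator value'.

    String values are double-quoted; other values are written as-is.
    (Hypothetical helper, not part of any DataSift library.)
    """
    literal = json.dumps(value) if isinstance(value, str) else str(value)
    return f"{target} {operator} {literal}"


def all_of(*conditions):
    """Join rules so that every one of them must match."""
    return " AND ".join(conditions)


if __name__ == "__main__":
    # Assumed example targets and operators, for illustration only.
    csdl = all_of(
        condition("interaction.content", "contains", "coffee"),
        condition("twitter.user.followers_count", ">", 1000),
    )
    print(csdl)
```

The point of a graphical tool is that the user only picks the target, the operator, and the value; assembling the filter text is done for them.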
HubFlow is an adaptation of GitFlow, and of the GitFlow tools extension for Git, for working with GitHub.
In his original blog post, Vincent Driessen lists all of the individual Git commands that you need to create the different branches in the GitFlow model. They’re all standard Git commands … and if you’re still getting your head around Git (and still learning why it differs from centralised source control systems like Subversion, or replicated source control systems like Mercurial), that adds to what is already quite a steep learning curve.
Vincent created an extension for Git, called GitFlow, which turns most of the steps you need to do into one-line commands. At DataSift, we used it for six months, and we liked it - but we wanted it to do even more. We also wanted it to work better with GitHub, so to reduce confusion with the original GitFlow tools, we’ve decided to maintain our own fork of the GitFlow tools called HubFlow.
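As a rough illustration of what those one-line commands save you, the sketch below lists the standard Git commands behind starting and finishing a feature branch in the GitFlow model. It is written in Python only to keep the steps grouped and testable; the function names `start_feature` and `finish_feature` are hypothetical, the branch and remote names are assumed defaults, and nothing here is taken from the GitFlow or HubFlow source.

```python
# Illustrative only: these functions build command strings, they do not run Git.


def start_feature(name, base="develop"):
    """Commands to open a feature branch off the integration branch."""
    return [f"git checkout -b {name} {base}"]


def finish_feature(name, base="develop", remote="origin"):
    """Commands to merge a finished feature back and tidy up."""
    return [
        f"git checkout {base}",
        f"git merge --no-ff {name}",  # always record a merge commit for the feature
        f"git branch -d {name}",      # delete the local feature branch
        f"git push {remote} {base}",
    ]


if __name__ == "__main__":
    for command in start_feature("myfeature") + finish_feature("myfeature"):
        print(command)
```

With the GitFlow tools installed, the same two steps become `git flow feature start myfeature` and `git flow feature finish myfeature`.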
The Arrow dashboard is a visualization tool designed to show the full capabilities of DataSift. It's a framework that helps us to visualize and analyze DataSift's output streams. The goal was to find a way to show the huge amount of information that we filter. Arrow is open source too; in other words, we built this awesome project and we want you to play with it!
The visualizations are written using the D3 library for rendering. We currently support three types of visualizations: pie charts, line charts, and maps.
Here's a glimpse of one small part of Arrow, but there's much, much more:
This is a suite of additional abstractions and utilities that extend Dropwizard. There are several modules:
Sound of Twitter
Sublime Text CSDL plug-in
A Sublime Text plug-in that lets you validate and compile DataSift CSDL, consume a sample set of interactions, and enjoy correct syntax highlighting. Do it all without leaving Sublime Text!
Code and documentation licensing
The majority of open source software developed exclusively by DataSift is licensed under the liberal terms of the MIT License. The documentation is generally available under the Creative Commons Attribution 3.0 Unported License. In short, you are free to use, modify and distribute any documentation, source code or examples within our open source projects as long as you adhere to the licensing conditions present within the projects.
Note that our engineers like to hack on their own open source projects in their free time. For code provided by our engineers outside of our official repositories on GitHub, DataSift does not grant any type of license, whether express or implied, to such code.
We support a variety of open source organizations and we're grateful to the open source community for its contributions. Our goal is to maintain our healthy, reciprocal relationship. If you have questions or encounter problems, please tweet us at @DataSiftOS.