Like any developer-friendly company, DataSift has fans of the good old Vim editor working for us and with us. And since we spend so much time inside Vim, it is no wonder that we use it to write CSDL too. That is why I'm especially happy to announce today that CSDL syntax highlighting has been added to the Vim source code repository and should soon be shipping with all major operating systems worth using. (An OS worth using is one that ships with Vim, of course.)
In the meantime, if you absolutely cannot wait to try it but don't want to build Vim from sources, grab a copy of the datasift-vim repository, unpack it, and follow the instructions. (You must have Vim installed first, of course.)
Vim will highlight CSDL automatically when you edit files with the .csdl filename extension. If you want to force CSDL syntax highlighting while you are editing, do the following:
- Press Esc
- Type :syn on
- Press Enter
- Press Esc
- Type :set syntax=csdl
- Press Enter
To achieve the same effect in gVim or MacVim, select Syntax -> Show filetypes in menu and then select Syntax -> C -> CSDL.
So, there you have it. If you like Vim and you like CSDL, the two are best pals now. Enjoy our syntax file, and if you spot any problems with it, let us know. The datasift-vim project is Open Source and we do welcome patches, comments, and suggestions.
And if you like Vim, don't forget the good cause that Vim has been promoting for years.
PS. I'd like to thank Bram Moolenaar for adding my patches to Vim. It means a lot.
Social media gives us a way to sample trends and sentiment in real time. Consequently, it is very important that the analysis of the data we are looking at also happens in real time. And we want to help you, because here at DataSift we want our platform to be the Swiss Army knife of the social media analysis tools. We try to be flexible and do as much of the hard work as possible so that you can focus on analyzing the data instead of having to think how to feed it into your processing pipeline.
We strive to achieve that goal with our advanced Push data delivery system and its ever-growing set of connectors that can deliver the data you filter for to a variety of destinations. These could be third-party cloud storage services, such as Amazon AWS DynamoDB or an instance of CouchDB running on your own server. If there is a way to connect to a machine via the internet, we want to be able to deliver data to it.
When time is of the essence and you absolutely must be able to start analysing data as soon as you receive it, keeping data in RAM will help you shorten the time needed to access and process it. One popular tool for managing data in memory is Redis, an Open Source key-value store. And today we are very happy to announce the immediate availability of our new Redis connector, which will deliver the data you filter for straight to your Redis instance.
Getting started with Redis
It is your responsibility to set up your own instance of Redis and make sure it can be reached via the internet. If you have never used Redis before, we have help to get you started. Then it is just a matter of setting up a subscription via our Push API. The data will then be delivered straight into your Redis server ready for processing.
At your end, you will need a way to connect to your Redis server, and you can do that with one of the Redis clients. Many are available and you should be able to find the one that fits your needs quite easily.
The client alone is just one part of the equation. You will also need software that can unpack the interactions you get from DataSift from JSON into another format and look for the answers to your questions. Just like Redis, JSON is very well supported and many programming languages include appropriate libraries by default. As for data analysis tools, you will be the best judge of their usefulness, and it is always a good idea to ask your community for suggestions when you are not sure.
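As a rough illustration of the unpacking step, here is a minimal Python sketch that decodes a batch of JSON strings and counts interactions by type. It assumes each item is a JSON object with an "interaction" object that carries a "type" field; those field names, the sample data, and the function name are assumptions made for the example. How you fetch the raw strings from Redis depends on your client and on how your subscription stores the data, so that part is left out.

```python
import json
from collections import Counter


def count_interaction_types(raw_items):
    """Count interactions by type, skipping items that are not valid JSON objects."""
    counts = Counter()
    for raw in raw_items:
        try:
            item = json.loads(raw)
        except (TypeError, ValueError):
            continue  # ignore anything that is not valid JSON
        if not isinstance(item, dict):
            continue
        # "interaction" and "type" are assumed field names, not a documented schema
        interaction = item.get("interaction")
        if isinstance(interaction, dict):
            counts[interaction.get("type", "unknown")] += 1
    return counts


# Made-up sample data standing in for strings read from your Redis server
sample = [
    '{"interaction": {"type": "twitter", "content": "hello"}}',
    '{"interaction": {"type": "facebook", "content": "hi"}}',
    '{"interaction": {"type": "twitter", "content": "again"}}',
    "not json",
]
print(count_interaction_types(sample))
```

The function accepts any iterable of strings, so the same code works whether the items come from a Redis client, a file, or a test fixture.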
Please remember that you will be more likely to get reliable results if you start your analysis with a well-defined data set. That is where a well-written set of CSDL filters can help you pick out the most relevant interactions for further processing.
Those pesky limits (and how to cheat around them)
Keeping data in RAM lets you avoid delays caused by slow disk read and write operations, but that convenience comes at a cost: RAM is volatile and usually not available in large quantities even on high-end servers. It is also expensive to buy. Fortunately, you can architect a solution that reads data from a Redis store and saves it to disk. You can also rent servers with 110GB of RAM or more on an hourly basis, which can be a very cost-effective alternative to buying them or leasing them on a long-term contract. Amazon AWS EC2 High-Memory instances are one such solution.
The issue of volatility is important when you do not want to lose data. You can avoid problems by storing multiple copies of data on two or more servers either by replicating it yourself or by creating two or more Push subscriptions based on the same stream hash. You can also make backups of the data held in memory to disk.
And if you do lose data, you can retrieve it again using Historics. There will be a delay in receiving the data, which may render it no longer relevant, but please keep in mind that there is a way to "replay" your analysis, albeit at the additional cost of running a Historics query.
RAM size constraints are also fairly easy to overcome. If the data you want to analyze does not fit inside the physical memory installed in your machine, you will need to add RAM, get a machine with more RAM, or use a piece of software that can manage a farm of Redis servers, such as the Redis LightCloud manager.
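To give a feel for what spreading data over a farm of servers involves, here is a minimal Python sketch that maps each key to one server using a stable hash. It is an illustration of the general idea only, not a description of how LightCloud or any other manager works; the host names and the function name are hypothetical.

```python
import hashlib


def pick_server(key, servers):
    """Map a key to one server with a stable hash, so the same key always lands on the same node."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


# Hypothetical host names for three Redis servers
servers = ["redis-a:6379", "redis-b:6379", "redis-c:6379"]
for key in ["interaction:1", "interaction:2", "interaction:3"]:
    print(key, "->", pick_server(key, servers))
```

Simple modulo sharding like this moves most keys to a different server whenever the number of servers changes, which is one reason dedicated tools tend to use more elaborate schemes such as consistent hashing.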
If your social media analysis business needs to work in real time, our Redis connector is the tool that will help you get further ahead of your competition. Go mad, build something amazing, and let us know how else we could be helping you achieve your goals!
This post was written by Jacek Artymiak with valuable input from Ollie Parsley, the developer of the Redis Push Connector.
At DataSift we love open source. We use it and we create it. As part of our commitment, we're proud to announce that a major new component of the DataSift platform, the Query Builder, is now available. It's open source and you can download it from GitHub today. Take a look at our demo page to try out the Query Builder.
What is the Query Builder?
Everyone talks about Big Data, but not many people know how to handle it. We live it. We created the Query Builder to bring the advanced functionality of DataSift to business users.
We consume over a billion items per day, processing them, augmenting them with analytical data, and making them available in JSON format. The Query Builder includes a built-in dictionary that shows all 450 of the different targets that users can include in their DataSift filters, so even novices can get started right away.
The Query Builder is a code generator that produces SQL-like commands that users can share. It does everything via a point-and-click interface where users create queries visually. They can use the features of the Advanced Logic Editor, shown above, to build complex filters by combining simpler ones.
Responsive design and standards compliance for the post-PC era
You worked hard on your site and the last thing you want to put on it is an ugly widget that clashes with the rest of the page. Rest assured that we've put a lot of time and effort into making sure the Query Builder is standards-compliant, responsive, and ready for post-PC touch screen devices. We strive to follow the latest standards for good design, responsiveness, and programming, be they official or commonly agreed upon. In a browser, the Query Builder supports IE7+, Firefox 5+, Safari 4.1+, Opera 12+, and Chrome 12+.
The Query Builder project is hosted on GitHub. When you want to embed it on a web page, log into your server, change the working directory to the document root directory, and then clone the repository with a single command:
git clone https://github.com/datasift/editor.git
Alternatively, download the project archive and unpack it to the document root directory on your web server.
In both cases, you should end up with a directory that contains a number of subdirectories. Most of the time you will only need datasift-editor/minified, unless you want to do some deep modifications of the code base and the resources. But make sure you read our configuration guides before you do that; in most cases, you only need to make small modifications to the Query Builder object initialization code. This is done by overriding the exposed configuration options.
This process enables users to generate and share CSDL code without knowing how to program. All that power is available without having to learn how to write a single line of code. Simply clone the Query Builder repository or upload the files that the Query Builder needs to run to your server and add eight lines of HTML code to the page where you want to embed it.
Modular and highly-customizable by nature, the Query Builder is easy to embed on a web page, blog, or inside a web view in a desktop or mobile application. You can customize it to match a variety of requirements for integration and branding.
Customizing the Query Builder
The Query Builder code you can download today from GitHub is exactly the same code we use on our website. We give you full freedom of choice when it comes to the use of our code and the approach to implementation.
The simplest form of customization you would perform might be to make the Query Builder follow the look and feel of your site. This is easily done by overriding the CSS style definitions with your own modifications. If you want to go one stage farther, you can replace the Query Builder's graphical assets with your own. The design of the CSS stylesheet is optimized to facilitate quick changes with minimal effort. When you want to add your own CSS, simply import it after the original stylesheet and all will be well.
We have added built-in help in the form of tool tips so that end users of the Query Builder can learn more about DataSift's targets and operators. These are downloaded directly from our servers, so any changes will appear on your users' screens as soon as they are published, without you having to do anything unless you want to jump in and create your own tool tips.
Connecting to DataSift
Once your implementation of the Query Builder is fully operational, it's time to connect it to our platform. You need to capture the JCSDL generated by your users, pass it on to DataSift, capture the results, and present them back to the user. You have full freedom to implement your own solution here as well as full freedom of user management.
This is where you can add a lot of your own creativity and value. Processing and presentation of the results is one important area where you can create your own tools and make your users happy. We have prepared a sample implementation to get you started. Read through the code, try it, see what it does, and create your own magic. And you do not have to worry about backward compatibility. If you follow our configuration procedures, upgrading your installation of the Query Builder will be as simple as unpacking an archive.
You are also free to manage your own users in any way you like. You can choose to require your users to provide their own DataSift credentials or you can use a single set of DataSift credentials for company-wide access without having to manage multiple accounts. Or you could manage your users' accounts for them based on their internal credentials.
So, there you have it. Now go make something amazing and let the world know about it.
DataSift is built on open source software. Here are some of the comments our developers have made on the subject:
"It's like having a bigger team"
"We learn from the best by reading and using their code."
"Without open source, we wouldn't have PHP, we wouldn't have Python, we wouldn't have Perl."
"At DataSift, we're building a world-class platform, and we need to use the very best tools for the job."
From PHP to Hadoop, everything we do to filter over one billion items every day is built with components that the international community of developers has shared. Even our favorite data delivery format, JSON, is an open standard. It's obvious that the future lies in open source.
DataSift engineers contribute to and release a great deal of open source software. Some of the most important projects we use and contribute to include:
- Apache Hadoop - distributed computing framework, including HDFS and MapReduce
- Chef - configuration management tool
- Redis - advanced key-value store
- ZeroMQ - advanced socket library
Today, we're releasing a new data tool, the visual Query Builder. It's the latest in a series of open source projects, all of which are available from DataSift's GitHub account. Here's a summary of our recent work:
The Query Builder is a browser-based graphical tool that allows users to create and edit filters without needing to learn the DataSift Curated Stream Definition Language (CSDL). It started life as an internal project at DataSift where our staff quickly recognized its potential. The Query Builder is a serious tool that can be used to build complex CSDL filters without using DataSift's Code Editor.
HubFlow is an adaptation of GitFlow and the GitFlow tools git extension for working with GitHub.
If you look at Vincent Driessen’s original blog post, he’s listed all of the individual Git commands that you need to use to create all of the different branches in the GitFlow model. They’re all standard Git commands … and if you’re also still getting your head around Git (and still learning why it is different to centralised source control systems like Subversion, or replicated source control systems like Mercurial), it adds to what is already quite a steep learning curve.
Vincent created an extension for Git, called GitFlow, which turns most of the steps you need to do into one-line commands. At DataSift, we used it for six months, and we liked it - but we wanted it to do even more. We also wanted it to work better with GitHub, so to reduce confusion with the original GitFlow tools, we’ve decided to maintain our own fork of the GitFlow tools called HubFlow.
The Arrow dashboard is a visualization tool designed to show the full capabilities of DataSift. It's a framework that helps us to visualize and analyze DataSift's output streams. The goal was to find a way to show the huge amount of information that we filter. Arrow is open source too; in other words, we built this awesome project and we want you to play with it!
The visualizations are written using the D3 library for rendering. We currently support three types of visualizations: pie charts, line charts, and maps.
Here's a glimpse of one small part of Arrow but there's much, much more:
This is a suite of additional abstractions and utilities that extend Dropwizard. There are several modules:
Sound of Twitter
Sublime Text CSDL plug-in
Sublime Text plugin to validate and compile DataSift CSDL, consume a sample set of interactions, and enjoy correct syntax highlighting. Do it all without leaving Sublime Text!
Code and documentation licensing
The majority of open source software exclusively developed by DataSift is licensed under the liberal terms of the MIT License. The documentation is generally available under the Creative Commons Attribution 3.0 Unported License. In the end, you are free to use, modify and distribute any documentation, source code or examples within our open source projects as long as you adhere to the licensing conditions present within the projects.
Note that our engineers like to hack on their own open source projects in their free time. For code provided by our engineers outside of our official repositories on GitHub, DataSift does not grant any type of license, whether express or implied, to such code.
We support a variety of open source organizations and we're grateful to the open source community for their contributions. Our goal is to maintain our healthy, reciprocal relationship. If you have questions or encounter problems, please Tweet us at @DataSiftOS.
Gerrit Schultz describes the time he recently spent, from August to November, as an intern in the Development group at DataSift.
I'm very happy that as part of my university studies I'm now having the chance to work as an intern with DataSift. It's certainly been a brilliant experience.
An internship at DataSift is far from making coffee - unless you want some for yourself. The only time you spend in the kitchen is when you want to grab your favourite chocolate from the fridge. As an intern with DataSift you're given the chance to contribute to the development of serious software and broaden your knowledge in your own field of interest. I had the free choice of what I wanted the focus of my internship to be, and I've been given amazing support from everyone around me.
During the last few days I was looking for some fresh input and tried out a bit of Scala. When I mentioned that I would be interested, I was immediately offered a few tasks on a project for filtering Facebook posts, given an introduction to the existing code, and found a Scala book on my desk. Everything is possible as long as you are keen on trying out new stuff.
Besides that, the working atmosphere is fantastic. Sometimes you almost forget that you're in an office. Occasional foam bullet gun fights are just as much part of the work life as helicopters being manoeuvred through the room and people playing card games after enjoying some delicious catering provided by DataSift for lunch. Even events like going to the cinema are arranged from time to time.
Overall a great place to work and a very good choice for a challenging, enriching internship.
We're always looking for good people. If you have what it takes, if you're looking for an internship or a placement year, or if you're a recent graduate, you can reach us at firstname.lastname@example.org. Please ensure that you are eligible to work in the United Kingdom and that you approach us directly, not through a recruitment agency.