Today we announced the arrival of DataSift VEDO. In this post I’ll outline what this means to you as a developer or analyst.
DataSift VEDO gives you a robust solution to add structure to social data, solving one of the common challenges when working with unstructured ‘big data’. VEDO lets you define rules to classify data so that it fits your business model. The data delivered to your application needs less post-processing and is much easier to work with. The new features will save you time and give you a load more possibilities for your social data.
Data Is Meaningless Without Structure
When working with big data such as social content, one challenge you will always need to tackle is giving unstructured data meaningful structure. If you’re working with our platform currently, you will no doubt be extracting data to your server and running post-processing rules to organise the data to meet your needs.
Processing unstructured data is expensive and not much fun, but it’s where we excel. VEDO lets you offload processing onto our platform. You can now use CSDL (the same language you use for filtering) to add custom metadata labels and scores to data specifically for your use case.
Introducing Tagging And Scoring
VEDO introduces two new features that let you attach this metadata: tagging and scoring.
Tagging allows you to categorize interactions to match your business model. Any interaction that matches a tagging rule will be given the appropriate text label, serving as a boolean flag to indicate whether an interaction belongs to a category.
Scoring builds on tagging, allowing you to attach numerical values to interactions rather than just labels. You can build up a score over many rules and model subtle concepts such as priority, intention and weighting.
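To make the difference between the two concrete, here is a minimal sketch in Python. It is not CSDL and it is not how the platform is implemented; it only models the behaviour described above. The names TagRule, ScoreRule, contains and classify, the "content" field and the sample rules are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Simplified stand-in for an interaction; the real object has many more fields.
Interaction = Dict[str, str]
Predicate = Callable[[Interaction], bool]


@dataclass
class TagRule:
    label: str          # text label attached when the rule matches
    matches: Predicate


@dataclass
class ScoreRule:
    name: str           # which score this rule contributes to
    points: float       # amount added when the rule matches
    matches: Predicate


def contains(word: str) -> Predicate:
    """Case-insensitive substring test on the (assumed) 'content' field."""
    return lambda item: word.lower() in item.get("content", "").lower()


def classify(
    item: Interaction,
    tag_rules: List[TagRule],
    score_rules: List[ScoreRule],
) -> Tuple[List[str], Dict[str, float]]:
    # A tag is a boolean flag: the label is present or it is not.
    tags = [rule.label for rule in tag_rules if rule.matches(item)]
    # A score accumulates across every matching rule with the same name.
    scores: Dict[str, float] = {}
    for rule in score_rules:
        if rule.matches(item):
            scores[rule.name] = scores.get(rule.name, 0.0) + rule.points
    return tags, scores


# Hypothetical rules for a customer service use case.
TAG_RULES = [
    TagRule("rant", contains("terrible")),
    TagRule("enquiry", contains("how do i")),
]
SCORE_RULES = [
    ScoreRule("priority", 5, contains("refund")),
    ScoreRule("priority", 2, contains("terrible")),
]

post = {"content": "Terrible service, I want a refund"}
tags, scores = classify(post, TAG_RULES, SCORE_RULES)
print(tags, scores)
```

Here the post picks up one tag, while two separate rules both add to the same "priority" score. That accumulation is what lets scoring express degrees rather than simple membership.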
As you begin to use tagging and scoring more and more, you will want to be able to organise your growing set of rules. To help we have also introduced tag namespaces and reusable tag definitions. Tag namespaces allow you to define taxonomies of tags. You can group tags at any number of levels in namespaces and build deep schemas to fully reflect your model. Reusable tag definitions allow you to perfect your rules and reuse them across any number of streams and projects.
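As a rough illustration of what a taxonomy of tags looks like once it reaches your application, the sketch below folds a flat list of namespaced tag names into a nested structure. It assumes that namespace levels are separated by dots; that separator, the sample tag names and the helper functions build_taxonomy and leaves are assumptions made for this example, not part of the platform.

```python
from typing import Dict, Iterable, List

Taxonomy = Dict[str, "Taxonomy"]


def build_taxonomy(tag_names: Iterable[str], separator: str = ".") -> Taxonomy:
    """Fold namespaced tag names into a nested dict, one level per namespace."""
    tree: Taxonomy = {}
    for name in tag_names:
        node = tree
        for part in name.split(separator):
            node = node.setdefault(part, {})
    return tree


def leaves(tree: Taxonomy, prefix: str = "") -> List[str]:
    """List the full names of the tags (leaf nodes) below a namespace."""
    found: List[str] = []
    for key, child in tree.items():
        full_name = f"{prefix}.{key}" if prefix else key
        if child:
            found.extend(leaves(child, full_name))
        else:
            found.append(full_name)
    return found


# Hypothetical tag names, grouped at different depths.
names = [
    "service.complaint.rant",
    "service.praise.rave",
    "service.enquiry",
    "quality.spam",
]
tree = build_taxonomy(names)
print(leaves(tree["service"], "service"))
```

Grouping by namespace means your application can work with a whole branch (everything under "service", say) without knowing every individual tag in advance.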
Tagging and scoring are powerful features, but at this point you might not have grasped exactly how they can help you. Therefore alongside the tagging features we’ve also introduced a library of definitions to get you started. Some definitions you can use immediately in your streams (and benefit from our experience), and some serve as example definitions to show you what is now possible.
For example, we have definitions that help you score content for quality (such as how likely the content is to be a job advert) and make it easier to exclude spam. On the other hand, we have an example definition that shows how you can use the new features to classify conversations for customer service teams, picking out rants, raves and enquiries.
Although tagging is the main theme of the new release, there is an awful lot more happening here at DataSift. Alongside the release of VEDO we’re giving you more power, more connectivity and a wider range of sources to play with.
For instance we’ve just introduced delivery destinations for MySQL and PostgreSQL. These new destinations allow you to map your filtered data directly to a tabular schema and have it pushed directly into your database.
We’re also in the process of bringing many more sources onboard (you may have seen our recent announcements!), including many Asian social networks.
Look out for improvements to help you work with a wider variety of languages, updates to our developer tools and client libraries, and much, much more. I’ll cover all of these soon.
Watch this space
In summary there’s far too much to cover in detail here. So watch this space, as over the coming weeks I’ll cover every feature of the new release in depth, with worked examples and sample code so you can take advantage of all these new powers for yourself.
If you’re new to DataSift, what’s stopping you? Register now and experience the power of our platform for yourself!
On December 2nd, 2013, we plan to remove the "volume_info" field from the DataSift Historics API call response. Please ensure that your application does not expect to receive this field from Historics API calls by this date.
If you are using one of the official DataSift API client libraries, support for this has already been implemented in the following versions of the libraries:
- Java - 2.2.1+
- Python - 0.5.4+
- Ruby - 2.0.3+
- PHP - 2.1.4+
- .NET - 0.5.0+
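If you call the API from your own code rather than through a client library, the safest approach is to treat the field as optional. The sketch below shows one way to do that in Python. Only the "volume_info" field name comes from this announcement; the other field names in the sample payloads and the function name parse_historics_response are invented for illustration.

```python
import json
from typing import Any, Dict


def parse_historics_response(body: str) -> Dict[str, Any]:
    """Parse a Historics response without requiring 'volume_info'."""
    data = json.loads(body)
    # pop() with a default works both before and after the field is removed.
    volume_info = data.pop("volume_info", None)
    return {"fields": data, "volume_info": volume_info}


# Hypothetical payloads: one with the field, one without it.
before = '{"id": "abc123", "status": "running", "volume_info": {"twitter": 100}}'
after = '{"id": "abc123", "status": "running"}'

print(parse_historics_response(before)["volume_info"])
print(parse_historics_response(after)["volume_info"])
```

Code written this way keeps working on either side of the change, so you can deploy it ahead of the date.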
I've noticed some questions from clients who are using Managed Sources for the first time. In this blog I'm going to go through the steps to run a DataSift filter on a Managed Source:
- Create a token
- Create a Managed Source
- Create a CSDL filter for that Managed Source
- Start recording the output of the filter
- Start the Managed Source
I'll use Facebook in my examples, but the process is similar for all the Managed Sources the platform offers.
Suppose that you have hundreds of Facebook pages about your brands, plus a body of content created by users or customers. DataSift can aggregate it all: your brand pages, campaign pages, competitor's pages, and pages from industry influencers.
In this blog I'm going to focus on our UI but you can set up and manage everything via API calls instead and, for production use, that's the way to go. To learn more about that process, read our step-by-step guide.
Just to set the scene, DataSift offers two types of data source:
A public source (YouTube, for example) is one that anyone can access. A Managed Source is one that requires you to supply valid authentication credentials before you can use it.
The first task is to create an OAuth token that DataSift will use for authentication. The good news is that you don't even need to know what an OAuth token is, because it's generated automatically:
1. Log in and go to Data Sources -> Managed Sources.
2. Click on the Facebook tile.
3. Click Add Token.
A popup box appears, inviting you to sign into your Facebook account. If you look at the URL in the popup's address bar, you'll see that it's served by Facebook, not by us. That means you're giving your Facebook credentials to Facebook privately, just as you do any other time you sign in. You are not giving them to us and we cannot see them.
4. Log in to Facebook in the popup box.
The popup closes and you will now see that you have a token.
From now on, any time you run a filter in DataSift against this Managed Source, DataSift will use the token to gain access. It's secure; if you want to stop using the token, you can delete it from DataSift by clicking the red X. Or, in your Apps settings in Facebook, you can revoke it. If you do that, the token becomes useless.
5. In the Name field, specify a name for your Managed Source. Here, I've called it "Example".
6. Type a search term in the Search box and click Search. Here I'm going to monitor Ferrari cars and merchandise.
DataSift lists all the accounts that match your search term.
7. Select the ones you want to include in your filtering. In this example, I've chosen the candidate with the greatest number of likes.
8. Click Save
9. Click the My Managed Sources tab. You will see the source you just defined. Notice that the Start button is orange whereas the other two sources, which I defined before I took this screenshot, have a Stop button. It's important that you don't click Start yet. The first time you click it, DataSift delivers a backlog of posts from the past seven days. You need to create a stream and start a recording to capture those posts; otherwise they'll be lost. The next few steps explain how to do that.
10. Click on your Managed Source, "Example" in this case. DataSift displays the definition page for the source.
11. Click How to Use. Now you can grab the CSDL code for this Managed Source. It's a simple one-line filter that uses the source.id target and the unique id for the source you just defined.
12. Copy the CSDL code to the clipboard:
source.id == "c07504cc3a324848ba1fb5905287799b"
13. Create a filter with that CSDL. You're probably very familiar with this step already. Just click the Create Stream button, paste the CSDL code in from the clipboard, and save it.
Now you need to start recording the output of that filter. Recordings are under the Tasks tab in DataSift.
14. Click Start a Recording.
15. Choose the filter that you created in Step 13.
16. Click Start Now and choose an end time for your recording. For this first test, I'd recommend that you don't choose a long duration.
17. Click Continue and then click Start Task.
18. Now go back to My Managed Sources and click Start.
Your filter will start to collect data from the source and DataSift will record it automatically.
That's all you need to know to use Managed Sources from the UI. Notice that you didn't even need to write a filter to get started; the platform provided the code for you. And by starting the recording before you ran the filter, you made sure that no data was lost.
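If you later script this setup, the one-line filter from step 12 is easy to generate yourself. Here is a minimal Python sketch that builds the CSDL string from a source id. The helper name source_filter is my own, and the check that an id is 32 lowercase hexadecimal characters is an assumption based only on the example id shown above.

```python
import re

# Assumption: source ids look like the one in step 12 (32 lowercase hex chars).
_SOURCE_ID = re.compile(r"[0-9a-f]{32}")


def source_filter(source_id: str) -> str:
    """Return the one-line CSDL filter for a Managed Source id."""
    if not _SOURCE_ID.fullmatch(source_id):
        raise ValueError(f"unexpected source id format: {source_id!r}")
    return f'source.id == "{source_id}"'


csdl = source_filter("c07504cc3a324848ba1fb5905287799b")
print(csdl)
```

Validating the id before building the string guards against pasting in something that would break the quoting of the filter.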
The Pull Connector is the latest addition to our growing family of Push connectors. This new Push connector takes its name from the mechanism used to deliver the interactions you filter for: you pull data from our platform instead of us pushing it to you.
Even though the name of this connector might seem to be out of place for a Push connector, it makes sense to classify it as another Push connector, because it uses the same robust Push subsystem that powers other DataSift Push Connectors.
We designed it specifically for the clients who are firewalled from the public internet and prefer to keep and process data in house. The Pull Connector provides the following benefits:
Firewalls and network security policies are no longer an issue.
With Pull, there is no need to set up public endpoints. It simplifies firewall and network management on your side.
For example, you no longer need to ask your operations team to loosen up the firewall rules to enable connections from DataSift to a host that will receive data. They will not have to give up a precious public IP address or think of ways of redirecting traffic to a shared IP address.
Also, a change of the IP address of the host receiving data does not require a call to /push/update.
Data collection and processing at your own pace.
The Pull Connector uses the Push data queuing subsystem. Your data is stored for an hour in a Push queue, giving you freedom to collect it as often as you want (up to twice per second per Push subscription ID) and to request as much of it as you want, in batches of up to 20MB.
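To show what collecting at your own pace might look like on the client side, here is a minimal Python sketch of a polling loop that respects both limits. It deliberately makes no HTTP calls: fetch is a placeholder for whatever function you write to call /pull, and the names poll, fetch and handle are invented for this example. I have also assumed that 20MB means 20 * 1024 * 1024 bytes.

```python
import time
from typing import Callable

MIN_INTERVAL_SECONDS = 0.5            # at most two requests per second
MAX_BATCH_BYTES = 20 * 1024 * 1024    # assumed meaning of "20MB"


def poll(
    fetch: Callable[[int], bytes],
    handle: Callable[[bytes], None],
    rounds: int,
    size: int = MAX_BATCH_BYTES,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call fetch(size) `rounds` times, never more than twice per second."""
    if size > MAX_BATCH_BYTES:
        raise ValueError("requested batch is larger than the 20MB limit")
    last_call = None
    for _ in range(rounds):
        if last_call is not None:
            # Wait out whatever remains of the minimum interval.
            wait = MIN_INTERVAL_SECONDS - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        handle(fetch(size))


if __name__ == "__main__":
    # Stand-in fetch that returns a fixed payload instead of calling /pull.
    poll(lambda size: b'{"demo": true}', print, rounds=2)
```

Passing the clock and sleep functions in as parameters is only there to make the loop easy to test; in normal use the defaults are fine.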
You can retrieve data again, if necessary.
If data gets lost on your side, you can use the queue cursor mechanism to go back in time and retrieve it from the queue again. You have up to one hour to do so, which should give you plenty of time to handle technical problems.
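The idea behind a cursor is easier to see in a small model. The Python sketch below is an in-memory toy, not the platform's implementation: the class ReplayQueue, its methods and the use of a plain integer as the cursor are all assumptions made for illustration. It shows the two properties described above: asking again from an earlier cursor returns the same data, and anything older than an hour is gone.

```python
from typing import List, Tuple

RETENTION_SECONDS = 3600  # data is kept for an hour


class ReplayQueue:
    """Toy queue with a one-hour retention window and integer cursors."""

    def __init__(self) -> None:
        self._items: List[Tuple[float, str]] = []  # (enqueued_at, payload)
        self._offset = 0  # absolute position of self._items[0]

    def push(self, payload: str, now: float) -> None:
        self._items.append((now, payload))

    def _expire(self, now: float) -> None:
        # Drop everything that has been queued for more than an hour.
        while self._items and now - self._items[0][0] > RETENTION_SECONDS:
            self._items.pop(0)
            self._offset += 1

    def pull(self, cursor: int, now: float) -> Tuple[List[str], int]:
        """Return (payloads from `cursor` onward, cursor for the next call)."""
        self._expire(now)
        start = max(cursor, self._offset)
        batch = [payload for _, payload in self._items[start - self._offset:]]
        return batch, self._offset + len(self._items)


queue = ReplayQueue()
queue.push("a", now=0)
queue.push("b", now=10)
first, next_cursor = queue.pull(cursor=0, now=20)
again, _ = queue.pull(cursor=0, now=30)   # same cursor, same data
print(first, again, next_cursor)
```

If processing fails after a pull, you simply reuse the cursor you started from rather than the one you were given back.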
When you combine the robust foundations of the Push subsystem, the freedom to collect data at your own pace, and the ease of setting up a data collection and processing system without having to make changes to your organization's network and security setup, the Pull Connector becomes a very attractive solution.
And we saved the best for last: even though the Pull Connector introduces a new endpoint, /pull, for data collection, we implemented it using the same REST API you are already familiar with. You set it up just like any other Push connector and then call /pull to get your data.