Blog

Jason's picture
Jason
Updated on Thursday, 10 October, 2013 - 17:35
DataSift is adding a new metadata field to each JSON object delivered via Push in the json_meta output format - a delivered_at timestamp. This new timestamp represents the time DataSift delivered this particular object. An example of a json_meta formatted Push delivery containing this new field can be seen below:
 
{"count":3, "hash":"4ede6111534c5e29145f", "hash_type":"historic", "id":"58802d124916ed826a08d58d791f85c5", "delivered_at":"Tue, 08 Oct 2013 09:53:33 +0000" "interactions":[{...
 
Please ensure your application is capable of accepting new output fields to prevent this change from interrupting your data delivery. This change is due to be released on Monday, October 14th, 2013.
Ed Stenson's picture
Ed Stenson
Updated on Thursday, 3 October, 2013 - 15:27

I've noticed some questions from clients who are using Managed Sources for the first time. In this blog I'm going to go through the steps to run a DataSift filter on a Managed Source:

  1. Create a token
  2. Create a Managed Source
  3. Create a CSDL filter for that Managed Source
  4. Start recording the output of the filter
  5. Start the Managed Source

I'll use Facebook in my examples, but the process is similar for all the Managed Sources the platform offers.

Suppose that you have hundreds of Facebook pages about your brands, plus a body of content created by users or customers. DataSift can aggregate it all: your brand pages, campaign pages, competitor's pages, and pages from industry influencers.

In this blog I'm going to focus on our UI but you can set up and manage everything via API calls instead and, for production use, that's the way to go. To learn more about that process, read our step-by-step guide.

Just to set the scene, DataSift offers two types of data source:

  • Public
  • Managed 

A public source (Youtube, for example) is one that anyone can access. A Managed Source is one that requires you to supply valid authentication credentials before you can use it.

 

Create a token

The first task is to create an OAuth token that DataSift will use for authentication. The good news is that you don't even need to know what an OAuth token is, because it's generated automatically:

1.  Log in and go to Data Sources -> Managed Sources.

 

2.  Click on the Facebook tile.

 

3.  Click Add Token.

 

A popup box appears, inviting your to sign into your Facebook account. If you look at the URL in the popup's address bar, you'll see that it's served by Facebook, not by us. That means you're giving your Facebook credentials to Facebook privately, just as you do any other time you sign in. You are not giving them to us and we cannot see them.

 

 

4. Log in to Facebook in the popup box.

The popup closes and you will now see that you have a token.

 

From now on, any time you run a filter in DataSift against this Managed Source, DataSift will use the token to gain access. It's secure; if you want to stop using the token, you can delete it from DataSift by clicking the red X. Or, in your Apps settings in Facebook, you can revoke it. If you do that, the token becomes useless.

 

Create a Managed Source

Now you can specify what you want to filter for.

5. In the Name field, specify a name for your Managed Source. Here, I've called it "Example".

 

6. Type a search term in the Search box and click Search. Here I'm going to monitor Ferrari cars and merchandise.

DataSift lists all the accounts that match your search term. Select which ones you want to include in your filtering. In this example, I've chosen the candidate with the greatest number of likes.

 

 

8. Click Save

 

Create a CSDL filter for that Managed Source

9. Click the My Managed Sources tab. You will see the source you just defined. Notice that the Start button is orange whereas the other two sources, which I defined before I took this screenshot, have a Stop button. It's important that you don't click Start yet. The first time you click it, DataSift delivers a backlog of posts from the past seven days. You need to create a stream and start a recording to capture those posts otherwise they'll be lost. The next few steps explain how to do that.

 

10. Click on your Managed Source, "Example" in this case. DataSift displays the definition page for the source.

 

 

11. Click How to Use. Now you can grab the CSDL code for this Managed Source. It's a simple one-line filter that uses the source.id target and the unique id for the source you just defined.

 

 

12. Copy the CSDL code to the clipboard:

      source.id == "c07504cc3a324848ba1fb5905287799b"

 

13. Create a filter with that CSDL. You're probably very familiar with this step already. Just click the Create Stream button, paste the CSDL code in from my clipboard, and save it.

 

 

Start recording the output of the filter

Now you need to start recording the output of that filter. Recordings are under the Tasks tab in DataSift.

14. Click Start a Recording.

 

15. Choose the filter that you created in Step 13.

16. Click Start Now and choose and end time for your recording. For this first test, I'd recommend that you don't choose a long duration.

17. Click Continue and then click Start Task.

 

Start the Managed Source

18. Now go back to My Managed Sources and click Start.

Your filter will start to collect data from the source and DataSift will record it automatically.

 

Summary

That's all you need to know to use Managed Sources from the UI. Notice that you didn't even need to write a filter to get started; the platform provided the code for you. And by starting the recording before you ran the filter, you made sure that no data was lost.

For production use, there's a powerful Managed Sources API, plus that step-by-step guide that I mentioned at the beginning of this blog.

Jacek Artymiak's picture
Jacek Artymiak
Updated on Tuesday, 3 September, 2013 - 12:30

The Pull Connector is the latest addition to our growing family of Push connectors. This new Push connector takes its name after the mechanism used to deliver the interactions you filter for: you pull data from our platform instead of us pushing it to you.

Even though the name of this connector might seem to be out of place for a Push connector, it makes sense to classify it as another Push connector, because it uses the same robust Push subsystem that powers other DataSift Push Connectors.

 

 

We designed it specifically for the clients who are firewalled from the public internet and prefer to keep and process data in house. The Pull Connector provides the following benefits:

  • Firewalls and network security policies are no longer an issue.

    With Pull, there is no need to set up public endpoints. It simplifies firewall and network management on your side.

    For example, you no longer need to ask your operations team to loosen up the firewall rules to enable connections from DataSift to a host that will receive data. They will not have to give up a precious public IP address or think of ways of redirecting traffic to a shared IP address.

    Also, a change of the IP address of the host receving data does not require a call to /push/update. 
     
  • Data collection and processing at your own pace.

    The Pull Connector uses the Push data queuing subsystem. Your data is stored for an hour in a Push queue, giving you freedom to collect it as often as you want (up to twice per second per Push subscription ID) and to request as much of it as you want, in batches of up to 20MB.
     
  • You can retrieve data again, if necessary.  

    If you need to request data again, you can go back in time for up to an hour using the queue cursor mechanism. It lets you retrieve data from the queue again in case it gets lost. You have up to one hour to retrieve it, which should give you plenty of time to handle technical problems.

When you combine the robust foundations of the Push subsystem, the freedom to collect data at your own pace, and the ease of setting up a data collection and processing system without having to make changes to your organization's network and security setup, the Pull Connector becomes a very attractive solution.

And we saved the best for last, even though the Pull Connector introduces a new endpoint, /pull, for data collection, we implemented it using the same REST API you are already familiar with. You set it up just like any other Push connector and then call /pull to get your data.

Gareth's picture
Gareth
Updated on Thursday, 9 May, 2013 - 12:17

In late 2012 I wrote about the migration of DataSift's Hadoop cluster to Arista switches but what I didn't mention was that we also moved our real-time systems over to Arista too.

Within the LAN

During our fact-finding trek through the Cisco portfolio we acquired a bunch of 4948 and 3750 switches which were re-purposed into the live platform. Unfortunately, the live platform (as opposed to Hadoop-sourced historical data) would occasionally experience performance issues due to the fan-out design of our distributed architecture amplifying the impact of micro-bursts during high traffic events. Every interaction we receive is augmented with additional meta data such as language detection, sentiment analysis, trend analysis and more. To acquire these values an interaction is tokenized into the relevant parts (for example, a Twitter user name for Klout score, sentences for sentiment analysis, trigrams for language analysis and so on). Each of those tokens is then dispatched to the service endpoints for processing. A stream of 15,000 interactions a second can instantly become 100,000+ additional pieces of data traversing the network which puts load on NICs, switch backplanes and core uplinks.

If a particular request were to fail then precious time is wasted on waiting for the reply, processing the failure and then re-processing the request. To combat this you might duplicate calls to service endpoints (for example, speculative execution in Hadoop parlance) which doubles your chances of success and those ~100,000 streams would become ~200,000 putting further stress on all your infrastructure.

At DataSift we discuss internal platform latency in terms of microseconds and throughput in tens of gigabits so adding an unnecessary callout here or a millisecond extra there isn't acceptable. We want to be as efficient, fast and reliable as possible. When we started looking at ways of improving the performance of the real-time platform it was obvious that many of the arguments that made Arista an obvious choice for Hadoop also meant it was ideal for our real-time system too. The Arista 7050 switches we'd already deployed have some impressive statistics in regards to latency so we needed little more convincing that we were on the right path (although the 1.28 Tbps and 960,000,000 packets per second statistics don't hurt either). For truly low-latency switching at the edge, one would normally look at the 7150 series but from our testing the 7048 switches were well within the performance threshold we wanted and enabled us to standardize our edge. We made use of our failure-intolerant platform design (detailed further below) to move entire cabs at a time over to the Arista 7048 switches with no interruption of service to customers. Once all cabinets were migrated and with no other optimizations at that point we saw an immediate difference in key metrics:
 

 

Simply by deploying Arista switches for our 'real-time' network we decreased augmentation latency from ~15,000µs down to 2200µs. Further optimizations to our stack and how we leverage the Linux kernels myriad options have improved things even more.

Epic Switches are only half the story

One of the great features of the Arista 7048 switches is their deep buffer architecture but in certain circumstances another buffer in the path is that last thing you want. Each buffer potentially adds latency to the system before the upstream can detect the congestion and react to it. The stack needs to be free of bottlenecks to prevent the buffers from filling up and the 7048 switches can provide up to 40Gb/s of throughput to the core which fits nicely with 40 1u servers in a 44u cabinet. With that said we're not ones to waste time and bandwidth by leaving the TOR switch if we don't have to. By pooling together resources into 'cells' we can reduce uplink utilization and decrease the RTT of fan out operations by splitting the workload into per-cabinet pools. 

With intelligent health checks and resource routing coupled with the Aristas non-blocking full wire speed forwarding in the event of a resource pool suffering failures the processing servers can call out cross-rack with very little penalty. 

That's great but I'm on the other side of the Internet

We are confident in our ability to provide low-latency, real-time content that is filtered and augmented. This enables us to publish live latency statistics of a stream consumed being consumed by an EC2 node from the other side of the planet on our status site. We can be this confident because we control and manage every aspect of our platform from influencing how data traverses the Internet to reach us, our routers, our switches all the way down to the SSD chipset or SAS drive spindle speed in the servers. (You can't say that if you're on someone's public cloud!)

User Latency
(They could be next door to a social platform DC or over in Antarctica)
10ms - 150ms
+
Source Platform Processing time
(For example, the time taken for Facebook or Twitter to process & send it on)
???ms
+
Trans-Atlantic fiberoptics
(For example, San Jose to our furthest European processing node)
~150ms
+
Trans-Pacific fiberoptics
(For example, from a European processing node to a customer in Japan)
~150ms
=
  ~500ms

When dealing with social data on a global scale there can be a lot of performance uncertainty with under-sea fiber cuts, carrier issues and entire IX outages but we can rest assured that once that data hits our edge we know we can process it with low latencies and high throughput. In conclusion I've once again been impressed by Arista and would whole heartedly recommend their switches to anyone else working with high volume, latency sensitive data. 

Reading List:

Arista switches were already a joy to work with (access to bash on a switch, what's not to love?) but Gary's insights and advice makes it all the better. Arista Warrior - Gary A. Donahue

Even with all the epicness of this hardware, if you're lazy with how you treat the steps your data goes through before it becomes a frame on the switch you're gonna have a bad time so for heavy duty reading The Linux TCP/IP stack book may help. The Linux TCP/IP Stack: Networking for Embedded Systems - Thomas F Herbert

Ed Stenson's picture
Ed Stenson
Updated on Thursday, 25 April, 2013 - 16:27

Have you tried our Query Builder yet? It's a visual tool that makes it easy for newcomers to get started with DataSift quickly, before they even begin to learn our query language, CSDL. Despite its simplicity, the Query Builder offers very nearly all the features on offer in the full language. It includes every CSDL operator and logical operator, together with very nearly all the targets and augmentations. Recently we added the ability to use parentheses and, with the latest release, we've added the NOT logical operator.

Let me give you an example. In CSDL, you can write a filter that includes one keyword and excludes another like this:

 

To date, it has not been possible to perform logical inversion using the Query Builder but now you can do it with a single click:

Adding a NOT in the Query Builder

To create rules that use NOT:

  1. Click Create New Filter.
     
  2. Choose IMDb -> Title.
  3. Choose Contains words as your operator.
     
  4. Type "Star" as the filter keyword.
     
  5. Click Save.
     
  6. Click Create New Filter again and build a second rule that looks for "Trek" in the IMDb title.
     
  7. Click Save and your Query Builder screen will show the two rules like this:


    The will filter for titles that include "Trek" AND "Star". We need to adjust it to filter for "Trek" AND NOT "Star".
     
  8. Click Advanced.

    Notice how the two rules now have numbers?


     
  9. Click NOT.



    The Query Builder adds the NOT in front of rule 1. This is exactly what we want, because rule 1 filters for "Star". If you wanted to apply the NOT to the rule for "Trek" you could drag and drop it in front of the "2".

  10. Click Save and Preview. The Query Builder saves your work and automatically generates code.

 

// JCSDL_MASTER 4bce1d2f67166ea38e0875cc79750c85 !1&2

// JCSDL_VERSION 2.1

NOT

// JCSDL_START 8a1624733bb708222fab239bcb5d8aaf imdb.title,contains_any,25-4 1

imdb.title contains_any "Star"

// JCSDL_END

AND

// JCSDL_START ea352777ca8a4b1e80c3f4cb60e22dfc imdb.title,contains_any,25-4 2

imdb.title contains_any "Trek"

// JCSDL_END

// JCSDL_MASTER_END

 

Several of the lines are commented out because they contain internal information for the Query Builder itself.  If we remove those lines, we're left with the original CSDL that I included at the beginning of this blog.

 

NOT imdb.title contains_any "Star"

AND

imdb.title contains_any "Trek"

 

Using NOT with parentheses

Let's look at an example that uses parentheses as well as NOT. Suppose you wanted to filter for blogs about mention the NASDAQ, but exclude mentions of Microsoft and Intel, two of the tech-laden index's most well-known components:

  1. Click Create New Filter.
     
  2. Choose Blog -> Content.
     
  3. Choose Contains words as the operator.
     
  4. Add "NASDAQ" as the filter keyword.
     
  5. Click Save.
     
  6. Click Create New Filter again and build a second rule that looks for "Microsoft" in the blog content.
     
  7. Finally, build one more rule that looks for "Intel" in the blog content.



    Now we need to add logical operators, parentheses, and a NOT operator.
     
  8. Click Advanced.
     
  9. Select the parentheses.



    By default, the Query Builder places the parentheses around your entire query definition.


     
  10. Drag the left parenthesis so that the parentheses surround rules 2 and 3.
     
  11. Click the second AND operator and it will change to OR.


     
  12. We need to make one final modification to make the query exclude Microsoft and Intel. Click NOT and drag it so that it operates on the clause defined within the parentheses.


     
  13. Click Save and Preview. The Query Builder saves your work and automatically generates code.

 

// JCSDL_MASTER c24e74b9aa6c3aed7c09d36aae51b661 1&!(2|3)

// JCSDL_VERSION 2.1

// JCSDL_START cfc90f784c8274788a2bb3984ba42ee1 blog.content,contains_any,27-6 1

blog.content contains_any "NASDAQ"

// JCSDL_END

AND  NOT (

// JCSDL_START 13a0d4a23c57744568df3c58dde08ec4 blog.content,contains_any,27-9 2

blog.content contains_any "Microsoft"

// JCSDL_END

OR

// JCSDL_START 18b98e1b4d1b7c312d6ac8827be9350c blog.content,contains_any,27-5 3

blog.content contains_any "Intel"

// JCSDL_END

)

// JCSDL_MASTER_END

 

If we strip out the comments, the code becomes easy to read:

 

blog.content contains_any "NASDAQ"

AND  NOT (

blog.content contains_any "Microsoft"

OR

blog.content contains_any "Intel"

)

 

Summary

To try this new functionality out, you'll need to grab a copy of the Query Builder. It's Open Source and you can find it on Github. We provide full documentation to show you how to embed the Query Builder on your own pages.

Pages

Subscribe to Datasift Documentation Blog