In this guide we'll look at how you can migrate your Gnip-based solution to the DataSift platform.
This guide assumes you have been using Gnip's PowerTrack product. As a PowerTrack user you can easily migrate your solution to DataSift STREAM.
Before you get started, take a look at our What is DataSift STREAM? page.
There you'll learn the key features of the DataSift platform and the terms we'll use in this guide.
Accessing the DataSift platform
You can access the DataSift platform using the dashboard and our REST API. The dashboard is a great place to get started with DataSift and to configure your account, but for production work we strongly recommend you use the API as not all features are available through the dashboard.
To access the dashboard visit the login page and enter your username and password. You can use the dashboard to:
- view your account details.
- view reports for your usage of the platform.
- enable & configure data sources.
- create and preview data filters.
When you access the API you will need to provide your username and API key. You can find these details on the dashboard home page.
We recommend you use one of our client libraries to access the API. The client libraries are supported by our engineers and cover all platform features.
To learn the basics of using the platform take a look at our quick start guide.
Configuring data sources
To access data using the DataSift platform you first need to enable a data source in your DataSift account.
When working with Gnip you will have had a product for each data source, for example PowerTrack for Tumblr. On the DataSift platform the equivalent is a data source.
In the dashboard you can enable sources that are included in your account package. Note that when you write a filter and start to consume data, your filter will by default run against all of your enabled sources. We recommend you start with one source, such as Tumblr.
To enable the Tumblr data source:
- Select Data Sources in the dashboard menu.
- Click on the Tumblr source.
- Click the Activate button to activate the source.
You will need to complete the form and agree to the terms & conditions for the data source.
Configuring data augmentations
Augmentations add extra value to the raw data that arrives from data sources. Augmentations are similar to Enrichments on Gnip's platform. For example the language augmentation detects the language a post is written in.
On the DataSift platform you need to enable augmentations that you'd like to be run against your data sources. You can do so in the dashboard. Enabled augmentations will be run against all of the sources you have enabled in your account.
To enable the language augmentation:
- In the dashboard click on Data Sources in the top menu.
- Click on the Language Detection augmentation.
- Click Activate & Sign License.
- Complete the signature box and click Agree.
You can enable additional augmentations following the same steps.
Defining a filter
Now that you have enabled a data source and selected data augmentations, the next step is to create a filter to select data from the source for delivery to your application.
Filters are the equivalent of rules in PowerTrack, but there are significant differences that you need to be aware of.
- With PowerTrack you have one set of rules for a product.
- With STREAM a filter runs across all data sources you have enabled.
- With PowerTrack all data for a product is delivered in one stream.
- With STREAM data is delivered per filter.
DataSift STREAM allows you to create any number of filters. You can deliver data from up to 200 filters in parallel at any one time.
You specify a filter using CSDL. CSDL is DataSift's own language for working with semi-structured data.
As a quick example of CSDL, here is a filter that selects only Tumblr content that is written in English and relates to fashion brands:
// Filter to Tumblr content in English
interaction.type == "tumblr" AND language.tag == "en"
AND (
    // Mentions of brands in content
    interaction.content contains_any "Calvin Klein, GQ, Adidas"
    OR
    // Content reblogged from brand blogs
    tumblr.reblogged.root.url contains_any "http://calvinklein.tumblr.com/,http://gq.tumblr.com/,http://adidasoriginals.tumblr.com/"
    OR
    // Content that links to brand websites
    links.domain in "calvinklein.com,gq.com,adidas.com"
)
Each condition in CSDL consists of a target (data field), an operator and a value (or set of values) to test against. You can combine any number of conditions using logical operators and group them using parentheses. Each data source has a set of targets that relate to the data it provides, for example Tumblr has this set of targets.
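The target, operator and value structure described above can be sketched as a small string builder. This is only an illustration: the helper names condition, any_of and all_of are hypothetical and are not part of any DataSift client library. The targets and operators are the ones used in the example filter. Because CSDL's escaping rules are not covered in this guide, the sketch rejects values that contain double quotes instead of guessing how to escape them.

```python
def condition(target: str, operator: str, value: str) -> str:
    """Render one CSDL condition: a target, an operator and a quoted value."""
    if '"' in value:
        # Escaping rules are not covered here, so refuse rather than guess.
        raise ValueError("values containing double quotes are not supported")
    return f'{target} {operator} "{value}"'


def any_of(*conditions: str) -> str:
    """Group conditions with OR inside parentheses."""
    return "(" + " OR ".join(conditions) + ")"


def all_of(*conditions: str) -> str:
    """Join conditions with AND."""
    return " AND ".join(conditions)


# Rebuild a shortened version of the fashion brands filter.
brands = any_of(
    condition("interaction.content", "contains_any", "Calvin Klein, GQ, Adidas"),
    condition("links.domain", "in", "calvinklein.com,gq.com,adidas.com"),
)
csdl = all_of(
    condition("interaction.type", "==", "tumblr"),
    condition("language.tag", "==", "en"),
    brands,
)
print(csdl)
```

Generating CSDL this way can be handy when you translate a large set of PowerTrack rules, since each rule maps to one or more calls to the helper.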
To translate your PowerTrack rules to a CSDL filter see our translating PowerTrack rules to CSDL guide.
Compiling and previewing a filter
Once you have defined a filter you need to compile it so that you can consume the data it selects. Compiling a filter validates its syntax and returns a hash that you use to identify the filter. You use the hash to run the filter and deliver the resulting data to your application.
Here we will show you how to compile and preview a filter using the dashboard. To learn how to do so using the API take a look at the getting started guide for your preferred language.
To compile a filter using the dashboard:
- Click on Filters in the top menu.
- Click Create a New Filter.
- Enter a Name for your filter and select CSDL Code Editor as your choice of editor.
- Click Start Editing.
- In the editor area enter the CSDL for your filter.
- Click Save & Close to validate and compile the filter.
- Click Consume via API to see the hash for your filter.
- Go back a page and click Live Preview to see a preview of the data you would receive if you ran the filter.
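If you would rather compile through the API than the dashboard, the request might be prepared along these lines. This is a rough sketch using only the Python standard library, not one of the supported client libraries. The base URL, the /compile path, the csdl parameter name and the Authorization header format are all assumptions here, so check them against the API documentation before relying on them. The sketch builds the request without sending it.

```python
import urllib.parse
import urllib.request

# Assumed base URL for the REST API; confirm it in the API documentation.
API_BASE = "https://api.datasift.com/v1"


def build_compile_request(csdl: str, username: str, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request that submits CSDL for compilation."""
    # Assumption: the endpoint accepts the CSDL in a form field named "csdl".
    body = urllib.parse.urlencode({"csdl": csdl}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/compile",
        data=body,
        # Assumption: credentials are sent as "username:apikey".
        headers={"Authorization": f"{username}:{api_key}"},
        method="POST",
    )


# Placeholder credentials; use the values from your dashboard home page.
compile_request = build_compile_request(
    'interaction.type == "tumblr"', "your_username", "your_api_key"
)
print(compile_request.get_method(), compile_request.full_url)
```

Passing the prepared request to urllib.request.urlopen would send it. The response should include the filter hash described above.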
Delivering data to your application
DataSift STREAM provides a number of options to deliver data to your application.
With PowerTrack you receive data using the related streaming API. We also provide a streaming API to deliver data. Data is delivered as soon as it has been processed by the platform, so it is well suited to use cases that require real-time data.
In addition we offer a range of Push connectors. Push connectors help you build more robust applications as they guarantee data delivery. Data is buffered for up to 60 minutes and delivery is repeated until we have acknowledgement that you've received the data. You can choose from a wide range of connectors that deliver data directly to common destinations including MySQL, AWS S3 and MongoDB. You can use the Pull connector to pull data from the platform if your application is behind a strict firewall.
You can see the streaming API in action by running a curl command. You need to specify the filter you'd like to stream data from (using the hash you received from compilation) and provide your API authentication details.
curl https://stream.datasift.com/[filter hash] -H 'Auth:[username]:[apikey]'
If you compile a filter using the dashboard you can copy the curl command from the Consume via API screen.
To see how to access the streaming API using one of our client libraries take a look at the getting started guide for your preferred language.
As the streaming API is equivalent to Gnip's delivery mechanism, this will be the quickest way for you to get started. When time allows, we recommend you consider switching to Push delivery.
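For comparison, here is a minimal Python sketch of the same request the curl command makes, using only the standard library. The URL and the Auth header mirror the curl example above. The function name build_stream_request and the placeholder values are illustrative only. The request is built but not opened.

```python
import urllib.request

# Host taken from the curl example above.
STREAM_BASE = "https://stream.datasift.com"


def build_stream_request(filter_hash: str, username: str, api_key: str) -> urllib.request.Request:
    """Prepare (but do not open) a streaming request for one compiled filter."""
    return urllib.request.Request(
        f"{STREAM_BASE}/{filter_hash}",
        # Same header as in the curl example: Auth: username:apikey
        headers={"Auth": f"{username}:{api_key}"},
    )


# Placeholder values; substitute your own hash and credentials.
stream_request = build_stream_request("your_filter_hash", "your_username", "your_api_key")
print(stream_request.full_url)
```

Opening the request with urllib.request.urlopen would start the stream. For production use, the supported client libraries are the safer choice, as they cover all platform features.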
You are charged for the amount of data you consume from the platform and the number of DPUs (DataSift processing units) your filters cost to run. Both of these charges are only applied for running filters. You can read more about charges on our billing page.
Maintaining your solution
Once you have your solution deployed, it is important that you monitor your usage of the platform and maintain your filters as your requirements change.
It is inevitable that you will need to update the CSDL for your filters as you gain more customers and your existing customers' requirements change. You can react to changes by adjusting the CSDL for your filter, recompiling the filter and switching data delivery to consume from the new filter. To learn more take a look at our blog post on updating filters on-the-fly.
You can monitor your usage of the platform using the Usage Statistics page in the dashboard. You can also hit the /usage API endpoint to get the same details through the API. When you update your filters you can use the /dpu endpoint to calculate the cost of your updated filter.
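As a rough illustration, requests to the /usage and /dpu endpoints could be prepared like this with the Python standard library. The endpoint names come from the paragraph above. The base URL, the Authorization header format and the hash query parameter for /dpu are assumptions, so confirm them in the API documentation. The requests are built but not sent.

```python
import urllib.parse
import urllib.request
from typing import Optional

# Assumed base URL for the REST API; confirm it in the API documentation.
API_BASE = "https://api.datasift.com/v1"


def build_get_request(
    endpoint: str, username: str, api_key: str, params: Optional[dict] = None
) -> urllib.request.Request:
    """Prepare (but do not send) an authenticated GET request to an API endpoint."""
    url = f"{API_BASE}/{endpoint}"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    # Assumption: credentials are sent as "username:apikey".
    return urllib.request.Request(url, headers={"Authorization": f"{username}:{api_key}"})


# Account usage, the same details as the Usage Statistics page.
usage_request = build_get_request("usage", "your_username", "your_api_key")

# DPU cost of an updated filter; "hash" as the parameter name is an assumption.
dpu_request = build_get_request(
    "dpu", "your_username", "your_api_key", {"hash": "your_filter_hash"}
)
print(usage_request.full_url)
print(dpu_request.full_url)
```

Checking the DPU cost before switching delivery to a recompiled filter helps you avoid unexpected charges.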
Now that you have migrated your solution, take a look at some of the features DataSift STREAM offers that can make your solution even better:
- Data sources - Add more data sources to your solution
- Augmentations - Add further enrichments to data
- CSDL language - Write more precise filters
- Tumblr targets - Filter against further fields in the Tumblr data stream
- Classifying data - Add tags and scores to data to give it extra value
- Push delivery - Consider moving to push delivery to help you build a more robust solution