STREAM is a DataSift product which lets you build applications that consume and analyze data from social, blog, and news sources.
With STREAM you access data from a number of sources, select augmentations to enrich the source data, write filters to select and classify the data, and then deliver the data to your application.
Accessing data sources
With DataSift STREAM you can choose from a number of data sources. Sources are split into two groups: public sources and managed sources.
Public sources are feeds from social networks (for example Tumblr), blogging platforms (for example WordPress), and news providers (such as LexisNexis). Data from these sources is streamed into the platform, in most cases in real time.
Managed sources are data sources you configure to access social networks (such as Facebook) on your behalf using your account credentials. When you have configured a managed source the platform will manage polling for new data from the network for you. The data ingested into the platform is private to you and gives you the same perspective as if you were logged in to the social network yourself.
When you have selected and set up your chosen sources you can enrich, filter, classify, and deliver data from these sources simultaneously to your application.
Enriching data with augmentations
To help you get more from source data, the platform provides a number of augmentations. Augmentations enrich the raw data, adding value and normalizing the data so it is easier to work with.
Firstly, the interaction augmentation is applied to data from all sources. This augmentation normalizes data so that you can work with it consistently, regardless of the source it arrived from.
Secondly, you can choose from a range of additional augmentations to apply to your data sources. For example the Links augmentation expands links to their final destination and gives you access to metadata for the link, such as the title of the page.
Applying augmentations to your data sources gives you richer data to work with during filtering and classification, and richer data to add to your end application.
Filtering data
The goal of data filtering is to select exactly the data you need from high-volume, noisy data sources for delivery to your application.
On the platform you create filters to select data from your selected sources using CSDL. CSDL is DataSift's own language for processing semi-structured streaming data.
With CSDL you have access to a wide range of operators, including advanced text operators. You can write anything from simple, powerful filters to filters of arbitrary complexity.
Each piece of data (a post, a comment, a blog entry, and so on) that arrives from a source is called an interaction. You write your filters to operate against targets, which are the data fields of the interactions. For example, if you have access to the Tumblr data source you could filter to posts that contain certain keywords, or posts that have been reblogged from a specific source blog. You can combine many conditions to build very precise filters.
You can also reference metadata added by augmentations in your filters. For example you could apply the Links augmentation and filter for only interactions that contain links to a specific domain. The interaction augmentation allows you to write one filter which works across all of your chosen data sources.
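As an illustration, a filter combining these ideas might be written in CSDL along the following lines (the keyword list and domain are placeholders, and the exact augmentation target names available depend on your chosen sources, so check the target reference):

```
// Match interactions whose normalized text contains any of these keywords...
interaction.content contains_any "coffee, espresso, latte"
// ...and which, with the Links augmentation applied, contain a link
// that resolves to a particular domain
and links.domain == "example.com"
```

Because the first condition uses the normalized interaction.content target, it applies uniformly to every source you have enabled.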
Interactions that match your filter are available to be delivered to your application.
Classifying data
Using classification you can add your own unique value to data before it is delivered to your application.
You classify data by adding classification rules to your filter. Classification rules are also written in CSDL. Any interactions that match your filter are processed through your classification rules, which add metadata to your delivered data.
You can use classification to:
- Tag key features of interactions - For example, you could tag interactions that mention brands and products.
- Run machine-learned classifiers - For example, you could train a machine-learned classifier to classify an author's intent to buy a product.
You can also make use of our off-the-shelf classifiers to add extra value to your data.
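As a sketch of how this fits together, classification rules are tag declarations that sit alongside a return clause containing your filter; the tag names and keywords below are purely illustrative:

```
// Tag interactions that mention a (hypothetical) brand
tag "brand.acme" { interaction.content contains "Acme" }

// Tag interactions that suggest an intent to buy
tag "intent.purchase" { interaction.content contains_any "buy, order, purchase" }

// The return clause holds the filter itself; matching interactions
// are then run through the tag rules above
return {
  interaction.content contains_any "Acme, widget"
}
```

Interactions delivered to your application then carry any matching tags as metadata alongside the original fields.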
Delivering data to your application
DataSift STREAM provides a number of options to deliver data to your application.
Our range of Push connectors helps you build robust solutions. When you use a Push connector, we buffer up to 60 minutes' worth of data for you to help guarantee data delivery. You can choose from a wide range of connectors that deliver data directly to common destinations including MySQL, AWS S3, and MongoDB. You can use the Pull connector to pull data from the platform if your application is behind a strict firewall.
If you'd prefer to receive data in real time (for applications where timing is critical) you can use the streaming API. Data will be delivered to your application as soon as it has been processed by the platform.
Get started with STREAM
To get started with STREAM take a look at our quick start guides.