Blog posts in Announcements

By Richard Caudle

Introducing Keyword Relationship Models

Identifying and expanding on keywords and terms is a key challenge when filtering, classifying and analyzing text data. We're always looking at how we can make this challenge easier. One area we've been researching is finding relationships between words using word2vec.

Today we've released a tool which allows you to explore relationships between words, along with our first Keyword Relationship Model. The model represents over three million unique terms in a 400-dimensional space, and was built by processing almost four billion social interactions in English posted during March and April 2015.

The model is extremely useful as it reflects how language and keywords are really used by people. You can use the model (via the explorer tool) to improve your filters, classification rules and analysis by making sure you include the key terms, phrases and misspellings used around your topic and keep track of them as they evolve over time. Similarity between terms, rather than being statically defined based on a structured taxonomy, is calculated by analysis of how people express themselves on social networks.

In the tool you can interact with the model by exploring terms, their similarity and how they relate to each other. You can start with any word (formal or informal, including misspellings and neologisms), as long as:

  • It is composed of a single term (only unigrams are currently included in the model)
  • It appears at least forty times in the dataset

For example you could try exploring:

  • Common terms - thanks, wish, love
  • Companies - starbucks, nikon, bmw
  • Products - f150, coke, applewatch
  • People - obama, #ericschmidt, #nicolascage
  • Places - london, france, beach
  • Events - #oscar, eurovision, superbowl
  • Occasions - holiday, wedding, birthday
  • Hashtags - #data, #food, #fail

Note that punctuation has been removed, so to explore “f-150” you need to specify the term “f150”.

After you choose an initial seed term, the top 10 related words are shown, sorted by similarity: the first term in the list is the most similar to the seed and the last is the least similar.

You can then click on any of the related terms to see the terms it relates to, and so on, allowing you to explore the model to many levels of depth. At any point you can hover over a word to see the similarity (on a scale of -1 to 1) between that word and the initial seed.
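Under the hood, queries like these are nearest-neighbour lookups over word vectors. As a rough illustration of the idea (not our implementation), here's a minimal sketch using the open-source gensim library, assuming gensim 4.x and a toy corpus:

from gensim.models import Word2Vec

# Toy corpus; in practice the corpus is billions of social interactions
sentences = [["i", "love", "coffee"],
             ["i", "love", "starbucks"],
             ["coffee", "at", "starbucks"]]

# Train a small model (our model uses 400 dimensions and 3M+ terms)
model = Word2Vec(sentences, vector_size=400, min_count=1)

# Top 10 related terms, sorted by cosine similarity (scale -1 to 1)
for term, similarity in model.wv.most_similar("coffee", topn=10):
    print(term, similarity)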

You can read more here about how we built the model using word2vec.

 

By Richard Caudle

Open Data Processing for Twitter - Now Available

Following Twitter's announcement that it will terminate its partnership with DataSift, we've been working hard to build a solution to help our customers fill the gaps left by the transition to Gnip. We've now released an open-source connector that will take data from the Gnip API and feed it into the DataSift platform. Hopefully this component will make your transition as pain-free as possible.

DataSift's open-source connector

Recently Twitter terminated their partnership agreement with DataSift, meaning that as of August 13th 2015, companies will need to license Twitter data directly from Gnip. Although Gnip can provide the data, its platform does not give you the same processing features as the DataSift platform. We've outlined the gaps in a previous post.

To make your transition as smooth as possible, and to let you continue using our platform features, today we've released an open-source connector that takes data from the Gnip API and delivers it into our platform.

The new “DataSift Connector” is:

  • Highly configurable, allowing you to tailor it to your exact solution
  • Pre-configured and easy to deploy to common hosting platforms, requiring minimal setup for common scenarios
  • Able to handle tweets, retweets and deletes

When data from Gnip is ingested the DataSift platform transforms the schema to match the Twitter schema you've been working with until now, so that minimal changes (if any) are required for your existing filters and classifiers.

Note that the connector doesn't currently support historic data integration; we plan to add this support very soon.

How does it work?

The connector is a collection of components for bringing Gnip data into our platform; you can configure and tailor them, and use all of them or just a subset.

The connector has been designed this way because it forms part of our future roadmap for letting you ingest any data into our platform. We'll talk more about this very soon!

The key components of the connector are:

  • Gnip reader - Connects to the Gnip PowerTrack API, receives streaming data and adds it to the buffer
  • Buffer - Buffers data between the reader and writer, handling bursts of data and ensuring no data loss
  • DataSift writer - Connects to the DataSift API, uploading data from the buffer into our platform
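To make the flow concrete, here's a highly simplified sketch of the reader/buffer/writer pattern in Python. It is illustrative only; the real connector is on GitHub and the names here are invented:

import queue
import threading

# The buffer absorbs bursts between the reader and the writer
buffer = queue.Queue(maxsize=10000)

def gnip_reader():
    # The real reader consumes the Gnip PowerTrack streaming API;
    # here we fake a stream of activities
    for i in range(100):
        buffer.put({"id": i, "body": "..."})  # blocks when full, so nothing is dropped

def datasift_writer():
    # The real writer batches data to the DataSift ingestion endpoint
    while True:
        activity = buffer.get()
        print("uploading", activity["id"])
        buffer.task_done()

threading.Thread(target=datasift_writer, daemon=True).start()
gnip_reader()
buffer.join()  # wait until everything buffered has been written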

Out-of-the-box we provide configurations that combine these components to give you an immediate solution. You can of course take our code and modify it for your needs, using or replacing each component as you see fit.

For instance you might want to modify the code to write to your own metrics collector, or replace the Gnip reader with your own. This is useful if you are storing the Gnip data in MongoDB or similar - all you'd have to do is build a MongoDB reader and ensure it sends data to the buffer.

Using the connector in your solution

We've invested time in making the connector easy to test, configure and deploy. 

For testing and deployment the solution includes:

  • Packer configurations - Deploy the solution using the Packer configurations we supply to build EC2 AMIs and Docker images.
  • Vagrant support - We supply a Vagrantfile so that you can clone the code repository and spin up a local instance for testing.

Of course you'll want to monitor the connector when you have it running in production. To help you do so the solution includes:

  • Metrics - The solution logs metrics to statsd which you can easily integrate into your infrastructure
  • Grafana dashboard - A default dashboard configuration for monitoring key metrics

To support a robust production solution, the DataSift ingestion endpoint deduplicates incoming data, so you can run multiple instances of the connector for redundancy.

You might consider running an instance of the connector which dumps data to the buffer, then writing the data from the buffer into a data store for safe keeping.

Using ingested Twitter data in filters and classifiers

When you ingest data from Gnip it will be transformed to match the current Twitter schema you've been using in your filters to date.

The only difference you'll see is that the interaction.type target will be set to 'twitter_gnip'.

So if you have filters that specifically match Twitter data using this target, you'll need to update them to instead say:

interaction.type == "twitter_gnip"

Twitter data from Gnip does include additional fields that aren't part of the DataSift Twitter schema. These fields are maintained through the processing pipeline; you can find them under the unmapped collection at the top level of the JSON output for the interaction.
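As an illustration of the output shape (the field names here are hypothetical):

{
    "interaction": { "type": "twitter_gnip", "content": "..." },
    "twitter": { "text": "..." },
    "unmapped": { "a_gnip_only_field": "..." }
}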

Migration timeline

Hopefully the connector can go a long way towards completing your migration.

We recommend you start testing the new connector right away and get in touch with your feedback. You can test the connector with all of our platform features. Remember though that the ingestion endpoint is currently rate-limited to 37MB per hour, enough to ingest 18,000 tweets an hour (or 5 a second).

In early July we will raise the rate limits so that you can test the solution with production data volumes.

Before the 13th August you will need to have switched to Gnip to receive Twitter data.

Getting started

You can get started with the connector immediately. The code for the connector is available on GitHub here.

To enable access to the ingestion endpoint get in touch with your sales representative.

 

By Richard Caudle

Transitioning to GNIP: Feature Gaps vs DataSift

Twitter has terminated their partnership agreement with DataSift, meaning that as of August 13th 2015, companies will need to license Twitter data directly from GNIP. 
 
From our analysis, 80% of our customers leverage capabilities which are absent in GNIP. The goal of this post is to highlight the main functional areas that will be impacted by the transition to GNIP, so that our customers can evaluate the features they may need to deprecate from their products, or identify the additional development work they will need to fill these gaps.

Innovations Deprecated: What Will I Miss?

If you're a current DataSift customer consuming Twitter data, there are features you may be relying on which (to the best of our knowledge) are not available on the GNIP platform.
 
  • Filter Size & Complexity - DataSift allows you to create complex filters. To reproduce the same filtering you may need to break up your filters and perform your own second level of filtering after you have received the data.
  • Delivery & Integration - DataSift provides connectors that guarantee data delivery and are ready-made for integration with common data stores and applications. Without these connectors you'll need to build directly against the GNIP API and handle connection drops and backfill yourself.
  • Classification - DataSift VEDO allows you to tag, score and apply machine learning to data before it reaches your application. To the best of our knowledge there is currently no equivalent feature in GNIP.
  • Augmentations - DataSift provides rich augmentations such as for links, sentiment and demographics. GNIP provides a number of augmentations, but not for sentiment or demographics.
  • Filtering Features - GNIP restricts you to a set number of data fields and to fewer operators. You may need to perform your own second level of filtering once you receive the data.
  • Asian Language Support - DataSift provides Japanese and Chinese language tokenization; without it you are left with simple substring matching, which can be inaccurate.

 

Filter Size & Complexity

If you've been using DataSift for a little while you'll no doubt have created some complex filters for your projects.
 
It's common for our customers to need to filter by lists of user IDs or names, for example:
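An illustrative sketch of such a filter (the IDs and account names here are arbitrary examples):

twitter.user.id in "10228272, 15764644, 783214"
OR twitter.user.screen_name in "katyperry, justinbieber, taylorswift13"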
 
 
Or to use lists of include and exclude keywords and phrases:
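For example, along these lines (the keyword lists are illustrative):

interaction.content contains_any "galaxy s6, galaxy note, gear vr"
AND NOT interaction.content contains_any "giveaway, spoiler, sweepstake"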
 
 
With DataSift you can include 10,000s of IDs, words and phrases in a filter, enabling you to track long lists of users and be very precise with your content matching.
 
DataSift filters are individually limited to 1MB in size, but you can chain filters into each other using the stream keyword, in effect creating filters of limitless complexity and size.
 
GNIP limits you to 2048 characters for your filter. To reproduce this level of filtering complexity you will need to create many separate filters on GNIP to retrieve the same data, then post-process the data once you receive it to remove duplicates and build a combined data set. 
 

The inability to create complex include/exclude statements in your GNIP rules means:

  • Data costs are likely to increase, as the ability to filter with precision is reduced.
  • You will need to implement post-processing capabilities to reduce the potentially noisy data set you receive into a refined data set for analysis.
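As a sketch of that post-processing, deduplicating by tweet ID after fanning in several GNIP rules might look like this (illustrative Python, not production code):

seen_ids = set()

def deduplicate(tweets):
    # Drop copies of a tweet that matched more than one GNIP rule
    for tweet in tweets:
        if tweet["id"] not in seen_ids:
            seen_ids.add(tweet["id"])
            yield tweet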

Delivery & Integration

To make it easier to integrate Twitter data into your analytical applications, DataSift provides a number of Data Delivery Connectors for common data stores and applications, all of which buffer delivery for at least 1 hour and guarantee no data is lost in transit.
 
On top of this DataSift supports delivery of both tabular and hierarchical data structures. For stores such as MongoDB that are designed for JSON data, DataSift delivers data directly. For stores such as MySQL we allow you to map data to your tables, saving you from implementing this painful step.
 
Also, if you're sitting behind a highly secured firewall you can make use of the Pull connector, which buffers data for you until you request a delta set.
 
GNIP offers access to data through its APIs, but does not provide ready-made connectors. Instead you need to build against the API and map data to your destination yourself.
 
GNIP does offer a level of protection through its Backfill feature, but this provides 5 minutes of buffering rather than 60. GNIP also allows you to run multiple redundant connections to protect against connection loss, but you will need to detect connection drops, re-establish connections and deduplicate the data you receive yourself.

Classification (VEDO)

DataSift VEDO allows you to understand data by classifying it using tagging and scoring. VEDO allows you to classify data after it has been filtered so that you receive data marked up for your application, removing the need for you to post-process the data.
 
As a quick example, you can use VEDO to identify Dow Jones companies:
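A sketch of such a classifier (the company list is truncated for brevity):

tag.index "Dow Jones" {
    interaction.content contains_any "Apple, Boeing, Caterpillar, Chevron, Cisco"
}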
 
 
Or you can take things further by applying machine learning to data, to create your own emotion and sentiment classifiers trained to your domain.
 
There is no equivalent to VEDO when using GNIP. To achieve the same outcome you will need to post-process the data yourself once you've received it from the GNIP platform.

Augmentations

DataSift provides a rich set of augmentations which you can apply to your data. 
 
Augmentations are applied to data before it reaches filtering, allowing you to filter and classify against the augmented results. 

Link Pre-Processing

One of the most powerful augmentations DataSift offers is the Links augmentation.
 
The DataSift Links augmentation not only unravels the link, but also pulls back metadata such as the page's title and meta tags. For example a t.co link expands to:
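The shape of the augmented output is along these lines (simplified, with hypothetical values):

{
    "links": {
        "url": ["http://www.example.com/full-article"],
        "title": ["The full title of the shared page"],
        "meta": { "description": ["The page's meta description"] }
    }
}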
 
 
With DataSift you can therefore filter and classify against the titles of pages, Twitter Card metadata, Facebook Open Graph tags and a huge number of other data fields.
 
GNIP offers an augmentation to unravel links to their original location too. For example, a t.co link will be unravelled as follows:
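Roughly like this (simplified, with hypothetical values):

{
    "gnip": {
        "urls": [{
            "url": "http://t.co/AbC123",
            "expanded_url": "http://www.example.com/full-article"
        }]
    }
}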
 
 
The GNIP augmentation does not give you the same level of detail for links; the absence of page titles is the biggest gap. Using GNIP, for example, you cannot create a rule to return news stories shared on Twitter that mention “Apple” in the page title. Unfortunately, you cannot solve this problem with post-processing, so this is a feature you will have to deprecate from your application if you are using it.

Sentiment

Another powerful augmentation provided by the DataSift platform is sentiment analysis.
 
When a tweet arrives in the platform the augmentation attaches a sentiment score:
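For example, a score of this shape (the value is hypothetical; positive values indicate positive sentiment, negative values negative):

{
    "salience": {
        "content": { "sentiment": 6 }
    }
}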
 
 
You can also use VEDO to run a custom machine-learned classifier to produce a domain-specific sentiment score.
 
If you rely on sentiment scoring in your solution, there is currently no equivalent augmentation in GNIP. You will need to investigate 3rd party solutions to fill this gap.

Demographics

DataSift Demographics expands Twitter data to give you demographic details. Our Gender augmentation determines the author's sex and can be used with all tweets. 
 
Our Demographics product takes you much further, giving details such as age, profession, interests and brand alliances for anonymized tweets.
 
Demographics are extremely valuable when trying to understand your audience or target market. 
 
 
GNIP does not currently offer a demographics product. You would need to investigate 3rd party providers to fill this gap.

VEDO FOCUS

One of our recent product launches has been VEDO FOCUS. FOCUS classifies all Tweets in the Twitter firehose against a deep taxonomy of 450,000 categories. This taxonomy is constantly updated to include rapidly changing topics from breaking news to the latest music artists.
 
With FOCUS enabled, rather than spending time tuning and perfecting your filter, you can write extremely simple but powerful queries. FOCUS helps developers build more accurate filters much faster.
 
For instance to capture all discussion around the automotive industry:
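With FOCUS enabled your query reduces to something of roughly this shape (the target name here is hypothetical, purely for illustration):

focus.topic == "Automotive"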
 
 
There is currently no equivalent product when working with GNIP. You would need to investigate 3rd party providers to fill this gap.

Historic Archive

Both GNIP and DataSift offer historical Twitter data access, however the workflow for historic queries is significantly different.
 
DataSift allows you to run historic queries, using the same filtering & classification features you use for real-time stream access. The data is delivered to you through one of our push connectors, so is extremely easy to integrate. 
 
GNIP's historic access is limited by the same filtering and classification gaps as its real-time access. The resulting data is stored in a collection of files, each representing 10 minutes of the query, which you then need to download, combine and store in your database.
 
In addition, if you rely on augmentations such as sentiment and link expansion, this data is not available in the GNIP historic archive, because of the differences in the augmentations each platform provides.
 

Filtering Features

When working with DataSift you have access to a large range of data fields and CSDL operators to craft your filters and classifiers. 

Data Fields

When creating DataSift filters you can work with any data field in a tweet. A single tweet has a huge number of fields (over 100), all of which can be useful across use cases and all of which you can filter on. Using CSDL you can apply the same operators consistently across any of these fields.
 
For instance, maybe you'd like to filter to only tweets that have been retweeted at least 1000 times, with content in either the body of the tweet or in the title of a shared link. In DataSift you can say:
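A sketch of such a filter (the keyword list is illustrative):

twitter.retweet.count >= 1000
AND (
    interaction.content contains_any "electric car, self-driving"
    OR links.title contains_any "electric car, self-driving"
)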
 
 
The contains_any operator can be applied to any text field, and the twitter.retweet.count field is just one of many targets that are not available in GNIP.
 
GNIP does offer a fixed list of fields to work with, but the list is more limited.
 
When migrating to GNIP you will need to assess each of your filters to see which fields are available in GNIP. Any that are not, you will need to reproduce in post-processing once you receive the data in your application.

Geo Polygons

Geo polygons are a great way to restrict your filtering to an exact location. They take you beyond simple bounding-box or radius filtering, allowing you to define complex real-world locations, such as US states.
 
For example, you can define countries and states to filter for. This example filters precisely for tweets posted in California:
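A sketch of the idea, with a heavily simplified polygon (real state boundaries need many more vertices, and the coordinate syntax here is approximate):

twitter.geo geo_polygon "42.0,-124.4:42.0,-120.0:39.0,-120.0:35.0,-114.6:32.5,-117.1"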
 
 
GNIP supports geo box and radius searches, but not currently geo polygons.

Regular Expressions

The DataSift regular expression operator is very useful for defining advanced filters. With a regular expression you can move way beyond simple text matching.
 
For instance with a DataSift filter you can match cashtags with the following filter:
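A sketch of such a filter, matching cashtags like $AAPL or $MSFT:

interaction.content regex_partial "\$[a-zA-Z]{1,6}"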
 
 
GNIP does not support regular expressions. 

Case Sensitivity

The DataSift case sensitivity switch is a great feature when you need precise matching of content. 
 
For instance, when you're trying to match mentions of brands, case sensitivity is important:
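A sketch using the case-sensitivity switch, matching the phrase only when "Google" is capitalized:

interaction.content cs contains "Google it"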
 
 
All operators in GNIP are case-insensitive, so you cannot reproduce this feature, aside from filtering for 'google it' and post-processing the data you receive.

Wildcards

Wildcards are a great way to 'fuzzy match' content. Wildcards are supported on the DataSift platform using the wildcard operator.
 
As a quick example, wildcards are great when you're trying to match a commonly misspelled word.
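An illustrative sketch that matches both "government" and the common misspelling "goverment" (assuming the wildcard can match zero or more characters):

interaction.content wildcard "gover*ment"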
 
Or filter for variations of a word such as print.
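A sketch matching print, prints, printer, printing and so on:

interaction.content wildcard "print*"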
 
 
GNIP doesn't support wildcard matching. 
 

Tokenization for Japanese and Chinese

We're all keen to tap into a wide variety of markets for our analysis. Two key markets are Japan and China, which present a difficult technical challenge because of the structure of their languages.
 
As a quick example, the brand Samsung in Chinese is written as 三星. Because there are no spaces to break up words as there would be in English, the meanings of characters depend on the context of the surrounding characters. So performing a substring match on these characters would match both of these phrases:
 
我爱我新的三星电视!  (I love my new Samsung TV!)
我已经等我的包裹三星期了!  (I've been waiting three weeks for my parcel to arrive!)
 
The trick to working with these languages is tokenizing the content - in effect adding spaces where appropriate, based upon the surrounding content. In DataSift this is solved for you by built-in tokenization support, so you can build precise filters just as you would for European languages.
 
There is currently no equivalent tokenization in GNIP for these languages. You can try to achieve the same effect using substring matches, but this is tricky to get right and may take some time to perfect.
 

 

By Richard Caudle

DataSift PYLON - The value of unified data processing

Applying CSDL and VEDO to Facebook topic data

You might have seen our recent announcement that developers can now gain insights from Facebook topic data. No doubt you're eager to learn more! In this post we'll look at how easy it is to take CSDL you've fine-tuned for filtering data from networks such as Twitter or Tumblr, and apply it in PYLON not only for filtering but also for classifying Facebook topic data. This demonstrates the simplicity of using a unified data processing platform.
 
Before we jump in too deep, you're probably keen to know exactly what PYLON is. Let's give you a quick intro…

What is PYLON?

PYLON is a new API that enables a privacy-first approach to analyzing Facebook posts and engagement data. Using the PYLON API, for the first time you can build insights from Facebook's topic data.
 
With PYLON you can analyze the billions of posts and engagements that take place on Facebook every day, while respecting the privacy of individual people using Facebook.
 
The sources of aggregate and anonymized information that can be extracted from Facebook topic data can be broadly categorized as posts and engagement data:
 
  • Posts on pages: Content, Topics, Links and Hashtags
  • Engagement data: Comments, Shares, Likes

Privacy-First

PYLON gives you access to Facebook topic data, but it also protects the identity of individuals on Facebook. Personally identifiable information (PII) is never exposed.

  • You receive statistical summarized results, never individual posts.
  • A minimum audience threshold of 100 is applied to any analysis you perform, to protect against individual-level analysis.
  • Data is processed within Facebook’s own data centre. Raw data never leaves Facebook’s servers.
  • Interaction data is only available for analysis for up to 30 days, after which time it is deleted from the service.

So with PYLON you can generate insights that were not possible before, in a way that ensures privacy for Facebook users!

PYLON Workflow

If you've worked with DataSift before you'll be used to the Stream workflow:
 
  • Create filters in CSDL against your enabled data sources
  • Add classification rules to your filters using VEDO
  • Deliver raw, classified data to your chosen destination
 
PYLON is different: to maintain privacy, the output is delivered as summary results:
 
  • Create a filter that matches interactions on Facebook
  • Include classification rules to add value to the data
  • Record output from this filter into a private index
  • Submit analysis queries to the index to receive analysis results
 
 
The data you filter is recorded to a PYLON index only you have access to. You access your index using analysis queries which return summary statistical results from Facebook topic data.
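As a sketch of this workflow using our Python client library (the method names follow the client docs, but treat this as illustrative rather than definitive):

from datasift import Client

client = Client("your_username", "your_identity_api_key")

# Compile a CSDL filter for Facebook topic data
csdl = 'fb.topics.category == "Cars" AND fb.author.region == "California"'
compiled = client.pylon.compile(csdl)

# Record matching interactions into your private index
client.pylon.start(compiled["hash"], name="cars-california")

# Later: run an analysis query against the recorded index
result = client.pylon.analyze(
    compiled["hash"],
    {"analysis_type": "freqDist", "parameters": {"target": "fb.author.age"}})
print(result)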
 

Apply your CSDL expertise to Facebook topic data

Until now, as a DataSift customer you'll have been using our Stream product, which provides access to filter a variety of social data sources such as Twitter, Tumblr and blogs.
 
In this scenario, you'll have created filters, for example to capture tweets about popular mobile games:
 
(
    interaction.content contains_any "Game of War,Boom Beach,Pet Rescue,Candy Crush,Don't Tap,2048,FarmVille 2"
    OR interaction.content contains_near "Clash,Clans:5"
    OR interaction.content contains_near "Puzzle,Dragons:5"
    OR interaction.content contains_all "step,white,tile"
)
AND NOT interaction.content wildcard "cheat*"
AND NOT interaction.content wildcard "hack*"
 
Even if you're not familiar with CSDL syntax, you can see how easy it is to work across different data fields (which we call targets), make use of operators such as contains_any, and combine many conditions with boolean operators.
 

Applying CSDL to Facebook topic data

With PYLON just like Stream your first step is to create a filter for the interactions you'd like to record to your index.
 
The best thing about PYLON is that under the hood it uses exactly the same filtering engine to filter the Facebook topic data stream as is used to filter the Twitter firehose.
 
So let's take our example CSDL from above and use it to index Facebook topic data covering exactly the same criteria as we used with Twitter:
 
(
    interaction.content contains_any "Game of War,Boom Beach,Pet Rescue,Candy Crush,Don't Tap,2048,FarmVille 2"
    OR interaction.content contains_near "Clash,Clans:5"
    OR interaction.content contains_near "Puzzle,Dragons:5"
    OR interaction.content contains_all "step,white,tile"
)
AND NOT interaction.content wildcard "cheat*"
AND NOT interaction.content wildcard "hack*"
 
Yes, that's right - the CSDL is exactly the same! 
 
This works because common data fields in both sources are mapped to the interaction namespace. In this case we're making heavy use of interaction.content, which is the content of the post or tweet. And regardless of the data field you're filtering on, operators such as contains_any work in exactly the same way.
 
Of course, you can also utilise the unique data available from each network. For example, Facebook topic data contains audience demographics and topic data which are not present in Twitter. With these, you could expand your CSDL to focus your analysis on males living in California who are talking about cars:
 
fb.author.gender == "male" and fb.author.region == "California"
and fb.topics.category == "Cars"
 
With Facebook topic data you can filter on demographics, but what's even better is that PYLON ensures privacy for Facebook users as returned results are anonymous, aggregated summaries.
 

Classifying with VEDO

The power of CSDL doesn't stop at filtering to identify data for analysis. You can use the same language, operators and data fields to classify your data using VEDO.
 
Classifying data this way is particularly important when using PYLON as privacy guards prevent you from accessing raw data. By using VEDO with PYLON you can add extra value to the data before it is saved in your index, and then make use of this extra data when submitting your analysis queries.
 
Let's build on our example above by adding tags to data that has been filtered. We can do so by adding tag rules to our filter:
 
tag.game "Candy Crush" { interaction.content contains "Candy Crush" }
tag.game "Game of War" { interaction.content contains "Game of War" }
tag.game "Boom Beach" { interaction.content contains "Boom Beach" }
tag.game "Pet Rescue" { interaction.content contains "Pet Rescue" }
tag.game "Don't Tap" { interaction.content contains "Don't Tap" }
tag.game "2048" { interaction.content contains "2048" }
tag.game "FarmVille 2" { interaction.content contains "FarmVille 2" }
tag.game "Clash of the Clans" { interaction.content contains_near "Clash,Clans:5" }
tag.game "Puzzles & Dragons" { interaction.content contains_near "Puzzles,Dragons:5" }
tag.game "Don't Step On The White Tile" { interaction.content contains_all "step,white,tile" }
 
Again this code can be run on Twitter data using the Stream product or against Facebook data in PYLON. 
 
We can take things further by applying machine-learned classifiers. For instance, we could create a custom sentiment classifier, then apply it to both Facebook topic data and Twitter.
 

Analyzing Facebook topic data

Finally, to complete our workflow, let's take a quick look at submitting analysis queries in PYLON. As you cannot access the raw data, this is how you get insights from the data you've recorded to your index.
 
When you submit a query to PYLON you specify the data field you'd like to analyze, what analysis to perform and which segment of the data in your index to analyze.
 
Here you get to use your CSDL skills once more, as CSDL is used to filter the data in your index into precise segments. You can drill down deeper and deeper to get very detailed insights.
 
As a quick example we could start by analyzing the age groups in our entire dataset as a frequency distribution. In PYLON you specify the following arguments:
 
{
    "analysis_type": "freqDist",
    "parameters":
    {
        "target": "fb.author.age"
    }
}
 
Using CSDL we can take this further, selecting a precise subset based upon demographics and the tags we introduced in our example. To do so we simply specify a filter when making the analysis request:
 
interaction.tag_tree.game == "Candy Crush" AND fb.author.gender == "female"
AND fb.author.region == "UK"
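Putting the pieces together, the analysis request then looks something like this:

{
    "analysis_type": "freqDist",
    "parameters":
    {
        "target": "fb.author.age"
    },
    "filter": "interaction.tag_tree.game == \"Candy Crush\" AND fb.author.gender == \"female\" AND fb.author.region == \"UK\""
}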
 
Immediately this gives us some powerful insights, breaking down players by age group:
 
 
This is just a taste of how powerful PYLON and VEDO are when analyzing Facebook topic data! In future posts we'll look at filtering, classification and analysis queries in much more depth.

 

By Richard Caudle

New Community Site Launched!

Things change very fast at DataSift, and it can be hard to keep up.
 
This week we've released our new community site at community.datasift.com.
 
 
 
The community site hosts our forums going forward. It's the best place to ask questions, which will be answered by our staff and other developers, and you can also leave feedback and make suggestions to our team.
 
Another role of the community site is to keep you better informed of platform changes and to build a community around our platform. The Announcements category will be used to keep you up to date with platform changes, and soon we will be hosting events for developers, which will also be announced to the community.
 
Of course if you have a support package our Support team is always on hand too.

Subscribing to Announcements

To subscribe to new posts in any category (although Announcements will probably be the most interesting to get you started), you need to do the following:

  • Visit community.datasift.com
  • Click Log In in the top right corner
  • The new site has SSO integration with dev.datasift.com. If you already have an account, sign in; if not, register for a developer account.
  • Go to the Announcements category 
  • Click the top right dropdown and choose Watching

 
 
