Blog

Richard Caudle
Updated on Tuesday, 10 June, 2014 - 10:45
It would be easy if the world around us was static, but in practice things are always changing. Nowhere is this truer than in the world of social networks; users are constantly following new friends and expressing new thoughts. The filter you wrote yesterday is probably already out-of-date! 
 
On the DataSift platform you can update your filters on the fly via the API and avoid downtime for your application. This not only allows you to adapt to real-world changing scenarios, but in fact allows you to build much more powerful, dynamic social solutions. In this post I'll show you how this can be done.
 

Embracing Change

If you've ever built a software solution you'll know that things aren't quite as simple as you'd hope. The real world is always changing. 
 
For example imagine you're tracking conversation around a news story. You build a simple filter which looks for the terms and personalities involved in the story. This works great, but a few hours later the story has evolved. As stories evolve it is inevitable that the terms people are using to discuss it change. You'll need to react to this without missing any conversations.
 
Or, maybe you've built an awesome social app that allows users to input their interests and you're creating a filter from input. The next day the user updates their interests. You'll need to update your filter to the new interests without interrupting your service to the user.
 
A well-designed solution takes change in its stride.
 

Broad Brush Overview

Ok, so we want to build our wonderfully dynamic, super-duper-flexible social solution. What does this mean in practice? On the DataSift side of things, we want to be able to update our stream definitions (filtering and tagging) on the fly, delivering data to the same destination, without missing any data.
 
Before we get to the deeper details, the broad principles are:
 
  • Create V1 of our stream: Build V1 of our stream definition, for instance from user input
  • Start consuming V1: Compile and stream V1 of our stream as usual via the API
  • Create V2 of our stream: Something has changed! Build V2 of our stream to adapt.
  • Start consuming V2: In parallel with streaming V1, we'll start streaming V2 of our stream.
  • Stop consuming V1: When we're happy V2 is streaming nicely, we'll stop streaming V1.
Essentially to avoid downtime (or missing data) we have a brief period where we're streaming both versions in parallel. Note we will need to handle de-duplication during this brief period. 
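The five steps above can be sketched as a small orchestration routine. This is only a sketch against a hypothetical client object (compile, subscribe and unsubscribe are illustrative names, not the actual helper-library API); the point is the ordering of the calls:

```python
def switch_stream(client, old_hash, new_csdl):
    """Move consumption to a new definition without downtime:
    compile V2, start consuming it, and only then stop V1."""
    new_hash = client.compile(new_csdl)   # Create V2
    client.subscribe(new_hash)            # Start consuming V2 (in parallel with V1)
    client.unsubscribe(old_hash)          # Stop consuming V1
    return new_hash


# A fake client that records calls, to make the ordering visible.
class RecordingClient:
    def __init__(self):
        self.calls = []

    def compile(self, csdl):
        self.calls.append(("compile", csdl))
        return "hash-v2"

    def subscribe(self, stream_hash):
        self.calls.append(("subscribe", stream_hash))

    def unsubscribe(self, stream_hash):
        self.calls.append(("unsubscribe", stream_hash))
```

Because V1 and V2 overlap briefly, the consumer on the receiving side still needs to de-duplicate, as discussed above.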
 

Let's Do It

Ok, so that's the principles explained. Let's see this in practice.
 
I wrote a stream last week to track conversations around popular games. Let's use this as an example. 
 
(For the complete example code take a look at this GIST.)
 

Create Stream V1

Version 1 of our stream will look for mentions of five popular games: 2048, Farmville 2, Swamp Attack, Trials Frontier and Don't Step The White Tile.
 
Note this is a simple illustrative example. In practice you might want to look for mentions by inspecting links being shared for instance.
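Since filters like this are often generated from data (user input, or a game-chart API), a minimal sketch of building the V1 CSDL as a plain string might look like this. The use of interaction.content as the target is an assumption for illustration:

```python
GAMES_V1 = ["2048", "Farmville 2", "Swamp Attack",
            "Trials Frontier", "Don't Step The White Tile"]

def build_filter(games):
    # contains_any takes a single comma-separated list of terms.
    terms = ", ".join(games)
    return 'interaction.content contains_any "{}"'.format(terms)
```

Regenerating the definition from the same list later (with a game added or removed) gives you the V2 stream to switch to.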
 
 

Start Consuming V1

Now that we have our stream defined, we can compile the definition and start consuming data. In this example we'll use the Pull destination to get resulting data.
 
For this example I'll use the Python helper library.
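The consuming loop itself is simple. Here is a sketch with the transport injected as a callable, so the control flow is clear; the real helper library wraps the HTTP details of the Pull endpoint for you:

```python
def drain_pull(fetch_page, process):
    """Repeatedly fetch pages of interactions until the Pull buffer
    is empty, handing each interaction to `process`."""
    total = 0
    while True:
        page = fetch_page()  # e.g. a wrapper around a Pull API request
        if not page:
            break
        for interaction in page:
            process(interaction)
            total += 1
    return total
```

In a real consumer you would call this on a schedule, since the Pull buffer refills continuously.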
 
 

Create Stream V2

We're now happily consuming data. But wait! There's a new game that's entered the charts that we must track. The game is Clash of the Clans, and it must be added to our filter.
 
It's easy to imagine you could generate such a filter from an API which gives you the latest game charts.
 
The updated filter looks as follows (notice the use of the contains_near operator to tolerate missing words from the title):
 
 

Start Consuming V2

The next step is to start streaming V2 of the stream in parallel with V1. 
 
 

De-duplicating Data

We now have two streams running in parallel. Until we stop stream 1 there's a good chance that the same interaction might be received on both streams, so it's important we de-duplicate the data received. 
 
How you go about this depends entirely on the solution being built. Whichever way you choose, you can use the interaction.id property of the interaction as a unique identifier. One option is a unique key in a database (if this is where your data is being stored); another simple option is a rolling in-memory list of IDs, say for the last five minutes. The right choice depends on the volume of data you expect and the scale of your solution.
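The rolling in-memory approach can be sketched in a few lines. This keeps interaction IDs for a fixed window (five minutes here) and drops anything seen twice; the clock is injectable purely to make it testable:

```python
import time
from collections import OrderedDict

class Deduplicator:
    def __init__(self, window_seconds=300, clock=time.time):
        self.window = window_seconds
        self.clock = clock
        self.seen = OrderedDict()  # interaction.id -> time first seen

    def is_new(self, interaction_id):
        now = self.clock()
        # Evict IDs older than the window (oldest entries are first).
        while self.seen and next(iter(self.seen.values())) < now - self.window:
            self.seen.popitem(last=False)
        if interaction_id in self.seen:
            return False
        self.seen[interaction_id] = now
        return True
```

Each interaction received from either stream is passed through is_new() and processed only when it returns True.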
 

Stop Consuming V1

Now that we have started streaming V2 of the stream we can stop consuming data from V1. 
 
When you start the second stream it will start immediately. However, if you want to be doubly sure that you do not miss any data we recommend that you wait for the first interaction from stream V2 to be received before stopping stream V1. Note that the platform will charge you for DPUs consumed and data received for each stream individually.
 
 

In Conclusion

And so ends my quick tour. I hope this post illustrates how you can switch to new stream definitions on the fly. This capability is likely to be key to real-world solutions you create, and hopefully inspires you to create some truly responsive applications.
 
For the complete example code take a look at this GIST.
 
To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed
 
Or follow us on Twitter at @DataSiftDev.

 

Jason
Updated on Tuesday, 6 May, 2014 - 11:18
Taking into account some great customer feedback, on May 1st, 2014 we released a number of minor changes to our Facebook Pages Managed Source. 
 

Potential Breaking Changes

Facebook Page Like and Comment Counts have been Deprecated

The facebook_page.likes_count and facebook_page.comment_count fields have been deprecated from DataSift's output. We found this data became outdated quickly; a better practice for displaying counts of likes and comments in your application is to count like and comment interactions as you receive them. 
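A rolling tally along those lines might look like this. The exact interaction shape used here is an assumption for illustration:

```python
from collections import Counter

def tally(interactions):
    """Count like and comment interactions per page as they arrive,
    instead of relying on the deprecated count fields."""
    counts = Counter()
    for i in interactions:
        fb = i.get("facebook_page", {})
        subtype = fb.get("type")
        page_id = fb.get("page", {}).get("id")
        if subtype in ("like", "comment"):
            counts[(page_id, subtype)] += 1
    return counts
```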
 

Format for facebook_page.message_tags has Changed

facebook_page.message_tags fields were previously in two different formats, depending on whether they came from comments or posts. This change ensures that all message_tags are provided in a consistent format: a list of objects. An example of the new format can be seen below:
 
 
Please ensure that if your application utilizes these fields, it can handle them as a list of objects.
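If your application also has to cope with data stored before this change, a small normalizer can accept both shapes. The old offset-keyed dict layout shown here is an assumption based on the two formats described above:

```python
def normalize_message_tags(message_tags):
    """Return message_tags as a flat list of tag objects, whether the
    input is the new list format or the old dict-keyed format."""
    if message_tags is None:
        return []
    if isinstance(message_tags, list):
        return message_tags
    if isinstance(message_tags, dict):
        tags = []
        for value in message_tags.values():
            tags.extend(value if isinstance(value, list) else [value])
        return tags
    return [message_tags]
```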
 
 

New Output Fields

We have introduced a number of new output fields in interactions from the Facebook Pages Managed Source. You will be able to filter on many of these fields.
 

New “Page Like” Interactions

By popular request, we have introduced a new interaction with the subtype “page_like” for anonymous page-level likes.
This should now allow you to track the number of likes for a given page over time.
 
 
This subtype has two fields, `current_likes` and `likes_delta`. The first is the number of likes for a Facebook Page at the time of retrieval; the second is the difference from the previously retrieved value. We only generate interactions of this type when `likes_delta` is non-zero. Also note that `likes_delta` can be negative, when unlikes outnumber likes between two retrievals.
 
This interaction type should allow visualizing page likes as a time series. In addition, filters on `likes_delta` could be used to detect trending pages.
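Rebuilding a likes-over-time series from these interactions is then straightforward; a sketch:

```python
def likes_series(page_like_interactions):
    """The ordered current_likes readings form the time series."""
    return [i["current_likes"] for i in page_like_interactions]

def net_change(page_like_interactions):
    """Summing the deltas gives the net change over the period."""
    return sum(i["likes_delta"] for i in page_like_interactions)
```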
 

‘from' Fields now Include a Username Where Available

Where it is provided to us, .from fields in Facebook Pages interactions now contain a .username field.
 
 
Please note that in some cases, this field is not returned by Facebook.
 

New Comment ‘Parent' Field

Objects of type comment include an optional .parent object, which contains a reference to a parent comment. The object structure is self-similar.
 
This will allow you to tell whether comments are nested or not, and associate them with a parent comment if so.
 
 

New ‘From’ Field in Post Objects

Objects of type comment/like include an additional .from field in their .post context object, which contains information about the author of the post they are referring to.
 
 

New CSDL Targets

We have introduced 12 new Facebook Pages targets. This includes targets to allow you to filter on the likes count of a page, the parent post being commented on, a Facebook user's username, and more. These new targets can all be found in our Facebook Pages targets documentation.
 

Other Changes

New Notifications for Access Token Issues

If all tokens for a given source have permanent errors, the source will become "disabled" and you will receive a notification. You should then update the source with new tokens and restart it. 
 
Note that every error will also be present in the /source/log for that Managed Source.
 

Summary of Changes

  • facebook_page.likes_count and facebook_page.comment_count fields will be deprecated from DataSift's output
  • The facebook_page.message_tags output field format is changing to become a list of objects
  • We are introducing a new interaction with the subtype “page_like” for anonymous page-level likes
  • .from fields in Facebook Pages interactions now contain a .username field where available
  • Comment interactions will now include a parent object, referencing the parent comment
  • We are introducing a .from field to Facebook Pages .post objects, containing information about the post author
  • We are introducing a number of new CSDL targets for Facebook Pages
  • You will receive better notifications about issues with your Facebook Access Tokens
 
Richard Caudle
Updated on Thursday, 1 May, 2014 - 09:54
This is a quick post to update you on some changes we've introduced recently to help you work with our platform and make your life a little easier.
 

Filtering On Content Age

We aim to deliver you data as soon as we possibly can, but for some sources there can be a delay between publication to the web and our delivery which is out of our control.
 
In most cases this does not have an impact, but in some situations (perhaps you only want to display extremely fresh content to a user) this is an issue.
 
For these sources we have introduced a new target, .age, which lets you specify the maximum time since the content was posted. For instance, to filter on blog posts mentioning 'DataSift' while only receiving posts published within the last hour:
 
blog.content contains "DataSift" AND blog.age < 3600
 
This new target applies to the Blog, Board, DailyMotion, IMDB, Reddit, Topix, Video and YouTube sources.
 

Push Destinations - New Payload Options

Many of our customers tell us they can take much larger data volumes from our system. We aim to please, so we have introduced options to help you get more data, faster.
 

Increased Payload Sizes

To enable you to receive more data more quickly from our push connectors, we have increased the maximum delivery sizes for many of our destinations. See the table below for the new maximums.
 

Compression Support

As the data we deliver to you is text, compression can be used to greatly reduce the size of files we deliver, making transport far more efficient. Although compression rates do vary, we are typically seeing an 80% reduction in file size with this option enabled.
 
We have introduced GZip and ZLib compression to our most popular destinations. You can enable compression on a destination by selecting the option in your dashboard, or by specifying the output_param.compression parameter through the API.
 
When data is delivered you can tell it has been compressed in two ways:
 
  • HTTP destination: the HTTP header 'X-DataSift-Compression' will have the value none, zlib or gzip as appropriate
  • S3 and SFTP destinations: files delivered to your destination will have an additional '.gz' extension if they have been compressed, for example DataSift-xxxxxxxxxxxxxxxxxxx-yyyyyyy.json.gz
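Handling a compressed payload needs only the Python standard library; a sketch:

```python
import gzip
import zlib

def decode_payload(body, compression="none"):
    """Decompress a delivered payload. `compression` comes from the
    X-DataSift-Compression header (HTTP) or the '.gz' extension (files)."""
    if compression == "gzip":
        return gzip.decompress(body)
    if compression == "zlib":
        return zlib.decompress(body)
    return body
```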
 
Here's a summary of our current push destinations support for these features.
 
Destination Maximum Payload Size Compression Support
HTTP 200 MB GZip, ZLib
S3 200 MB GZip
SFTP 50 MB GZip
CouchDB 50 MB  
ElasticSearch 200 MB  
FTP 200 MB  
MongoDB 50 MB  
MySQL 50 MB  
PostgreSQL 50 MB  
Pull 50 MB  
Redis 50 MB  
Splunk 50 MB  

Stay Up-To-Date

To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed
 
Or follow us on Twitter at @DataSiftDev.
Richard Caudle
Updated on Tuesday, 29 April, 2014 - 09:17
In a previous post I discussed how we're broadening our reach to help you get the best out of East Asian sources such as using our Chinese tokenization engine.
 
To build on this momentum, I'm excited to be able to announce a new data source for Tencent Weibo, another huge Chinese network you'll be eager to get your hands on. Now you can build more comprehensive solutions for the Chinese market with ease.
 

Tencent Weibo - A Key Piece In Your Chinese Social Jigsaw

China has the most active social network community in the world. With over 600 million Internet users on average spending 40% of their online time using social networks, there's an awful lot of conversation out there which no doubt you'd love to harness.
 
There are a wide variety of social networks used in China; one of the largest is Tencent Weibo. Tencent Weibo gives great coverage of 3rd and 4th tier cities: essentially emerging markets which already have large populations and are experiencing massive growth. To generate full insights and maximum opportunity from Chinese markets, it is essential that you listen to these conversations.
 

Understanding Tencent Weibo Interactions

Tencent Weibo is modelled largely on Twitter. Just like Twitter, it lets users write posts of up to 140 characters and share photos and videos. As a result, Tencent Weibo lends itself to use cases similar to those you may already have set up with Twitter.
 
We expose as much data as possible to you through targets. A full list of the Tencent Weibo targets can be found in our documentation. Here are a few highlights to get you started though.
 

Types of Interaction

Tencent also has its own types of activity, which are very similar to Twitter's. A 'post' is the equivalent of a tweet, and a 'repost' is the equivalent of a retweet. 
 
A reply is slightly different however. If you reply on Twitter, you mention the user you are replying to. On Tencent Weibo when you reply you are actually continuing a specific thread and do not need to mention the user you are replying to.
 
To distinguish between these types you can use the tencentweibo.type target.
 

Thread ID

As I mentioned above Tencent Weibo runs a threaded conversation model. You can filter to certain conversations by using the thread ID, exposed by the tencentweibo.thread_id target.
 
This is very useful because you can for example pick up a first post which discusses a topic you're interested in, then you can make a note of the thread ID and track any replies which follow.
 

Author's Influence

Frequently you'll want to know a little more about the content's author. Three useful pieces of metadata you can work with are:
  • tencentweibo.author.followers_count: The number of followers a user has
  • tencentweibo.author.following_count: The number of users the user follows
  • tencentweibo.author.statuses_count: The number of posts the user has created
We commonly use similar features to identify spam on Twitter. For example, we might filter out content from users who follow a high number of users but themselves have few followers, as this is a common signature for a bot.
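As a sketch, such a bot signature could be expressed as a simple predicate. The thresholds and ratio here are arbitrary assumptions to tune against your own data:

```python
def looks_like_bot(followers_count, following_count,
                   min_following=1000, max_ratio=0.05):
    """Flag authors who follow many accounts but have very few
    followers relative to how many they follow."""
    if following_count < min_following:
        return False
    return followers_count < following_count * max_ratio
```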
 

Tencent In Action

Ok, so you've decided that you want to tap into the world of Tencent Weibo conversation. How does this work in practice? Let's look at a quick example.
 
A common use of the new data source will be brand monitoring, so let's write some CSDL that picks out well-known brands from Tencent Weibo chatter. For this example I'm going to use the targets discussed above to filter down to influential authors who are posting original content; this will give us more pertinent data for our use case.
 
To filter to influential users I can use the tencentweibo.author.followers_count target:
 
tencentweibo.author.followers_count >= 10000
 
To filter to original posts (so exclude replies and reposts) I can use the tencentweibo.type target: 
 
tencentweibo.type == "post"
 
To filter to a list of brands I'm interested in (Coca-Cola, Walmart, etc.): 
 
tencentweibo.text contains_any [language(zh)] "可口可乐, 谷歌, 沃尔玛, 吉列, 亚马逊, 麦当劳, 联合利华, 葛兰素史克, 路虎, 维珍航空"
 
Trust me for now on the translations! Things will get clearer in a minute.
 
The expression I've used here uses the tencentweibo.text target, which exposes the text content of the post. Following this I make use of Chinese tokenization, using the [language(zh)] switch as explained in my previous post to ensure accurate matching of my brand names.
 
My finished filter becomes: 
 
So now I have a stream of original content from influential authors discussing my collection of brands. In just a few minutes we've built an extremely powerful definition.
 

A Helping Hand From VEDO

Honestly, I struggle when working with Chinese data, because I can't speak a word of Mandarin or Cantonese. (I did once spend a month in China and picked up my Chinese nickname of 'silver dragon', but unfortunately I got no further.) Fortunately I can make use of VEDO tagging to help me understand the data.
 
I can write a simple tag to pick out each brand mention, for example "Coca-Cola", as follows:
 
tag.brand "Coca-Cola" { tencentweibo.text contains [language(zh)] "可口可乐" }
 
Notice that tag.brand, part of VEDO tagging, declares a namespace for the "Coca-Cola" tag that follows. The braces after the tag contain an expression; if an interaction matches the expression, the tag is applied to that interaction. When the data arrives at my application it is tagged with the brand name in English, which makes it much easier for me to work with.
 
Remember that VEDO tags are applied only to data that first matches the filter wrapped in the return clause. In my final definition I'll add a tag line for each brand. 
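Generating those per-brand tag lines plus the return clause lends itself to the same string-building approach. A sketch, with the brand list abbreviated and the CSDL held as plain Python strings:

```python
BRANDS = {"Coca-Cola": "可口可乐", "Google": "谷歌", "Walmart": "沃尔玛"}

def build_tagged_csdl(brands, filter_csdl):
    """One tag.brand line per brand, followed by the return clause."""
    lines = [
        f'tag.brand "{en}" {{ tencentweibo.text contains [language(zh)] "{zh}" }}'
        for en, zh in brands.items()
    ]
    lines.append(f"return {{ {filter_csdl} }}")
    return "\n".join(lines)
```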
 
For a refresher on VEDO, please take a look at my earlier posts.
 

Putting It All Together

I can put my filter together with my tags by wrapping the filter in a return clause. My completed CSDL is as follows:
 
Running this stream in preview you can see that conversation on Tencent Weibo is being nicely categorised so it can be much more easily understood.
 
 

Over To You...

This concludes my whirlwind introduction to Tencent Weibo. Technology aside, it's definitely worth emphasising again that Tencent Weibo is a vital source if you want to maximise opportunities in Chinese marketplaces. 
 
For a full reference on Tencent Weibo targets, please see our technical documentation.
 
To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed.
Richard Caudle
Updated on Wednesday, 26 March, 2014 - 12:35
We all know that China is a vitally important market for any international brand. Until recently it has been difficult to access conversation from Chinese networks and tooling support for East Asian languages has been limited. This is why at DataSift we're proud to not only now offer access to Sina Weibo, but equally important we have greatly improved our handling of Chinese text to allow you to get the most from this market.
 

The Challenges Of East Asian Social Data

Until now it has been difficult to access social insights from markets such as China, for two reasons:
 
  • Access to data: Almost all conversations take place on local social networks, rather than Twitter and Facebook. The ecosystem around these local networks has been less mature, and therefore gaining access has been more challenging.
  • Inadequate tooling: Even if you could gain access to these sources, the vast majority of tools are heavily biased towards European languages, trained on spacing and punctuation which simply don't exist in East Asian text. Inadequate tooling leads to poor and incomplete insights.
Happily our platform now solves both of these challenges for you. Firstly, we now give you access to Sina Weibo. Secondly, we have greatly improved our handling of Chinese text, to give you exactly the same powers you'd expect when processing European languages. Specifically, we support simplified Chinese (Mandarin) text.
 
Incidentally, we also tokenize Japanese content which is a different challenge to Chinese text. The methods of tokenization are quite different but equally important to the accuracy of your filters. Read a detailed post here from our Data Science team.
 

Moving Beyond Substring Matching

In the past our customers have been able to filter Chinese content by using the substr operator. This can give you inaccurate results though because the same sequence of Chinese characters can have different meanings. 
 
Take for example the brand Samsung, which is written as follows:
 
三星
 
These characters are also present in the phrase "three weeks" and many place names. So a simple filter using substr like so could give you unwanted data:
 
interaction.content substr "三星"
 
It would match both of these sentences:
 
我爱我新的三星电视!  (I love my new Samsung TV!)
我已经等我的包裹三星期了!  (I've been waiting three weeks for my parcel to arrive!)
 
By tokenizing the text into words, and allowing our customers to filter using operators such as contains, our customers can now receive more accurately filtered data.
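The false positive described above is easy to reproduce. Plain substring matching hits both sentences, while word-level matching over tokenized text keeps only the true mention; the token lists below are hand-written to mimic what a tokenizer might emit, not the output of the actual engine:

```python
samsung = "三星"  # Samsung
sentence_tv = "我爱我新的三星电视!"        # I love my new Samsung TV!
sentence_weeks = "我已经等我的包裹三星期了!"  # I've been waiting three weeks...

# Substring matching hits both sentences, the second incorrectly,
# because 三星期 ("three weeks") contains the characters 三星.
substring_hits = [s for s in (sentence_tv, sentence_weeks) if samsung in s]

# Word-level matching over tokenized text keeps only the true mention.
tokens_tv = ["我", "爱", "我", "新的", "三星", "电视", "!"]
tokens_weeks = ["我", "已经", "等", "我的", "包裹", "三", "星期", "了", "!"]
token_hits = [t for t in (tokens_tv, tokens_weeks) if samsung in t]
```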
 

Tokenization 101

The key to handling Chinese text accurately is through intelligent tokenization. Through tokenization we can provide you with our full range of text matching operators, rather than simple substring matching. 
 
I realise this might not be immediately obvious, so I'll explain using some examples.
 
Let's start with English. You probably already know you can use CSDL (our filtering language) to look for mentions of words like so:
 
interaction.content contains_near "recommend,tv:4"
 
This will match content where the words 'recommend' and 'tv' are close together, such as: 
 
Can anyone recommend a good TV?
 
This works because our tokenization engine internally breaks the content into words for matching, using spaces as word boundaries:
 
Can anyone recommend a good TV ?
 
With this tokenization in place we can run operators such as contains and contains_near.
 
However, with Chinese text there are no spaces between words. In fact Chinese text contains long streams of characters, with no hard and fast rules for word boundaries that can be simply implemented.
 

Chinese Tokenization

The translation of 'Can anyone recommend a good TV?' is:
 
你能推荐一个好的电视吗
 
With the new Chinese tokenization support, internally the platform breaks the content into words as follows:
 
你能 推荐 一个 好的 电视 吗
(You can | recommend | a | good | TV | ?)
 
The DataSift tokenizer uses a machine-learned model to select the most appropriate tokenization and gives highly accurate results. This model has been extensively trained and is constantly updated.
 
Our CSDL to match this would be:
 
interaction.content contains_near [language(zh)] "推荐,电视:4"
 
The syntax [language(zh)] tells the engine that you would like to tokenize content using Chinese tokenization rules.
 

Best Practice

To ensure the accuracy of the filter, we recommend you add further keywords or conditions. For example, the following filters for content containing both Samsung and TV:
 
interaction.content contains [language(zh)] "三星"
AND interaction.content contains [language(zh)] "电视"
 
This may seem like we're cheating(!), but in fact a native Chinese speaker would also rely on other surrounding text to decide that it is indeed Samsung the brand being discussed.
 

Try It For Yourself

So in summary, not only do we now provide access to Chinese social networks, but just as important our platform takes you beyond simple substring matching to give you much greater accuracy in your results.
 
If you don't have access to the Sina Weibo source you can start playing with Chinese tokenization immediately via Twitter. The examples above will work nicely because they work across all sources.
 
For a full reference on the new sources, please see our technical documentation.
 
To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed
