Blog posts in Announcements

Richard Caudle's picture

DataSift PYLON - The value of unified data processing

Applying CSDL and VEDO to Facebook topic data

You might have seen our recent announcement, which enables developers to gain insights from Facebook topic data. No doubt you're eager to learn more! In this post we'll look at how easy it is to take CSDL you've fine-tuned for filtering data from networks such as Twitter or Tumblr and apply it in PYLON, not only for filtering but also for classifying Facebook topic data. This demonstrates the simplicity of using a unified data processing platform.
 
Before we jump in too deep, at this point you're probably keen to know exactly what PYLON is. Let's give you a quick intro…

What is PYLON?

PYLON is a new API that enables a privacy-first approach to analyzing Facebook posts and engagement data. Using the PYLON API, for the first time you can build insights from Facebook's topic data.
 
With PYLON you can analyze the billions of posts and media engagements that take place on Facebook every day, while respecting the privacy of the individual people using Facebook.
 
The sources of aggregate and anonymized information that can be extracted from Facebook topic data can be broadly categorized as posts and engagement data:
 
  • Posts on pages: Content, Topics, Links and Hashtags
  • Engagement data: Comments, Shares, Likes

Privacy-First

PYLON gives you access to Facebook topic data, but it also protects the identity of individuals on Facebook. Personally identifiable information (PII) is never exposed.

  • You receive statistical summarized results, never individual posts.
  • A minimum audience-threshold of 100 is applied to any analysis you perform to protect from individual-level analysis.
  • Data is processed within Facebook’s own data centre. Raw data never leaves Facebook’s servers.
  • Interaction data is only available for analysis for up to 30 days, after which time it is deleted from the service.
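
To make the audience threshold concrete, here is a toy sketch in Python. It is not how PYLON is implemented: the function name, the input shape (a mapping from a hypothetical author key to an age group) and the decision to return nothing at all below the threshold are assumptions made purely for illustration. Only the threshold of 100 comes from the list above.

```python
from collections import Counter

MIN_AUDIENCE = 100  # audience threshold quoted in the list above


def age_distribution(author_ages):
    """Return counts per age group, or None when the audience is too small.

    author_ages maps a (hypothetical) unique author key to an age group.
    """
    if len(author_ages) < MIN_AUDIENCE:
        return None  # too few unique authors, so no result is released
    return dict(Counter(author_ages.values()))


# 99 unique authors: below the threshold, nothing is returned
small = {f"author{i}": "18-24" for i in range(99)}
# 200 unique authors: a summary is returned, never the individual records
large = {f"author{i}": ("18-24" if i % 2 else "25-34") for i in range(200)}

print(age_distribution(small))
print(age_distribution(large))
```

The point of the sketch is that the caller only ever sees aggregated counts, and sees nothing when the audience is small enough to risk identifying individuals.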

So with PYLON you can generate insights that were not possible before now, in a way that ensures privacy for Facebook users!

PYLON Workflow

If you've worked with DataSift before you'll be used to the Stream workflow:
 
  • Create filters in CSDL against your enabled data sources
  • Add classification rules to your filters using VEDO
  • Deliver raw, classified data to your chosen destination
 
PYLON is different: to maintain privacy, the output is delivered as summary results:
 
  • Create a filter that matches interactions on Facebook
  • Include classification rules to add value to the data
  • Record output from this filter into a private index
  • Submit analysis queries to the index to receive analysis results
 
 
The data you filter is recorded to a PYLON index only you have access to. You access your index using analysis queries which return summary statistical results from Facebook topic data.
 

Apply your CSDL expertise to Facebook topic data

Until now, if you're a DataSift customer, you'll have been using our Stream product, which lets you filter data from a variety of social data sources such as Twitter, Tumblr, and blogs.
 
In this scenario, you'll have created filters, for example to capture tweets about popular mobile games:
 
(
    interaction.content contains_any "Game of War,Boom Beach,Pet Rescue,Candy Crush,Don't Tap,2048,FarmVille 2"
    OR interaction.content contains_near "Clash,Clans:5"
    OR interaction.content contains_near "Puzzle,Dragons:5"
    OR interaction.content contains_all "step,white,tile"
)
AND NOT interaction.content wildcard "cheat*"
AND NOT interaction.content wildcard "hack*"
 
Even if you're not familiar with CSDL syntax, you can see how easy it is to work across different data fields (which we call targets), make use of operators such as contains_any, and combine many conditions with boolean operators.
 

Applying CSDL to Facebook topic data

With PYLON, just like Stream, your first step is to create a filter for the interactions you'd like to record to your index.
 
The best thing about PYLON is that under the hood it uses exactly the same filtering engine for Facebook topic data as is used to filter the Twitter firehose.
 
So, let's take our example CSDL from above. Suppose we'd like to index the Facebook topic data that matches exactly the same criteria as we used with Twitter:
 
(
    interaction.content contains_any "Game of War,Boom Beach,Pet Rescue,Candy Crush,Don't Tap,2048,FarmVille 2"
    OR interaction.content contains_near "Clash,Clans:5"
    OR interaction.content contains_near "Puzzle,Dragons:5"
    OR interaction.content contains_all "step,white,tile"
)
AND NOT interaction.content wildcard "cheat*"
AND NOT interaction.content wildcard "hack*"
 
Yes, that's right - the CSDL is exactly the same! 
 
This works because common data fields in both sources are mapped to the interaction namespace. In this case we're making heavy use of interaction.content which is the content of the post or tweet. Also regardless of the data field you're filtering upon, the operators such as contains_any work in exactly the same way too.
 
Of course, you can also utilise the unique data available from each network. For example, Facebook topic data contains audience demographics and topic data, neither of which is present in Twitter data. With this, you could expand your CSDL to focus your analysis on males in California who are talking about cars:
 
fb.author.gender == "male" and fb.author.region == "California"
and fb.topics.category == "Cars"
 
With Facebook topic data you can filter on demographics, but what's even better is that PYLON ensures privacy for Facebook users as returned results are anonymous, aggregated summaries.
 

Classifying with VEDO

The power of CSDL doesn't stop at filtering to identify data for analysis. You can use the same language, operators and data fields to classify your data using VEDO.
 
Classifying data this way is particularly important when using PYLON as privacy guards prevent you from accessing raw data. By using VEDO with PYLON you can add extra value to the data before it is saved in your index, and then make use of this extra data when submitting your analysis queries.
 
Let's build on our example above by adding tags to data that has been filtered. We can do so by adding tag rules to our filter:
 
tag.game "Candy Crush" { interaction.content contains "Candy Crush" }
tag.game "Game of War" { interaction.content contains "Game of War" }
tag.game "Boom Beach" { interaction.content contains "Boom Beach" }
tag.game "Pet Rescue" { interaction.content contains "Pet Rescue" }
tag.game "Don't Tap" { interaction.content contains "Don't Tap" }
tag.game "2048" { interaction.content contains "2048" }
tag.game "FarmVille 2" { interaction.content contains "FarmVille 2" }
tag.game "Clash of Clans" { interaction.content contains_near "Clash,Clans:5" }
tag.game "Puzzle & Dragons" { interaction.content contains_near "Puzzle,Dragons:5" }
tag.game "Don't Step On The White Tile" { interaction.content contains_all "step,white,tile" }
 
Again, this code can be run on Twitter data using the Stream product or against Facebook data in PYLON.
 
We can take things further by applying machine-learned classifiers. For instance, we could create a custom sentiment classifier, then apply it to both Facebook topic data and Twitter.
 

Analyzing Facebook topic data

Finally, to complete our workflow, let's take a quick look at submitting analysis queries in PYLON. As you cannot access the raw data, this is how you get insights from the data you recorded to your index.
 
When you submit a query to PYLON you specify the data field you'd like to analyze, what analysis to perform and which segment of the data in your index to analyze.
 
Here you get to use your CSDL skills once more, as CSDL is used to filter the data in your index into precise segments. You can drill down deeper and deeper to get very detailed insights.
 
As a quick example we could start by analyzing the age groups in our entire dataset as a frequency distribution. In PYLON you specify the following arguments:
 
{
    "analysis_type":"freqDist",
    "parameters":
    {
        "target":"fb.author.age"
    }
}
 
Using CSDL we can take this further by selecting a precise subset, based upon demographics and the tags we introduced in our example. To do so we simply specify a filter when making the analysis request:
 
interaction.tag_tree.game == "Candy Crush" AND fb.author.gender == "female"
AND fb.author.region == "UK"
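
As a rough sketch of how these pieces might be combined in code, the Python below assembles the analysis arguments and the segment filter into one request body. The helper name and the "filter" key are assumptions for illustration; only the "analysis_type", "parameters" and "target" fields appear in the example above, so check the PYLON API reference for the exact request shape.

```python
import json


def build_analysis_request(target, analysis_type="freqDist", csdl_filter=None):
    """Assemble analysis arguments, optionally narrowed by a CSDL filter."""
    request = {
        "analysis_type": analysis_type,
        "parameters": {"target": target},
    }
    if csdl_filter:
        # Assumed key name: the post does not show where the filter is passed
        request["filter"] = csdl_filter
    return request


segment = (
    'interaction.tag_tree.game == "Candy Crush" '
    'AND fb.author.gender == "female" '
    'AND fb.author.region == "UK"'
)
request = build_analysis_request("fb.author.age", csdl_filter=segment)
print(json.dumps(request, indent=4))
```

Keeping the filter as an optional argument mirrors the two steps in the text: first analyze the whole index, then drill down into a segment by adding CSDL.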
 
Immediately this gives us some powerful insights, breaking down players by age group:
 
 
This is just a taste of how powerful PYLON and VEDO are when analyzing Facebook topic data! In future posts we'll look at filtering, classification and analysis queries in much more depth.

 


New Community Site Launched!

Things change very fast at DataSift, and it can be hard to keep up.
 
This week we've released our new community site at community.datasift.com.
 
 
 
The community site hosts our forums going forwards. It's the best place to ask questions to be answered by our staff and other developers, and you can also leave feedback and make suggestions to our team.
 
Another role of the community site is to keep you better informed of platform changes and to build a community around our platform. The Announcements category will be used to keep you up-to-date with platform changes, and soon we will be hosting events for developers which will also be announced to the community.
 
Of course, if you have a support package, our Support team is always on hand too.

Subscribing to Announcements

To subscribe to new posts in any category (although Announcements will probably be the most interesting to get you started), you need to do the following:

  • Visit community.datasift.com
  • Click Log In in the top right corner
  • The new site has SSO integration with dev.datasift.com. If you already have an account, sign in; if not, register for a developer account.
  • Go to the Announcements category 
  • Click the top right dropdown and choose Watching

 
 

Announcing Tencent Weibo - Broaden Your Coverage Of Chinese Conversation

In a previous post I discussed how we're broadening our reach to help you get the best out of East Asian sources, for example through our Chinese tokenization engine.
 
To build on this momentum, I'm excited to be able to announce a new data source for Tencent Weibo, another huge Chinese network you'll be eager to get your hands on. Now you can build more comprehensive solutions for the Chinese market with ease.
 

Tencent Weibo - A Key Piece In Your Chinese Social Jigsaw

China has the most active social network community in the world. With over 600 million Internet users on average spending 40% of their online time using social networks, there's an awful lot of conversation out there which no doubt you'd love to harness.
 
A wide variety of social networks are used in China, and one of the largest is Tencent Weibo. Tencent Weibo gives great coverage of 3rd and 4th tier cities, essentially emerging markets which already have large populations and are experiencing massive growth. To generate full insights and maximum opportunity from Chinese markets, it is essential that you listen to these conversations.
 

Understanding Tencent Weibo Interactions

Tencent Weibo is modelled largely on Twitter. Just as on Twitter, users can use up to 140 characters for a post, and can share photos and videos. As a result, Tencent Weibo lends itself to use cases similar to those you may already have set up with Twitter.
 
We expose as much data as possible to you through targets. A full list of the Tencent Weibo targets can be found in our documentation. Here are a few highlights to get you started though.
 

Types of Interaction

Tencent Weibo also has its own types of activity, which are very similar to Twitter's. A 'post' is the equivalent of a tweet, and a 'repost' is the equivalent of a retweet.
 
A reply is slightly different however. If you reply on Twitter, you mention the user you are replying to. On Tencent Weibo when you reply you are actually continuing a specific thread and do not need to mention the user you are replying to.
 
To distinguish between these types you can use the tencentweibo.type target.
 

Thread ID

As I mentioned above, Tencent Weibo runs a threaded conversation model. You can filter to certain conversations by using the thread ID, exposed by the tencentweibo.thread_id target.
 
This is very useful because you can for example pick up a first post which discusses a topic you're interested in, then you can make a note of the thread ID and track any replies which follow.
 

Author's Influence

Frequently you'll want to know a little more about the content's author. Three useful pieces of metadata you can work with are:
  • tencentweibo.author.followers_count: The number of followers a user has
  • tencentweibo.author.following_count: The number of users the user follows
  • tencentweibo.author.statuses_count: The number of posts the user has created
Commonly we use similar features to identify spam on Twitter. For example we might filter out content from users who follow a high number of users, but themselves have few followers, as this is a common signature for a bot.
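
The bot signature described above can be sketched as a simple check on the two counts. This is a minimal illustration in Python, not a DataSift feature: the function name and both thresholds are hypothetical and would need tuning against real data.

```python
def looks_like_bot(followers_count, following_count,
                   min_following=1000, max_ratio=0.1):
    """Flag accounts that follow many users but have few followers.

    Both thresholds are hypothetical values chosen for illustration.
    """
    if following_count < min_following:
        return False  # does not follow enough accounts to be suspicious
    # Few followers relative to the number of accounts followed
    return followers_count / following_count < max_ratio


print(looks_like_bot(followers_count=20, following_count=5000))
print(looks_like_bot(followers_count=12000, following_count=300))
```

In a filter, the same idea would be expressed as separate conditions on tencentweibo.author.following_count and tencentweibo.author.followers_count, each compared against a fixed value.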
 

Tencent In Action

Ok, so you've decided that you want to tap into the world of Tencent Weibo conversation. How does this work in practice? Let's look at a quick example.
 
A common use of the new data source will be brand monitoring, so let's write some CSDL that picks out well-known brands from Tencent Weibo chatter. For this example I'm going to use the targets I discussed above to filter down to influential authors who are posting original content, as this will give us the most pertinent data for our use case.
 
To filter to influential users I can use the tencentweibo.author.followers_count target:
 
tencentweibo.author.followers_count >= 10000
 
To filter to original posts (so exclude replies and reposts) I can use the tencentweibo.type target: 
 
tencentweibo.type == "post"
 
To filter to a list of brands I'm interested in (Coca-Cola, Walmart, etc.): 
 
tencentweibo.text contains_any [language(zh)] "可口可乐, 谷歌, 沃尔玛, 吉列, 亚马逊, 麦当劳, 联合利华, 葛兰素史克, 路虎, 维珍航空"
 
Trust me for now on the translations! Things will get clearer in a minute.
 
The expression I've used here uses the tencentweibo.text target, which exposes the text content of the post. Following this I make use of Chinese tokenization, using the [language(zh)] switch as explained in my previous post to ensure accurate matching of my brand names.
 
My finished filter becomes: 
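
Assuming the three conditions above are simply combined with AND, the finished filter can be assembled as in this short Python sketch. The variable names are illustrative only; the CSDL fragments are copied from the steps above.

```python
# The three conditions built up in the steps above
conditions = [
    "tencentweibo.author.followers_count >= 10000",
    'tencentweibo.type == "post"',
    'tencentweibo.text contains_any [language(zh)] '
    '"可口可乐, 谷歌, 沃尔玛, 吉列, 亚马逊, 麦当劳, 联合利华, 葛兰素史克, 路虎, 维珍航空"',
]

# Every condition must hold, so join them with AND, one per line
finished_filter = "\nAND ".join(conditions)
print(finished_filter)
```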
 
So now I have a stream of original content from influential authors discussing my collection of brands. In just a few minutes I have an extremely powerful definition.
 

A Helping Hand From VEDO

Honestly, I struggle when working with Chinese data, because I can't speak a word of Mandarin or Cantonese. (I did once spend a month in China and picked up my Chinese nickname of 'silver dragon', but unfortunately I got no further.) Fortunately I can make use of VEDO tagging to help me understand the data.
 
I can write a simple tag to pick out each brand mention, for example "Coca-Cola", as follows:
 
tag.brand "Coca-Cola" { tencentweibo.text contains [language(zh)] "可口可乐" }
 
Notice that tag.brand is part of VEDO tagging; it declares a namespace for the "Coca-Cola" tag which follows. The braces that follow the tag contain an expression which, if matched for an interaction, will cause the tag to be applied to that interaction. When the data arrives at my application it is tagged with the brand name in English, which makes it much easier for me to work with.
 
Remember that VEDO tags are applied to data that has been first filtered by a filter wrapped in the return clause. In my final definition I'll add a line for each brand. 
 
For a refresher on VEDO, please take a look at my earlier posts.
 

Putting It All Together

I can put my filter together with my tags by wrapping the filter in a return clause. My completed CSDL is as follows:
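
As a sketch of how the pieces fit together, the Python below generates one tag rule per brand followed by the filter wrapped in a return clause. It covers only three of the brands for brevity, the English names are translations added for illustration, and the helper names and the exact layout of the return block are assumptions rather than the original listing.

```python
# English tag name -> Chinese term (translations added for illustration)
BRANDS = {
    "Coca-Cola": "可口可乐",
    "Google": "谷歌",
    "Walmart": "沃尔玛",
}


def tag_rule(english, chinese):
    """One VEDO tag rule in the style of the Coca-Cola example above."""
    return (f'tag.brand "{english}" '
            f'{{ tencentweibo.text contains [language(zh)] "{chinese}" }}')


def build_definition(brands, min_followers=10000):
    """Tag rules first, then the filter wrapped in a return clause."""
    tags = [tag_rule(en, zh) for en, zh in brands.items()]
    terms = ", ".join(brands.values())
    body = [
        f"    tencentweibo.author.followers_count >= {min_followers}",
        '    AND tencentweibo.type == "post"',
        f'    AND tencentweibo.text contains_any [language(zh)] "{terms}"',
    ]
    return "\n".join(tags + ["", "return {"] + body + ["}"])


definition = build_definition(BRANDS)
print(definition)
```

Generating the rules from a single mapping keeps the tag names and the filter keywords in step, so adding a brand means adding one entry.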
 
Running this stream in preview you can see that conversation on Tencent Weibo is being nicely categorised so it can be much more easily understood.
 
 

Over To You...

This concludes my whirlwind introduction to Tencent Weibo. Technology aside, it's definitely worth emphasising again that Tencent Weibo is a vital source if you want to maximise opportunities in Chinese marketplaces. 
 
For a full reference on Tencent Weibo targets, please see our technical documentation.
 
To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed.

Chinese Tokenization - Generate Accurate Insight From Chinese Sources Including Sina Weibo

We all know that China is a vitally important market for any international brand. Until recently it has been difficult to access conversation from Chinese networks, and tooling support for East Asian languages has been limited. This is why at DataSift we're proud not only to offer access to Sina Weibo but also, equally importantly, to have greatly improved our handling of Chinese text so you can get the most from this market.
 

The Challenges Of East Asian Social Data

Until now it has been difficult to access social insights from markets such as China, for two reasons:
 
  • Access to data: Almost all conversations take place on local social networks, rather than Twitter and Facebook. The ecosystem around these local networks has been less mature, and therefore gaining access has been more challenging.
  • Inadequate tooling: Even if you could gain access to these sources, the vast majority of tools are heavily biased towards European languages, trained on spacing and punctuation which simply don't exist in East Asian text. Inadequate tooling leads to poor and incomplete insights.
Happily, our platform now solves both of these challenges for you. Firstly, we now give you access to Sina Weibo. Secondly, we have greatly improved our handling of Chinese text, to give you exactly the same powers you'd expect when processing European languages. Specifically, we support Mandarin written as simplified Chinese text.
 
Incidentally, we also tokenize Japanese content which is a different challenge to Chinese text. The methods of tokenization are quite different but equally important to the accuracy of your filters. Read a detailed post here from our Data Science team.
 

Moving Beyond Substring Matching

In the past our customers have been able to filter Chinese content by using the substr operator. This can give you inaccurate results though because the same sequence of Chinese characters can have different meanings. 
 
Take for example the brand Samsung, which is written as follows:
 
三星
 
These characters are also present in the phrase "three weeks" and many place names. So a simple filter using substr like so could give you unwanted data:
 
interaction.content substr "三星"
 
It would match both of these sentences:
 
我爱我新的三星电视!  (I love my new Samsung TV!)
我已经等我的包裹三星期了!  (I've been waiting three weeks for my parcel to arrive!)
 
By tokenizing the text into words, and allowing our customers to filter using operators such as contains, our customers can now receive more accurately filtered data.
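
The difference can be seen in a few lines of Python. This is only an illustration: the token lists below were segmented by hand for the two example sentences and are not output from the DataSift tokenizer, and the two helper functions are stand-ins for the substr and contains operators.

```python
SAMSUNG = "三星"

# (sentence, hand-segmented tokens) for the two examples above
examples = [
    ("我爱我新的三星电视!",
     ["我", "爱", "我", "新", "的", "三星", "电视", "!"]),
    ("我已经等我的包裹三星期了!",
     ["我", "已经", "等", "我", "的", "包裹", "三", "星期", "了", "!"]),
]


def substr_match(text, term):
    """Substring matching: true wherever the characters appear in sequence."""
    return term in text


def token_match(tokens, term):
    """Token matching: true only when the term is a whole word."""
    return term in tokens


for text, tokens in examples:
    print(text, substr_match(text, SAMSUNG), token_match(tokens, SAMSUNG))
```

Substring matching accepts both sentences, while token matching accepts only the one where 三星 stands as a word of its own.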
 

Tokenization 101

The key to handling Chinese text accurately is through intelligent tokenization. Through tokenization we can provide you with our full range of text matching operators, rather than simple substring matching. 
 
I realise this might not be immediately obvious, so I'll explain using some examples.
 
Let's start with English. You probably already know you can use CSDL (our filtering language) to look for mentions of words like so:
 
interaction.content contains_near "recommend,tv:4"
 
This will match content where the words 'recommend' and 'tv' are close together, such as: 
 
Can anyone recommend a good TV?
 
This works because our tokenization engine internally breaks the content into words for matching, using spaces as word boundaries:
 
Can anyone recommend a good TV ?
 
With this tokenization in place we can run operators such as contains and contains_near.
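
To show why word boundaries matter, here is a rough Python emulation of a two-word proximity check. It is not the DataSift engine: the tokenizer is a simple regular expression, the function names are illustrative, and measuring distance as the difference between token positions is an assumption about how contains_near counts.

```python
import re


def tokenize(text):
    """Split into lowercase word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def contains_near(text, first, second, distance):
    """True when both words occur within `distance` token positions."""
    tokens = tokenize(text)
    first_pos = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_pos = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(abs(i - j) <= distance for i in first_pos for j in second_pos)


print(tokenize("Can anyone recommend a good TV?"))
print(contains_near("Can anyone recommend a good TV?", "recommend", "tv", 4))
print(contains_near(
    "I recommend that you read the manual before you switch on the TV",
    "recommend", "tv", 4))
```

Everything here depends on tokenize finding word boundaries, which is exactly the step that spaces make easy in English.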
 
However, with Chinese text there are no spaces between words. In fact Chinese text contains long streams of characters, with no hard and fast rules for word boundaries that can be simply implemented.
 

Chinese Tokenization

The translation of 'Can anyone recommend a good TV?' is:
 
你能推荐一个好的电视吗
 
With the new Chinese tokenization support, internally the platform breaks the content into words as follows:
 
你 能 推荐 一个 好的 电视 吗
You can recommend a good television ?
 
The DataSift tokenizer uses a machine-learned model to select the most appropriate tokenization and gives highly accurate results. This learned model has been extensively trained and is constantly updated.
 
Our CSDL to match this would be:
 
interaction.content contains_near [language(zh)] "推荐,电视:4"
 
The syntax [language(zh)] tells the engine that you would like to tokenize content using Chinese tokenization rules.
 

Best Practice

To ensure the accuracy of the filter, we recommend you add further keywords or conditions. For example, the following filters for content containing both Samsung and TV:
 
interaction.content contains [language(zh)] "三星"
AND interaction.content contains [language(zh)] "电视"
 
This may seem like we're cheating(!), but in fact a native Chinese speaker would also rely on other surrounding text to decide that it is indeed Samsung the brand being discussed.
 

Try It For Yourself

So in summary, not only do we now provide access to Chinese social networks but, just as importantly, our platform takes you beyond simple substring matching to give you much greater accuracy in your results.
 
If you don't have access to the Sina Weibo source you can start playing with Chinese tokenization immediately via Twitter. The examples above will work nicely because they work across all sources.
 
For a full reference on the new sources, please see our technical documentation.
 
To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed

Announcing LexisNexis - Monitor Reputation, Threats & Opportunities Through Global News Coverage

At DataSift we are chiefly known for our social data coverage, but increasingly you will see us broadening our net. LexisNexis provides news content from more than 20,000 media outlets worldwide, including content from newspapers, consumer magazines, trade journals, key blogs and TV transcripts. As such it provides an unrivalled source for reputation management, opportunity identification and risk management.
 

The LexisNexis Source

LexisNexis is a long-established, highly regarded provider of news coverage which is already relied upon by a wide range of organisations worldwide. The LexisNexis source, now available on our platform, gives you a compliant source for fully licensed, full text articles. The breadth of LexisNexis's coverage is truly impressive, and when put alongside our social data sources opens up a whole new range of possibilities to you.
 

How Could You Use It?

Social data, although rich with opinion and potential insight, is only one part of the picture. In many cases, to get a full picture you will want to see how a topic is being covered in the published media.
 
Some use cases that spring to mind include:
  • Reputation management: Spot important trends, new opportunities and potential threats and act on them before anyone else. By monitoring news content you can proactively monitor negative opinions and adverse developments, and identify risks. Alongside LexisNexis you could add social data sources, to monitor reputation on both social networks and published media.
  • Opportunity identification: By staying on top of the latest news stories, companies can anticipate customers' emerging needs and stay one step ahead of their competition. LexisNexis covers newspapers, press releases, specialist trade journals and regional publications so you can stay on top of breaking news.
  • Risk monitoring: There are many factors that can impact business performance, including the state of local economies, political upheaval and legislative change. Using LexisNexis news and legal coverage, keep abreast of issues that impact your suppliers and clients, and changes in local markets that could harm your business around the globe.

An Example Filter

To make things a little more concrete, let's consider the example of reputation management. 
 
Let's imagine I work for a large corporation and I want to monitor what is being said about my corporation in my local market across magazines, newspapers and by broadcasters. I can listen for mentions and alert my PR team, who can take steps to redress or amplify the coverage as necessary.
 
A simple example filter could be:
 
Using a DataSift destination I could integrate this data into my existing tools and systems as soon as it arrives, and inform my PR team.
 

LexisNexis SmartIndexing Technology™

As a quick aside, this seems to be a good time to discuss indexing / categorisation. LexisNexis, through its SmartIndexing Technology, provides comprehensive indexing of content. This indexing identifies subjects, industries, companies, organizations, people and places and is exposed through the platform under the lexisnexis.indexing property. LexisNexis's advanced indexing operates beyond explicit keywords, identifying topics that are implied through context and previous experience.
 
This indexing feature greatly simplifies your queries and gives the content far richer context and meaning which you can take advantage of. This of course adds to the augmentations and custom categorisation features of the DataSift platform.
 
You can see in the example above I've used the company and country indexes to filter to Apple plus USA. Filtering for 'Apple' using keywords alone would give ambiguous results, so the indexing feature is extremely valuable here and gives much more accurate results.
 

LexisNexis + VEDO

Taking the example above one step further, I can also take advantage of VEDO tagging & scoring.
 
For instance, I can use scoring to give a notion of priority to the mentions so I can inform my PR team which are the most important mentions to act upon. As an illustrative example:
 
When the data is received by my PR team they can now easily prioritise their actions based on the scoring rules.
 

Can The LexisNexis Source Help You?

The addition of LexisNexis to the DataSift source family is an exciting step as use cases such as reputation and risk management are now so vital to organisations. Watch this space for further announcements on new sources as we continue to expand from our social roots.
 
For a full reference on the new source, please see our technical documentation.
 
To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed and keep an eye on our Twitter account @DataSiftDev.
