Blog posts in Announcements

Richard Caudle's picture
Richard Caudle
Updated on Wednesday, 26 March, 2014 - 12:35
We all know that China is a vitally important market for any international brand. Until recently it has been difficult to access conversation from Chinese networks and tooling support for East Asian languages has been limited. This is why at DataSift we're proud to not only now offer access to Sina Weibo, but equally important we have greatly improved our handling of Chinese text to allow you to get the most from this market.
 

The Challenges Of East Asian Social Data

Until now it has been difficult to access social insights from markets such as China, for two reasons:
 
  • Access to data: Almost all conversations take place on local social networks, rather than Twitter and Facebook. The ecosystem around these local networks has been less mature, and therefore gaining access has been more challenging.
  • Inadequate tooling: Even if you could gain access to these sources, the vast majority of tools are heavily biased towards European languages, trained on spacing and punctuation which simply don't exist in East Asian text. Inadequate tooling leads to poor and incomplete insights.
Happily our platform now solves both of these challenges for you. Firstly we now give you access to Sina Weibo. Secondly, we have greatly improved our handling of Chinese text, to give you exactly the same powers you'd expect when processing European languages. Specifically we support Mandarin, simplified Chinese text.
 
Incidentally, we also tokenize Japanese content which is a different challenge to Chinese text. The methods of tokenization are quite different but equally important to the accuracy of your filters. Read a detailed post here from our Data Science team.
 

Moving Beyond Substring Matching

In the past our customers have been able to filter Chinese content by using the substr operator. This can give you inaccurate results though because the same sequence of Chinese characters can have different meanings. 
 
Take for example the brand Samsung, which is written as follows:
 
三星
 
These characters are also present in the phrase "three weeks" and many place names. So a simple filter using substr like so could give you unwanted data:
 
interaction.content substr "三星"
 
It would match both of these sentences:
 
我爱我新的三星电视!  (I love my new Samsung TV!)
我已经等我的包裹三星期了!  (I've been waiting three weeks for my parcel to arrive!')
 
By tokenizing the text into words, and allowing our customers to filter using operators such as contains, our customers can now receive more accurately filtered data.
 

Tokenization 101

The key to handling Chinese text accurately is through intelligent tokenization. Through tokenization we can provide you with our full range of text matching operators, rather than simple substring matching. 
 
I realise this might not be immediately obvious, so I'll explain using some examples.
 
Let's start with English. You probably know already can use CSDL (our filtering language) to look for mentions of words like so:
 
interaction.content contains_near "recommend,tv:4"
 
This will match content where the words 'recommend' and 'tv' are close together, such as: 
 
Can anyone recommend a good TV?
 
This works because our tokenization engine internally breaks the content into words for matching, using spaces as word boundaries:
 
Can anyone recommend a good TV ?
 
With this tokenization in place we can run operators such as contains and contains_near.
 
However, with Chinese text there are no spaces between words. In fact Chinese text contains long streams of characters, with no hard and fast rules for word boundaries that can be simply implemented.
 

Chinese Tokenization

The translation of 'Can any recommend a good TV?' is:
 
你能推荐一个好的电视吗
 
With the new Chinese tokenization support, internally the platform breaks the content into words as follows:
 
推荐 一个 好的 电视
You can recommend a good television ?
 
The DataSift tokenizer uses a machine learned model to select the most appropriate tokenization and gives highly accurate results. This learned model has been extensively trained is constantly updated.
 
Our CSDL to match this would be:
 
interaction.content contains_near [language(zh)] "推荐,电视:4"
 
The syntax [language(zh)] tells the engine that you would like to tokenize content using Chinese tokenization rules.
 

Best Practice

To ensure the accuracy of the filter, we recommend you add further keywords or conditions. For example, the following filters for content contain Samsung and TV:
 
interaction.content contains [language(zh)] "三星"
AND interaction.content contains [language(zh)] "电视"
 
This may seem like we're cheating(!), but in fact a native Chinese speaker would also rely on other surrounding text to decide that it is indeed Samsung the brand being discussed.
 

Try It For Yourself

So in summary, not only do we now provide access to Chinese social networks, but just as important our platform takes you beyond simple substring matching to give you much greater accuracy in your results.
 
If you don't have access to the Sina Weibo source you can start playing with Chinese tokenization immediately via Twitter. The examples above will work nicely because they work across all sources.
 
For a full reference on the new sources, please see our technical documentation.
 
To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed
Richard Caudle's picture
Richard Caudle
Updated on Thursday, 20 February, 2014 - 09:28
At DataSift we are chiefly known for our social data coverage, but increasingly you will see us broadening our net. LexisNexis provides news content from more than 20,000 media outlets worldwide, including content from newspapers, consumer magazines, trade journals, key blogs and TV transcripts. As such it provides an unrivalled source for reputation management, opportunity identification and risk management.
 

The LexisNexis Source

LexisNexis is a long-established, highly regarded provider of news coverage which is already relied upon by a wide range of organisations worldwide. The LexisNexis source, now available on our platform, gives you a compliant source for fully licensed, full text articles. The breadth of LexisNexis's coverage is truly impressive, and when put alongside our social data sources opens up a whole new range of possibilities to you.
 

How Could You Use It?

Social data, although rich with opinion and potential insight is only one part of the picture. In many cases to get a full picture you will want to see how a topic is being covered in the published media.
 
Some use cases that spring to mind include:
  • Reputation management: Spot important trends, new opportunities and potential threats and act on them before anyone else. By monitoring news content you can proactively monitor negative opinions, adverse developments and identify risks. Alongside LexisNexis you could add social data sources, so monitor reputation on both social networks and published media. 
  • Opportunity identification: By staying on top of the latest news stories, companies can anticipate customers' emerging needs and stay one step ahead of their competition. LexisNexis covers newspapers, press releases, specialist trade journals and regional publications so you can stay on top of breaking news.
  • Risk monitoring: There are many factors that can impact business performance, including the state of local economies, political upheaval and legislative change. Using LexisNexis news and legal coverage, keep abreast of issues that impact your suppliers and clients, and changes in local markets that could harm your business around the globe.

An Example Filter

To make things a little more concrete, let's consider the example of reputation management. 
 
Let's imagine I work for a large corporation and I want to monitor what is being said about my corporation in my local market across magazines, newspapers and by broadcasters. I can listen for mentions and alert my PR team, who can take steps to redress or amplify the coverage as necessary.
 
A simple example filter could be:
 
Using a DataSift destination I could integrate this data immediately as it arrives in to my existing tools and systems and inform my PR team.
 

LexisNexis SmartIndexing Technology™

As a quick aside, this seems to be a good time to discuss indexing / categorisation. LexisNexis through their SmartIndexing Technology, provide comprehensive indexing of content. This indexing identifies subjects, industries, companies, organizations, people and places and is exposed through the platform under the lexisnexis.indexing property. LexisNexis's advanced indexing operates beyond explicit keywords, identifying topics that are implied through context and previous experience.
 
This indexing feature greatly simplifies your queries and gives the content far richer context and meaning which you can take advantage of. This of course adds to the augmentations and custom categorisation features of the DataSift platform.
 
You can see in the example above I've used the company and country index to filter to Apple plus USA. If I'd filtered for 'Apple' using just keywords gives ambiguous results, so the indexing feature is extremely valuable here and gives much more accurate results.
 

LexisNexis + VEDO

Taking the example above one step further, I can also take advantage of VEDO tagging & scoring.
 
For instance, I can use scoring to give a notion of priority to the mentions so I can inform my PR team which are the most important mentions to act upon. As an illustrative example:
 
When the data is received by my PR team they can now easily prioritise their actions based on the scoring rules.
 

Can The LexisNexis Source Help You?

The addition of LexisNexis to the DataSift source family is an exciting step as use cases such as reputation and risk management are now so vital to organisations. Watch this space for further announcements on new sources as we continue to expand from our social roots.
 
For a full reference on the new source, please see our technical documentation.
 
To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed and keep an eye on our twitter account @DataSiftDev.
Richard Caudle's picture
Richard Caudle
Updated on Monday, 20 January, 2014 - 14:29

The launch of DataSift VEDO introduced new features to allow you to add structure to social data. Alongside we introduced the DataSift library to help you build solutions faster and learn quicker.

Today we continue this theme by adding further items to the library. These include examples of machine learned classifiers which are sure to whet your appetite and get your creative juices flowing.

Machine Learned Classifiers

Since we announced VEDO there's been a lot of buzz around the possibilities of machine learning. Look out for a blog post coming very soon for an in-depth look.

We've introduced the following classifiers to the library to give you a taste of just what's possible:

  • Customer Service Routing - Many organisations employ staff to read customer service tweets and route them to the correct team. This classifier is trained specifially for airline customer services and shows how you could automate this process and save staffing costs.
  • Product Purchase Stage -  Knowing at what stage a customer is from initially assessing a product, through to ownership is incredibly powerful. This classifier demonstrates the concept and has been trained for PS4 discussion.
  • People vs Organizations - In many use cases you will want to distinguish between content created by organisations and individuals. This generic classifier allows you to do just that at scale.

These classifiers have been created by our Data Science team. They take a large sample of interactions from the platform, manually classify the interactions and use machine learning to learn key signals, which dictate which category interactions should belong to. The result is a set of scoring rules that form the classifier. The resulting classifier can be run against live or historic data ongoing.

You can try out any of the classifiers now by creating a stream from the example code at the bottom of the library item page. For more details see my previous post.

Geo-Based Classifiers

Knowing a user's location can be extremely valuable for many use cases, yet location as a field can be very tricky to normalise.

As an example of how VEDO can help you with this process, we've introduced the following classifiers, which normalise geo-location information:

  • Major Airports - Categorises tweets made in and around major airports
  • NBA Arenas - Categorises tweets made in and around NBA venues
  • NFL Stadiums - Categorises tweets made in and around NFL stadia.

Outside of game days you'll see little traffic around sporting venues, but try running these on a match day to see the power of these definitions!

Improved Classifiers

Alongside introducing new classifiers and increasing the library's breadth, we've also worked hard on improving further two existing classifiers. We think you'll find these two extremely useful in your solutions:

  • Professions & Roles -  We've restructured the taxonomy to professional function based on the LinkedIn hierarchy.
  • Twitter Source - This classifier has also been restructured to bucket applications into useful categories, including whether content has been manually created (say by a user on their mobile phone) or by an automated service.

Even More To Follow

We're not stopping here. Expect to see more and more items being added to the library, covering a wider range of use cases and industries. Keep an eye out for new items and please watch this blog for further news.

To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed

At this point let me encourage you to sign-up if you're not already a DataSift user, and then jump straight in the library and see it can make your life easier!

Richard Caudle's picture
Richard Caudle
Updated on Thursday, 9 January, 2014 - 12:03

The launch of DataSift VEDO introduced new features to allow you to add structure to social data. Alongside VEDO we also introduced the DataSift library - a great new resource to inspire you and make your life easier. Benefit from our experience and build better solutions faster.

What? Why? Where?

We've introduced the DataSift library to help you to benefit from our experience working with our customers. CSDL (our filtering and tagging language) is extremely powerful, but it might not be clear exactly what can be achieved. Using the library we'll share with you definitions we've written for real-world solutions so you can learn quicker and get the most from our platform.

Currently the library contains tagging and scoring definitions that demonstrate the power of VEDO. There are out-of-the-box components you can use straight away, and example rules you can take and run with.

You'll find the new Library as a new tab on the Streams page:

 

 

Supported Out-Of-The-Box Components

Items marked as 'supported' in the library are definitions you can count on us to maintain. You can confidently use these as part of your solution immediately.

You can also use these definitions as a base to start from. You can copy the definitions into your account and modify the rules to fit your use case. After all 'spam' for one use case can be gold for another!

Supported items include:

  • Competitions & Marketing: Scores content to say how likely it is to be noise from competitions and marketing campaigns.
  • Twitter Application Used: Identifies and categorises the source application used to create a tweet - great for picking out content from services, real users and bots.
  • Professions & Roles: Where possible identifies user's profession and seniority based on their profile description.

 

Real-World Example Solutions

Items marked as 'example' in the library are definitions which we've built that will help you learn from real-world samples. You can run these examples directly, but we envisage you using these definitions as starting points and modifying or extending them to fit your solution.

Example items include:

 

Using a Library Item

It's easy to make use of a library item. You can either import an item into one of your streams, or copy an item into your account and modify it to your heart's content.

Note that all of the library items are currently tagging and scoring rules. You'll need to use them with a return statement. For more details please see our technical documentation.

Importing a definition

At the bottom of the page for each library item you'll find a tab labelled Usage Examples. This tab shows you example code which you can copy and paste into a new stream and run a live preview.

 

The key here is the tags keyword and hash for the stream. You can copy and paste this line into any of your streams to import the tagging rules.

Copying a definition

On each library item page there is a snippet of code that shows you the entire, or part of the definition. You can click the Copy to new stream button to copy the entire definition to your account. You can then inspect and modify the code as you see fit.

 

 

More To Follow

We'll work hard on adding more and more items to the library so it becomes an extremely valuable resource. Keep an eye out for new items and please watch this blog for further news.

At this point let me encourage you to sign-up if you're not already a DataSift user, and then jump straight in the library and see it can make your life easier!

Richard Caudle's picture
Richard Caudle
Updated on Monday, 16 December, 2013 - 12:23

Today we announced the arrival of DataSift VEDO. In this post I’ll outline what this means to you as a developer or analyst.

DataSift VEDO gives you a robust solution to add structure to social data, solving one of the common challenges when working with unstructured ‘big data’. VEDO lets you define rules to classify data so that it fits your business model. The data delivered to your application needs less post-processing and is much easier to work with. The new features will save you time and give you a load more possibilities for your social data.

Data Is Meaningless Without Structure

When working with big data such as social content, one challenge you will always need to tackle is giving unstructured data meaningful structure. If you’re working with our platform currently, you will no doubt be extracting data to your server and running post-processing rules to organise the data to meet your needs.

Processing unstructured data is expensive and not much fun, but it’s where we excel. VEDO lets you offload processing onto our platform. You can now use CSDL (the same language you use for filtering) to add custom metadata labels and scores to data specifically for your use case.

Introducing Tagging And Scoring

VEDO introduces new features which let you attach this metadata, these are tagging and scoring.

Tagging allows you to categorize interactions to match your business model. Any interaction that matches a tagging rule will be given the appropriate text label, serving as a boolean flag to indicate whether an interaction belongs to a category.

Scoring builds on tagging allowing you to attach numerical values to interactions rather than just labels. Scoring allows you to build up a score over many rules, and allows you to model subtle concepts such as priority, intention and weighting.

As you begin to use tagging and scoring more and more, you will want to be able to organise your growing set of rules. To help we have also introduced tag namespaces and reusable tag definitions. Tag namespaces allow you to define taxonomies of tags. You can group tags at any number of levels in namespaces and build deep schemas to fully reflect your model. Reusable tag definitions allow you to perfect your rules and reuse them across any number of streams and projects.

Definition Library

Tagging and scoring are powerful features, but at this point you might not have grasped exactly how they can help you. Therefore alongside the tagging features we’ve also introduced a library of definitions to get you started. Some definitions you can use immediately in your streams (and benefit from our experience), and some serve as example definitions to show you what is now possible.

For example, we have definitions that help you score content for quality (such as how likely is the content a job advert?) and make it easier to exclude spam. On the other hand we have an example definition that shows how you can use the new features to classify conversations for customer service teams, picking out rants, raves and enquiries.

You can view the library here.

There’s More...

Although tagging is the main theme of the new release, there is an awful lot more happening here at DataSift. Alongside the release of VEDO we’re giving you more power, more connectivity and a wider range of sources to play with.

For instance we’ve just introduced delivery destinations for MySQL and PostgreSQL. These new destinations allow you to map your filtered data directly to a tabular schema and have it pushed directly into your database.

We’re also in the process of bringing many more sources onboard (you may have seen our recent announcements!), including many asian social networks.

Look out for improvements to help you work with a wider variety of languages, updates to our developer tools and client libraries, and much much more. I’ll cover these all soon.

Watch this space

In summary there’s far too much to cover in detail here. So watch this space, as over the coming weeks I’ll cover every feature of the new release in depth, with worked examples and sample code so you can take advantage of all these new powers for yourself.

If you can’t wait, all of these new features are fully documented in our Documentation area. Again, check out the new library for inspiration.

If you’re new to DataSift, what’s stopping you? Register now and experience the power of our platform for yourself!!

Pages