Ed Stenson | 2nd March 2012
DataSift uses an engine called Salience to perform sentiment analysis. We've just upgraded to a new version, Salience 5, and in this post I'll review the existing features and introduce the new ones.
Up to now, DataSift's sentiment analysis has offered two numeric values. The first is a score that indicates the sentiment it was able to find in the body of a post. A score of zero indicates neutrality, whereas a positive score suggests positive sentiment and a negative score suggests the opposite.
Remember, a post might be a Tweet, but it might be content from MySpace or Facebook, for example. For a Tweet, that single score is sufficient because a Tweet has no title; its body is simply positive, negative, or neutral. However, other posts (a blog post is a good example) include a title as well as some content. In this case, Salience provides the title sentiment and the content sentiment separately.
This is useful in all sorts of ways. For example, you might be interested just in the sentiment in the title because you're looking for a particularly positive 'headline'. Alternatively, you might want to investigate messages in which the title sentiment and the content sentiment conflict. A CSDL filter in DataSift for such a situation looks like this:
(salience.content.sentiment > 0 and salience.title.sentiment < 0) or (salience.content.sentiment < 0 and salience.title.sentiment > 0)
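The same condition can also be checked offline against posts you've already collected. Here's a minimal Python sketch. It assumes each post has been flattened into a dictionary keyed by the CSDL target names; that layout is an assumption made for illustration, not DataSift's actual output format, and the scores are made up.

```python
def sentiment_conflicts(post):
    """Return True when title and content sentiment have opposite signs."""
    # The keys mirror the CSDL targets; the flat-dict layout is an assumption.
    content = post.get("salience.content.sentiment", 0)
    title = post.get("salience.title.sentiment", 0)
    return (content > 0 and title < 0) or (content < 0 and title > 0)


# Made-up sample posts for illustration.
posts = [
    {"salience.title.sentiment": 4, "salience.content.sentiment": -3},
    {"salience.title.sentiment": 2, "salience.content.sentiment": 5},
    {"salience.title.sentiment": 0, "salience.content.sentiment": -1},
]
conflicting = [p for p in posts if sentiment_conflicts(p)]
```

Only the first sample post is kept: a neutral title (score of zero) doesn't count as a conflict, which matches the strict comparisons in the CSDL filter.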
We now have two new Salience features:
- Named entities
- Topics
Topics are new to Salience 5. Entities are not new to Salience but they're new to DataSift.
Let's take a look at named entities first. The whole idea of entity analysis is to automatically extract the key components of a message. Here’s the list of entities you can find:
- Company names
- Currency amounts
- Email addresses
- Family names
- Job titles
- Phone numbers
- Product names
Entity extraction brings you three major benefits:
- You automatically distinguish between, say, Aruba (the wireless networking company) and Aruba (the Caribbean island).
- You address positive and negative sentiment about separate entities. For example, think of a filter designed to analyze competitors such as Coke and Pepsi. The system assigns the sentiment for individual entities in a message.
- You’ve moved from document-level sentiment to entity-level sentiment, so your analysis can be much more powerful.
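To see why entity-level sentiment matters in the Coke and Pepsi case, here's a small Python sketch. The `Entity` record and its fields are hypothetical, invented for illustration; they aren't Salience's or DataSift's actual output schema, and the scores are made up.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    # Hypothetical record: one extracted entity and the sentiment attached to it.
    name: str
    kind: str
    sentiment: int


def sentiment_for(entities, name):
    """Return the sentiment scores recorded for the named entity (case-insensitive)."""
    return [e.sentiment for e in entities if e.name.lower() == name.lower()]


# Made-up analysis of: "Coke tastes great, but Pepsi was a letdown."
entities = [
    Entity("Coke", "Product", 3),
    Entity("Pepsi", "Product", -2),
]

coke = sentiment_for(entities, "coke")
pepsi = sentiment_for(entities, "pepsi")

# A naive document-level figure: the two opinions partly cancel out.
document_level = sum(e.sentiment for e in entities)
```

In this toy case, a single document-level number hides the fact that the two brands pull in opposite directions, while the per-entity scores keep them apart.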
So how does it work? First, Salience performs a part-of-speech scan that uses a statistical model to recognize the key elements from the sentence structure. Another model then performs entity recognition to classify those elements: the companies, people, places, products, and so on.
To understand this idea, think of the last time you read a magazine article; you probably knew that one particular word represented a company, even if it was a company you'd never heard of. Also, you probably appreciated that two other words, adjacent to each other, were the name of an executive at that company. Let me give you an example:
Acme Internet announced today its new high-speed dialup service. Founder John Ode said that customers have been camping outside its Mountain View headquarters for days to grab the first Acmodems. Call (555) 123-4567 for more information.
Of course, there are always exceptions. T.J. Maxx is a US company, but a reader unfamiliar with the US retail sector might guess that it was an individual unless the content provided clarification. This message is clear:
T.J. Maxx announced Q1 profits today that exceeded analyst expectations. The stock rose 3% in pre-market trading.
While this message might be unclear:
Did you read that piece about T.J. Maxx in the WSJ today?
The key point to keep in mind here is that the system is based on statistical models, not lists. This choice is deliberate because it enables us to recognize an entity without first configuring it in the system. That matters when, for example, a news story breaks. Some of the names or organizations involved might be unknown to the general public, but entity detection will work immediately, allowing you to track the story as soon as it breaks.
Topics help you to understand what a post is really about. Topic analysis is based on an analysis of the complete text of Wikipedia. The Topics engine extracted 1.1 million words and bi-grams (pairs of words) from Wikipedia and examined how often these words occurred and how they were linked to other articles to determine the ‘distance’ between these words. For example: "cat" is close to "tiger", it's less close to "mammal", and it's less close still to "car". The issue of relative distance is useful: "cat" is closer to "dog" than it is to "fish", for instance.
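As a rough intuition for how link structure can yield a 'distance', here's a toy Python sketch that uses Jaccard distance over sets of linked articles. This is a stand-in technique chosen for simplicity, not Salience's actual algorithm, and the link sets are invented for illustration.

```python
def jaccard_distance(a, b):
    """Distance in [0, 1]: 0 for identical link sets, 1 for no shared links."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Invented link sets; a real system would derive these from Wikipedia.
links = {
    "cat": {"felidae", "carnivore", "mammal", "pet"},
    "tiger": {"felidae", "carnivore", "mammal", "asia"},
    "mammal": {"animal", "vertebrate", "carnivore", "pet"},
    "car": {"vehicle", "engine", "wheel"},
}

# Print how far each word is from "cat" under this toy measure.
for other in ("tiger", "mammal", "car"):
    print(other, round(jaccard_distance(links["cat"], links[other]), 2))
```

With these made-up sets, "tiger" comes out closest to "cat", "mammal" further away, and "car" furthest, which mirrors the ordering described above.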
The engine works by using internal definitions. For instance, here’s the definition for food:
Food : food, meals, vegetable, meat, fruit
A post doesn't need to contain "food", "meals", "vegetable", "meat", or "fruit" verbatim; even references to these elements are sufficient to trigger a match. Consider this Tweet:
Dinner party tonight. Bought chicken, scallions, cherries, corn, asparagus, ginger, bean sprouts, olives, naga salsa. #stirfry #experiment
Salience would probably recognize it as belonging to the "Food" topic, though opinion at the author's dinner table might be divided.
Just as we saw for entity analysis, the advantage of using topics is that they catch references to food without you having to configure anything in advance.
For the detection to function properly, a message has to contain one of the 1.1 million words or bi-grams. The engine currently has a database of 56 million links between these words, and it's these links that make the concept work. Essentially, the engine asks whether any of the words or bi-grams in the message (using those 56 million links) comes close to the five items in the definition list for "food".
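The matching step can be sketched in Python as follows. Everything here is a simplification under stated assumptions: the `LINKS` table, its weights, and the `THRESHOLD` value are invented for illustration, the sketch handles single words only (no bi-grams), and the real engine's scoring isn't described in this post.

```python
import re

# Definition list for the topic, taken from the "Food" example above.
FOOD = {"food", "meals", "vegetable", "meat", "fruit"}

# Invented relatedness weights between words (1.0 = very close, 0.0 = unrelated).
LINKS = {
    ("chicken", "meat"): 0.8,
    ("cherries", "fruit"): 0.9,
    ("asparagus", "vegetable"): 0.9,
    ("party", "meals"): 0.2,
}

THRESHOLD = 0.5  # Assumed cut-off for "close enough".


def matching_terms(text, definition, links=LINKS, threshold=THRESHOLD):
    """Return the words in text that are close to any term in the definition."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = set()
    for word in words:
        for term in definition:
            # A word matches if it is a definition term or is linked closely enough.
            if word == term or links.get((word, term), 0.0) >= threshold:
                hits.add(word)
    return hits


tweet = "Dinner party tonight. Bought chicken, scallions, cherries, corn, asparagus"
hits = matching_terms(tweet, FOOD)
```

Here the Tweet matches the topic through "chicken", "cherries", and "asparagus", even though none of the five definition words appears in it. "party" is linked to "meals" too, but its invented weight falls below the threshold.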
The technique is capable of capturing subtle differences. For example: "I like chicken" maps to "food" but "I like chickens" maps to "agriculture".
The advantage to DataSift customers is that we now have an easy way to capture and report on broad and important classes.
Our Salience augmentation has advanced a long way since we started. By combining traditional sentiment analysis with entity detection and topic analysis, you can create some remarkably powerful and subtle filters. We'll Tweet some examples in the coming days.