Richard Caudle | 14th July 2015
Identifying and expanding on keywords and terms is a key challenge when filtering, classifying and analyzing text data. We're always looking at how we can make this challenge easier. One area we've been researching is finding relationships between words using word2vec.
Today we've released a tool which allows you to explore relationships between words. We've also created our first Keyword Relationship Model for you to explore. This model represents over three million unique terms in a 400-dimensional space. The example model was built by processing almost 4 billion social interactions in English posted during March and April 2015.
The model is extremely useful as it reflects how language and keywords are really used by people. You can use the model (via the explorer tool) to improve your filters, classification rules and analysis by making sure you include the key terms, phrases and misspellings used around your topic and keep track of them as they evolve over time. Similarity between terms, rather than being statically defined based on a structured taxonomy, is calculated by analysis of how people express themselves on social networks.
In the tool you can interact with the model by exploring terms, their similarity and how they relate to each other. You can start with any word, whether formal or informal, including misspellings and neologisms, as long as:
- It is composed of a single term (only unigrams are currently included in the model)
- It appears at least forty times in the dataset
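The two conditions above can be sketched as a simple vocabulary check. This is only an illustration of the idea, not how the model was actually built: the function names `build_vocabulary` and `is_explorable` are hypothetical, and the tiny token list stands in for the real dataset.

```python
from collections import Counter

MIN_COUNT = 40  # threshold stated in the post


def build_vocabulary(tokens, min_count=MIN_COUNT):
    """Keep only the terms that occur at least `min_count` times."""
    counts = Counter(tokens)
    return {term for term, n in counts.items() if n >= min_count}


def is_explorable(term, vocabulary):
    """A term qualifies if it is a single word and is in the vocabulary."""
    term = term.strip()
    return " " not in term and term in vocabulary


# Toy data (made up): "coffee" reaches the threshold, "latte" does not.
tokens = ["coffee"] * 40 + ["latte"] * 39
vocabulary = build_vocabulary(tokens)

print(is_explorable("coffee", vocabulary))       # single term, frequent enough
print(is_explorable("latte", vocabulary))        # single term, too rare
print(is_explorable("iced coffee", vocabulary))  # two terms, not a unigram
```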
For example you could try exploring:
- Common terms - thanks, wish, love
- Companies - starbucks, nikon, bmw
- Products - f150, coke, applewatch
- People - obama, #ericschmidt, #nicolascage
- Places - london, france, beach
- Events - #oscar, eurovision, superbowl
- Occasions - holiday, wedding, birthday
- Hashtags - #data, #food, #fail
Note that punctuation has been removed from terms in the model, so to look up “f-150” you need to enter “f150”.
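If you are preparing terms programmatically before looking them up, a small normalization step along these lines may help. It is a sketch under assumptions: the helper `normalize_term` is hypothetical, lowercasing is inferred from the lowercase examples above, and keeping the leading "#" is inferred from the hashtag examples. The model's exact tokenization rules may differ.

```python
import re


def normalize_term(term: str) -> str:
    """Lowercase a term and drop punctuation, keeping a leading '#'."""
    term = term.strip().lower()
    # Assumption: hashtags keep their leading '#', as in "#data" above.
    prefix = "#" if term.startswith("#") else ""
    # Drop everything that is not a letter, digit or underscore.
    body = re.sub(r"[^\w]", "", term)
    return prefix + body


print(normalize_term("f-150"))
print(normalize_term("#EricSchmidt"))
```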
After you choose an initial seed term, the top 10 related words are shown, sorted by similarity: the first term is the most similar to the seed and the last is the least similar in the list.
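The ranking can be sketched as follows. This assumes cosine similarity, which is the usual measure for word2vec vectors and matches the -1 to 1 scale shown in the tool, but the post does not state which measure is used. The three-dimensional vectors are made up for illustration (the real model uses 400 dimensions), and `cosine` and `most_similar` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(seed, vectors, top_n=10):
    """Return up to `top_n` (term, similarity) pairs, most similar first."""
    seed_vec = vectors[seed]
    scored = [
        (term, cosine(seed_vec, vec))
        for term, vec in vectors.items()
        if term != seed  # the seed itself is not listed
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors, invented for this example only.
toy_vectors = {
    "coffee": [0.9, 0.1, 0.0],
    "espresso": [0.85, 0.05, 0.0],
    "latte": [0.8, 0.2, 0.1],
    "beach": [0.0, 0.1, 0.9],
}

for term, score in most_similar("coffee", toy_vectors, top_n=3):
    print(f"{term}: {score:.3f}")
```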
You can then click on any of the related terms to see the terms it relates to, and so on, allowing you to explore the model to many levels of depth. At any point you can hover over a word to see the similarity (on a scale of -1 to 1) between that word and the initial seed.
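Clicking through related terms level by level amounts to a breadth-first walk over the model. The sketch below assumes the related terms for each word are already available in a plain dictionary. That mapping, its contents and the `explore` function are hypothetical, and nothing here reflects how the tool itself is implemented.

```python
from collections import deque


def explore(seed, related, max_depth=2):
    """Breadth-first walk from `seed`; returns {term: depth reached at}."""
    depths = {seed: 0}
    queue = deque([seed])
    while queue:
        term = queue.popleft()
        if depths[term] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbour in related.get(term, []):
            if neighbour not in depths:  # skip terms already visited
                depths[neighbour] = depths[term] + 1
                queue.append(neighbour)
    return depths


# Made-up related-term lists, standing in for the tool's results.
related_terms = {
    "coffee": ["espresso", "latte"],
    "latte": ["cappuccino", "coffee"],
    "cappuccino": ["macchiato"],
}

print(explore("coffee", related_terms, max_depth=2))
```

With `max_depth=2` the walk stops before expanding "cappuccino", so "macchiato" is not reached; raising the depth would include it.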
You can read more here about how we built the model using word2vec.