Richard Caudle | 14th July 2015
You might have seen our recent announcement covering many things including the launch of our new FORUM data science initiative. We're looking to share more of our experience and research to help innovation in social data analysis.
One of the first things we wanted to share is our work exploring the relationships between keywords and terms in social posts. Our data science team has been researching this area using word2vec - a data science library that models the relationship / similarity between words.
Using word2vec we've started to create models that are trained from large numbers of historical social posts. We're calling these Keyword Relationship Models and have released our first today. You can explore the model using our Keyword Relationship Explorer tool which we announced in another post.
These models are extremely useful because they reflect how people really use language and keywords. You can use them to improve your filters, classification rules and analysis by making sure you include the key terms, phrases and misspellings used around your topic, and by keeping track of them as they evolve over time.
In this post we thought we'd share more detail on how we trained the underlying model.
Challenges of working with keywords
Identifying and expanding on keywords and terms is a key challenge when filtering, classifying and analyzing text data.
The two key challenges you face when working with keywords are:
- Coverage - it’s not easy to get a comprehensive list of all the keywords representing a concept. You need to consider variations, synonyms, misspellings, new terms, slang and so on.
- Noise - often keywords may have multiple meanings and therefore their ambiguity could introduce irrelevant data in the classification. This leads to false positives in your results.
Keyword Relationship Models help with both challenges. Starting from a seed term such as 'coffee', we can easily learn similar terms to include (such as cappuccino, caffeine, cafe) and also spot words to exclude (such as bean, plant, roasting). Of course, which terms you include or exclude depends on your use case.
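A minimal sketch of how a list of related terms could feed a filter, assuming the model returns (term, score) pairs for a seed. The function name `expand_keywords`, the score threshold and the scores themselves are made up for illustration and are not output from our model.

```python
def expand_keywords(seed, candidates, exclude=(), min_score=0.5):
    """Return the seed plus candidate terms that pass the score and exclusion checks."""
    excluded = {term.lower() for term in exclude}
    kept = [
        term
        for term, score in candidates
        if score >= min_score and term.lower() not in excluded
    ]
    return [seed] + kept


# Illustrative (term, similarity) pairs; the scores are invented for this sketch.
candidates = [
    ("cappuccino", 0.82),
    ("caffeine", 0.77),
    ("cafe", 0.74),
    ("bean", 0.70),
    ("roasting", 0.66),
    ("plant", 0.41),
]

# Exclusions depend on the use case, here a filter about coffee as a drink.
keywords = expand_keywords("coffee", candidates, exclude=["bean", "roasting", "plant"])
print(keywords)
```

The exclusion list stays a manual decision: the model only suggests related terms, and whether 'bean' is relevant or noise depends on what you are trying to capture.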
How we build Keyword Relationship Models
A Keyword Relationship Model is based upon word2vec, an open-source library that uses a shallow neural network to represent words and n-grams as vectors in a multidimensional space. Essentially, word2vec processes raw text and produces a model that represents words in a vector space. In this high-dimensional space, the closeness of two vectors gives words a measurable notion of similarity.
We created the example model through the following steps.
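To make the idea of similarity in a vector space concrete, here is a small sketch using cosine similarity, the measure commonly used to compare word vectors. The three-dimensional vectors are toy values invented for this example; a real model has hundreds of dimensions, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
toy_vectors = {
    "coffee": [0.9, 0.8, 0.1],
    "cappuccino": [0.8, 0.9, 0.2],
    "football": [0.1, 0.2, 0.9],
}


def most_similar(seed, vectors, topn=2):
    """Rank every other word by cosine similarity to the seed word."""
    scores = [
        (word, cosine_similarity(vectors[seed], vec))
        for word, vec in vectors.items()
        if word != seed
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranked = most_similar("coffee", toy_vectors)
print(ranked)
```

Words used in similar contexts end up with vectors pointing in similar directions, so 'cappuccino' ranks above 'football' for the seed 'coffee' in this toy data.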
1) Retrieved historic data
We pulled together 2 months of posts in English from our historic archives of social network traffic.
In total this set of interactions was almost 4 billion posts.
2) Pre-processed content
Next we pre-processed content in order to standardize tokens.
Pre-processing steps included:
- Removed non-ASCII characters
- Lower-cased all content
- Separated consecutive hashtags and mentions with a white space
- Replaced newlines with whitespaces
- Replaced &amp; and & with " and "
- Replaced other HTML entities with a white space
- Removed hyphens within words
- Removed single quotes within words
- Replaced single and consecutive punctuation characters with a single white space
- Collapsed whitespaces - consecutive whitespaces merged into a single one
- Kept alphanumeric strings
- Kept hashtags
- Replaced values for percentage, currency, time, date, mentions, numbers, links with unique placeholder tokens
- Removed duplicate posts
We then applied tokenization to the cleansed content based on whitespace.
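The steps above can be sketched with a few regular expressions. This is an illustration under assumptions rather than the pipeline we ran: it covers only a subset of the steps (time and date placeholders and duplicate removal are left out), the order of operations is a guess, and the placeholder token names such as `_url_` are hypothetical.

```python
import re

# Placeholder token names are hypothetical; the real ones are not listed in this post.
PLACEHOLDER_RULES = [
    (re.compile(r"https?://\S+"), " _url_ "),
    (re.compile(r"@\w+"), " _mention_ "),
    (re.compile(r"[$£€]\d+(?:\.\d+)?|\d+(?:\.\d+)?[$£€]"), " _currency_ "),
    (re.compile(r"\d+(?:\.\d+)?%"), " _percent_ "),
    # Standalone numbers only, so hashtags and alphanumeric strings are kept.
    (re.compile(r"(?<![#\w])\d+(?:\.\d+)?(?!\w)"), " _number_ "),
]


def preprocess(text):
    """Normalize one post using a subset of the steps listed above."""
    text = text.lower().replace("\n", " ")
    text = text.replace("&amp;", " and ")
    text = re.sub(r"&#?\w+;", " ", text)  # other HTML entities
    text = text.replace("&", " and ")
    for pattern, token in PLACEHOLDER_RULES:
        text = pattern.sub(token, text)
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII
    text = re.sub(r"(?<=\w)[-'](?=\w)", "", text)  # hyphens and quotes inside words
    text = re.sub(r"(?<=\S)#", " #", text)  # separate consecutive hashtags
    text = re.sub(r"[^\w\s#]+", " ", text)  # punctuation runs become one space
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


def tokenize(text):
    """Pre-process a post, then split it on whitespace."""
    return preprocess(text).split()


sample = ("Tom I Love this new #smartphone :)!, the one I found "
          "costs 300$ here http://bit.ly/2r2fas")
tokens = tokenize(sample)
print(tokens)
```

Placeholders are substituted before punctuation is stripped so that links and prices are still intact when they are matched, and the underscore-delimited token names survive the punctuation step because underscores count as word characters.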
For example if the original piece of content read as follows:
Tom I Love this new #smartphone :)!, the one I found costs 300$ here http://bit.ly/2r2fas
After pre-processing we arrived at:
3) Processed dataset using word2vec
After these steps we processed the corpus with word2vec (skip-gram version, 400 dimensions) to obtain the final model. Terms with fewer than 40 occurrences were not included. All other terms, over 3 million in total, are part of the model, including stopwords and placeholder tokens.
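For readers who want to try similar settings, here is a rough sketch using the gensim library. Using gensim for training is an assumption of this example, the parameter names follow gensim 4.x (older releases use `size` instead of `vector_size`), and `train_model` with its two path arguments is a hypothetical helper.

```python
# Settings taken from the text: skip-gram, 400 dimensions, minimum 40 occurrences.
# Parameter names follow gensim 4.x, which is an assumption of this sketch.
TRAINING_PARAMS = {"sg": 1, "vector_size": 400, "min_count": 40}


def train_model(corpus_path, output_path):
    """Train on a text file holding one whitespace-tokenized post per line."""
    # Imported here so the settings above can be inspected without gensim installed.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    model = Word2Vec(sentences=LineSentence(corpus_path), **TRAINING_PARAMS)
    # Save the word vectors in the binary word2vec format.
    model.wv.save_word2vec_format(output_path, binary=True)
    return model
```

`LineSentence` streams the corpus from disk instead of loading it into memory, which matters for a dataset of billions of posts.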
4) Queried word2vec model
Once the model was trained we could then query the model for similar words based on a seed term. For example in Python:
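    from gensim.models import KeyedVectors

    # Recent gensim releases load word2vec-format files through KeyedVectors
    model = KeyedVectors.load_word2vec_format('<path to model file>', binary=True)
    similar_words = model.most_similar(positive=['#nfl'], topn=10)
    print(similar_words)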
This is how our explorer tool was built.
You can start exploring our first model straight away using our Keyword Relationship Explorer tool.
Watch this space as we'll be releasing more models in future and giving you more ways to use them in your own projects.
Why not register your interest in our FORUM initiative so you can learn more about bringing data science to your projects?