In a previous post I introduced you to the new DataSift library, a repository of extremely useful classifiers that will help you build great social solutions quickly. In this post I'll use a practical example to show just how useful the classifiers are and how they can help you start tackling a real-world scenario in minutes.
Analysing Sony's PS4 Launch
On the 14th of November at 11pm eastern time, Sony held their launch event for the PlayStation 4. Naturally there was a large volume of conversation taking place on social platforms around this time.
I decided to dig into this conversation as I wanted to learn:
Which news providers were the most influential amongst the interested audience
What types of people engaged with the PS4 launch
How each of the above changed over time
Weapons Of Choice
To carry out my analysis I would need to:
Extract historical Twitter data for the period around the event
Push this data to a store where I could perform my analysis
Aggregate and visualise the data so I could look for insights
Although this sounds like a lot of work, with the right tools it was quick and easy:
DataSift Historics - provides straightforward access to historic data
MySQL Destination - the new MySQL destination makes it really easy to push data to a MySQL database
Tableau - an awesome tool that makes data visualisation a breeze. It is able to pull data directly from a MySQL database.
So these three work perfectly in harmony, but I am missing one piece. How could I make sense of the data?
Tableau is a great tool, but I needed to deliver data to it in a structure that made sense for my analysis. In this case I needed to add the following metadata to each social interaction so that I could aggregate data appropriately:
The news provider whose content is being shared
The profession of the user who posted the interaction
Classifiers To The Rescue
Step in the DataSift Library, the hero of our piece. By using classifiers from the library I could attach the required metadata through tagging rules, taking advantage of the new VEDO features.
By using three library items my life became incredibly easy:
Twitter Source - This classifier tells you what application was used to generate content, and splits these applications into buckets including 'manual' (created by a human) and 'automatic' (created by a bot or automated service). As I wanted to focus on engagement by real people, I used this classifier to exclude automated tweets.
News Providers & Topics - This classifier identifies mentions of well known news providers by inspecting domains of links being shared in content. I used this classifier to tag content appropriately with the news provider being mentioned.
Professions & Roles - This classifier identifies the profession of Twitter users from their profile. I used this classifier to tag content with the profession of the author.
Each of the items in the library has a hash, found on its library page, which is used to apply it to a stream.
Building My Stream
I started by writing a filter to match tweets relating to the launch event. I decided to look for interactions that mentioned PlayStation and keywords (launch, announces, announce). I looked both in the content of the post and also in the title of any content being shared through links.
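To make the matching logic concrete, here is a minimal Python sketch of the same idea. It is not CSDL and not the filter I actually ran. It only mirrors the rule described above, under the assumption that "PlayStation" and one of the keywords must appear in the same field (the post content or a single link title). The function names are hypothetical.

```python
import re

# Terms taken from the description above; matching is on whole words.
BRAND = "playstation"
EVENT_KEYWORDS = ("launch", "announces", "announce")


def mentions_launch(text):
    """Return True if text mentions PlayStation and at least one event keyword."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return BRAND in words and any(keyword in words for keyword in EVENT_KEYWORDS)


def matches_filter(content, link_titles=()):
    """Check the post content and the title of each shared link (assumed fields)."""
    return mentions_launch(content) or any(mentions_launch(t) for t in link_titles)


if __name__ == "__main__":
    print(matches_filter("Sony announces PlayStation 4 launch date"))
    print(matches_filter("Check this out", ["PlayStation 4 launch event live"]))
    print(matches_filter("I love my PlayStation"))
```

Because the sketch matches whole words, a variant such as "launches" would not match unless it were added to the keyword list.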
Note: Links are automatically expanded and decorated with metadata by the Links augmentation.
With my filter created I now wanted to apply metadata to the matching interactions. I used the tags keyword to import the library items I chose, each with the appropriate hash:
My complete stream definition therefore became:
Data With Meaning
The advantage of using the library definitions is that the data is given meaning through metadata, so that when it reaches the application (Tableau in this case) it is much easier to process.
When an interaction arrives it will have a tag_tree property, containing any matching tags. So for example due to the library classifiers I might receive an interaction with the following metadata under interaction.tag_tree:
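As a rough illustration of working with such a property, the Python sketch below flattens a nested tag tree into dotted paths. The sample interaction, its tag names and its values are invented for the example and are not the real output of the library classifiers. Only the tag_tree property name comes from the text above.

```python
def flatten_tags(tree, prefix=""):
    """Flatten a nested tag tree into {"dotted.path": value} pairs."""
    flat = {}
    for key, value in tree.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested namespaces, extending the dotted path.
            flat.update(flatten_tags(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical sample: tag names and values are made up for illustration.
sample_interaction = {
    "interaction": {
        "content": "Made-up sample tweet about the launch",
        "tag_tree": {
            "source": {"category": "manual"},
            "news": {"provider": "CNET"},
            "profession": "journalist",
        },
    }
}

flat_tags = flatten_tags(sample_interaction["interaction"]["tag_tree"])

if __name__ == "__main__":
    for path, value in sorted(flat_tags.items()):
        print(path, "=", value)
```

Flat paths like these map naturally onto database columns, which is what makes the tagged data easy to aggregate later on.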
Gathering The Data
To store the historic data, I set up a MySQL Destination as the destination for the historic query.
I created a fresh new database and used this script to create the necessary tables. I used this mapping file as my mapping definition.
I will cover the MySQL destination very soon in a separate post. You can read the technical documentation here.
I then ran a historic query for Twitter between 13th November 2013 5am GMT and 16th November 5am GMT. As 5am GMT is midnight eastern time, this gave me data for the day before, the day of and the day after the launch event.
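The time window arithmetic can be checked with a few lines of Python. This is only a sanity check of the dates quoted above. It assumes a fixed UTC-5 offset for eastern time, which holds in mid-November because US daylight saving time ended on 3rd November 2013.

```python
from datetime import datetime, timedelta, timezone

# Eastern Standard Time as a fixed offset (valid for mid-November 2013).
EASTERN = timezone(timedelta(hours=-5), "EST")

# Query window as stated in the post, expressed in GMT/UTC.
start = datetime(2013, 11, 13, 5, 0, tzinfo=timezone.utc)
end = datetime(2013, 11, 16, 5, 0, tzinfo=timezone.utc)

# Launch event: 14th November, 11pm eastern time.
launch = datetime(2013, 11, 14, 23, 0, tzinfo=EASTERN)

start_local = start.astimezone(EASTERN)
window = end - start
launch_in_window = start <= launch < end

if __name__ == "__main__":
    print("Window starts (eastern):", start_local.isoformat())
    print("Window length:", window)
    print("Launch (UTC):", launch.astimezone(timezone.utc).isoformat())
    print("Launch inside window:", launch_in_window)
    print("Unix start/end:", int(start.timestamp()), int(end.timestamp()))
```

The window starts at midnight eastern time on the 13th and covers exactly three days, with the launch falling on the second of them.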
Finally I wrote some simple SQL views to augment the data ready for Tableau.
I used the tagged_interactions view for my analysis in Tableau.
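For readers who want a feel for what such a view does, here is a small sketch that pivots tag rows into one column per tag. It is an illustration under stated assumptions and not my actual schema: the interaction and tag tables, their columns and the sample rows are all hypothetical, and it uses Python's built-in sqlite3 module in place of MySQL so that it runs anywhere. Only the tagged_interactions view name comes from this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables: one row per interaction, one row per attached tag.
conn.executescript("""
CREATE TABLE interaction (id TEXT PRIMARY KEY, created_at TEXT, content TEXT);
CREATE TABLE tag (interaction_id TEXT, namespace TEXT, value TEXT);

-- Pivot the tag rows so each interaction has one column per tag namespace.
CREATE VIEW tagged_interactions AS
SELECT i.id,
       i.created_at,
       MAX(CASE WHEN t.namespace = 'source' THEN t.value END) AS source,
       MAX(CASE WHEN t.namespace = 'provider' THEN t.value END) AS provider,
       MAX(CASE WHEN t.namespace = 'profession' THEN t.value END) AS profession
FROM interaction i
LEFT JOIN tag t ON t.interaction_id = i.id
GROUP BY i.id, i.created_at;
""")

# Made-up sample data, purely for illustration.
conn.executemany(
    "INSERT INTO interaction VALUES (?, ?, ?)",
    [
        ("a1", "2013-11-14 23:05:00", "sample tweet one"),
        ("a2", "2013-11-15 09:30:00", "sample tweet two"),
    ],
)
conn.executemany(
    "INSERT INTO tag VALUES (?, ?, ?)",
    [
        ("a1", "source", "manual"),
        ("a1", "provider", "CNET"),
        ("a1", "profession", "journalist"),
        ("a2", "source", "automatic"),
    ],
)

rows = conn.execute(
    "SELECT id, source, provider, profession FROM tagged_interactions ORDER BY id"
).fetchall()

if __name__ == "__main__":
    for row in rows:
        print(row)
```

A view shaped like this gives Tableau a single flat table, with missing tags showing up as NULL and not as absent rows.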
Create A Tableau Dashboard
With the metadata attached to each interaction and exposed through my view, analysis with Tableau was quick and painless. I used Tableau to follow how coverage and engagement changed from before until after the launch.
Firstly I created a time series chart, showing the volume of tweets and retweets during the three-day period. I plotted the created_at time against the number of records.
I then created a chart for both professions and news providers. The columns I created for the tags were perfect for Tableau.
For professions, I first filtered by tweets that were created manually (thanks to the Twitter Source library item). I then plotted profession (thanks to the Professions & Roles library item) against the number of records.
Similarly for news providers, I first filtered by tweets that were created manually and then plotted provider (thanks to the News Providers & Topics library item) against the number of records.
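Both charts boil down to the same aggregation: keep the manually created tweets, then count records per tag value. The Python sketch below shows that logic outside Tableau. The row layout and sample values are hypothetical and stand in for rows read from the view.

```python
from collections import Counter


def count_by(rows, column, source="manual"):
    """Count records per value of `column`, keeping only rows from `source`."""
    return Counter(
        row[column]
        for row in rows
        if row.get("source") == source and row.get(column)
    )


# Made-up rows for illustration; column names are assumptions.
sample_rows = [
    {"source": "manual", "profession": "journalist", "provider": "CNET"},
    {"source": "manual", "profession": "technologist", "provider": None},
    {"source": "automatic", "profession": "journalist", "provider": "CNET"},
    {"source": "manual", "profession": "journalist", "provider": "The Guardian"},
]

profession_counts = count_by(sample_rows, "profession")
provider_counts = count_by(sample_rows, "provider")

if __name__ == "__main__":
    print(profession_counts.most_common())
    print(provider_counts.most_common())
```

Rows without a value for the chosen column are skipped, so untagged tweets do not appear as an empty bar in either chart.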
Finally I used the time series chart as a filter for the other two charts, combining the three on a dashboard so that I could look at activity across different time periods.
Day Before Launch
The day before the launch, the major news providers didn't see great engagement from their audience. The Guardian and CNET managed to generate the most engagement. Certainly both providers have a reputation for excellent technology news coverage.
The launch at this point engaged journalists, technologists, entrepreneurs and creative professions. This story is what I would have expected as these groups follow technology and gaming trends most closely in my experience.
Day Of Launch
The day of the launch, the audience engages more widely with a variety of providers. Digital Spy takes a huge lead though, with users sharing its content most widely.
As far as professions are concerned, the audience is split similarly. The news of the launch has not been engaged with widely so far.
Day After Launch
As the launch was at 11pm, it's not that surprising that only fans would have engaged on the day itself.
However, looking at the engagement after the launch, at this point major providers such as USA Today and CNN are managing to engage their audiences. As you might expect, the number of professions engaging broadens as the news of the launch is covered more in the mainstream press.
At this point I could have dug deeper. For instance I could have included further news providers that may specialise in such an event. This example though was designed to show how quickly the library can get you started.
Until Next Time...
It's difficult to convey in a blog post, but by using the tagging features of VEDO, the ready-made library definitions and the new MySQL destination, what could have been a couple of days' work was complete in a couple of hours.
Not having to post-process the data by writing and running scripts, and having reusable rules I could just pull out of the library, made the task so much easier. Also the MySQL destination makes the process of extracting data and visualising it in Tableau completely seamless.
I hope this post gives you some insight into how these tools can be combined effectively. Please take a look at the ever-growing library and see how it can make your life easier.
To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed