Blog posts in Engineering

Hiroaki Watanabe
Updated on Monday, 24 March, 2014 - 11:30
At the heart of DataSift’s social data platform is a filtering engine that allows companies to target the text, content and conversations that they want to extract for analysis. We are proud to announce that we have expanded our platform to include Japanese, one of the fastest growing international markets for Twitter.
 

Principles Of Tokenization

Supporting Japanese brings new challenges for how we accurately filter, identify and extract relevant content and conversations. The main challenge to overcome is that Japanese, unlike Western languages, is written without word boundaries (i.e. whitespace).
  
Imagine tackling this challenge in English: try to recover a meaningful sentence from the following unsegmented sequence of characters, taken from the first sentence of Lewis Carroll’s "Alice's Adventures in Wonderland".
 
Alicewasbeginningtogetverytiredofsittingbyhersisteronthebank,ando
fhavingnothingtodo:onceortwiceshehadpeepedintothebookhersisterw
asreading,butithadnopicturesorconversationsinit,'andwhatistheuseof
abook,'thoughtAlice'withoutpicturesorconversation?'
 
You may find it easy to complete this task, but this exercise involves two important aspects of Natural Language Processing (NLP). From an algorithmic point of view:
 
  • Once we have several options for where word boundaries sit (Ali? Alice? Alicew?), the number of possibilities can, in the worst case, increase exponentially, and
  • A numerical score can help to rank the possible outcomes.
Let us see how these two points are relevant to Japanese Tweets. The following five characters form a popular sentence that can be tokenized into two meaningful blocks of characters:
 
まじ面白い    == (tokenization) ==>     まじ  面白い
 
in which a white space is inserted between “じ” and “面”. In NLP, this process is called “tokenization” or "word chunking".
 
The meaning of this sentence is “seriously (まじ) interesting (面白い)". The first two characters, まじ, are a popular slang term often attached to sentiment-bearing words. Although “まじ” is a good indicator for sentiment, the same characters also appear in other common words (e.g., おまじない [good luck charm], すさまじい [terrible]) where the meaning of “まじ” (seriously) is no longer present.
 
This simple Japanese case study highlights that:
 
  • You cannot rely on a simple string-searching algorithm for keywords (i.e. searching for the sub-string まじ anywhere within the text), as it can easily introduce errors
  • The decision whether or not to tokenize can be affected by surrounding characters.

 

Approaches For Japanese Tokenization

In industry, there are two main approaches to this tokenization problem: (a) Morphological analysis and (b) N-grams. The N-gram approach systematically generates blocks of characters from training examples without considering their meaning, and produces numerical scores by counting the frequency of each block. Because of this brute-force approach, processing can be slow and memory-hungry; however, it is strong at handling new, unknown words, since no dictionary is required.
 
In DataSift's platform, we implemented the Morphological approach for Japanese tokenization since it has advantages in terms of speed and robustness to noise. One drawback of the standard Morphological approach is its difficulty in handling unknown “new words”. Imagine the case where you encounter an unknown sequence of characters in the ‘Alice’ example.
 
Our software engineers have provided a great solution for this “new words” issue by extending the standard Morphological approach. Thanks to our new algorithm, we can provide an accurate Japanese language service for noisy Japanese Tweets without needing to update a dictionary.
 

Putting It Into Practice: Tips For Japanese CSDL

If you are familiar with our filtering language (CSDL), you can apply our new Japanese tokenizer by simply adding a new modifier, [language(ja)], as follows:
 
interaction.content contains [language(ja)] "まじ" and
interaction.content contains [language(ja)] "欲しい"
 
Note that “欲しい” is “want” in English.
 
You can mix Japanese and other languages as well:
 
interaction.content contains [language(ja)] "ソニー" or
interaction.content contains "SONY"
 
Note that the keyword “ソニー” is analyzed using our Japanese tokenizer whereas our standard tokenizer is applied for the keyword “SONY” in this example.
 
Tagging (our rules-based classifier) also works for Japanese:
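The original example isn't reproduced here, but a minimal sketch of this kind of rule set (the tag names and keyword groupings are illustrative only, assuming standard CSDL tagging syntax) would be:

tag "positive" { interaction.content contains_any [language(ja)] "うれしい,楽しい" }
tag "negative" { interaction.content contains_any [language(ja)] "悲しい,楽しくない" }
return { interaction.content contains [language(ja)] "まじ" }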
 
 
Note that the first two lines contain sentiment keywords: “うれしい” (happy), “楽しい” (fun), “悲しい” (sad) and “楽しくない” (not fun).
 
Currently we support two main operators, “contains” and “contains_any”, for the [language(ja)] modifier. Our “substr” operator also works for Japanese, although it may introduce some noise, as explained above:
 
interaction.content substr "まじ"
 

Advanced Filtering - Stemming

An advanced tip to increase the number of filtering results is to consider the inflection of the Japanese language. Since Japanese is an agglutinative language, the same word stem appears in Tweets in many inflected forms. Our Morphological approach allows you to use a stem as a keyword.
 
For example, the following CSDL could find tweets containing “欲しい”, “欲しすぎて” or “欲しー”:
 
interaction.content contains [language(ja)] "欲し"
 
It’s worth mentioning that there is no perfect solution for tokenization at the moment; the N-gram approach is weak against noise, whereas the Morphological approach may not recognise some new words. If you find that a filter produces no output, you can try our “substr” operator, which is our implementation of a plain string-search algorithm.
 
The above tagging example can be converted into a version that uses “substr” as follows:
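Again as an illustrative sketch, the same rules rewritten with “substr” would look something like:

tag "positive" { interaction.content substr "うれしい" or interaction.content substr "楽しい" }
tag "negative" { interaction.content substr "悲しい" or interaction.content substr "楽しくない" }
return { interaction.content substr "まじ" }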
 
 

Working Example For Japanese Geo-Extraction

Extracting users’ geographical information is an interesting application. The following CSDL allows you to tag your filtered results using geo information, in this case Tokyo (東京).
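The original example isn't reproduced here; a minimal sketch of such a rule (the tag name is illustrative, and the use of the twitter.user.location target is an assumption on my part) might be:

tag "東京" {
    interaction.content contains [language(ja)] "東京"
    or twitter.user.location substr "東京"
}
return { interaction.content contains [language(ja)] "まじ" }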
 
 
Note that “まじ” is used as a keyword for filtering in this example.
 

In Summary

  • Tokenization is an important technique to extract correct signals from East Asian languages.
  • N-gram and Morphological analysis are the two main techniques available.
  • DataSift has implemented a noise-tolerant Morphological approach for Japanese with some extensions to handle new words accurately.
  • By adding our new modifier [language(ja)] in CSDL, you can activate our Japanese tokenization engine in our distributed system.
  • We can mix Japanese and other languages within a CSDL filter to realize unified and centralized data analysis. 
 
Richard Caudle
Updated on Thursday, 6 February, 2014 - 17:07

One key challenge for developers creating a solution is integrating data from many sources. DataSift destinations take away this headache, especially the recently released MySQL destination.

The MySQL destination allows you to map and flatten unstructured data into your database schema, avoids the need to write custom integration code, and handles realtime delivery challenges such as dropped connections so you don't have to.

Relieving Integration Challenges

The DataSift platform offers many awesome places to push data but, let's face it, we often like to see data in a good old-fashioned database. Relational databases such as MySQL are still the backbone of enterprise solutions.

Receiving a stream of unstructured data, structuring it, then pushing the data into a relational database can cause a number of headaches. The new MySQL destination makes the job straightforward so that you can concentrate on getting maximum value out of your data. It provides the following features:

  • Guaranteed delivery - Data delivery is buffered and caters for dropped connections and delivery failure
  • Delivery control - Data delivery can be paused and resumed as you require under your control
  • Data mapping - Specify precisely how you want fields (within each JSON object) to be mapped to your MySQL schema

These features combined make pushing data from DataSift into a MySQL database extremely easy.

The MySQL Destination

As with any other type of destination, the easiest way to get started is to go to the Destinations page. Choose to add a new MySQL destination to your account.

Note that the MySQL destination is only currently available to enterprise customers. Contact your sales representative or account manager if you do not see the destination listed in your account.

 

To set up the destination you need to enter a name, the host and port of your MySQL server, the destination database schema and authentication details.

You also need to provide a mappings file. This file tells the destination which fields within the JSON data you would like to be mapped to tables in your database schema. More details on this in a minute.

It's worth using the Test Connection button as this will check that your MySQL server is accessible to our platform, the database exists, the security credentials are valid and that the mapping file is valid.

Note that you can also create the destination via our API. This process is documented here.

Mapping Data To A Schema

The basic connection details above are self-explanatory, but the mapping file definitely needs a little more explanation. There are many things to consider when mapping unstructured data to a relational set of tables.

Let me take you through an example schema and mapping file to help clarify the process. These have been designed to work with Twitter data. The two files I'll be discussing are an example MySQL schema and its accompanying mapping file, covered in turn below.

MySQL Schema

In the example schema the following tables are included, which give us a structure to store the tweets.

  • interaction - Top-level properties of each interaction / tweet. All tables below reference interactions in this table.
  • hashtags - Hashtags mentioned for each interaction
  • mentions - Twitter mentions for each interaction
  • links - Links for each interaction
  • tag_labels - VEDO tags for each interaction
  • tag_scores - VEDO scores for each interaction

The example schema is quite exhaustive, please don't be put off! You can more than likely use a subset of fields and tables to store the data you need for your solution. You might also choose to write views that transform data from these tables to fit your application.

Now's not the time to cover MySQL syntax; I'm sure if you're reading this post you'll be used to creating schemas. Instead I'll move on to the mapping file, which is where the magic lies.

Mapping File

The mapping file allows you to specify what tables, columns and data types the raw data should be mapped to in your schema. I can't cover every possibility in one post, so for full details see our technical documentation pages. To give you a good idea though, I'll pick out some significant lines from the example mapping file.

Let's pretend we have the following interaction in JSON (I removed many fields for brevity):
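The original snippet isn't reproduced here, so the sketch below shows a heavily simplified interaction containing just the fields referenced in the mapping examples that follow (real interactions carry many more fields, and the values are invented):

{
  "interaction": {
    "id": "1e3abc...",
    "created_at": "Fri, 17 Jan 2014 10:23:45 +0000",
    "content": "Loving the new launch #social #marketing",
    "hashtags": ["social", "marketing"],
    "tag_tree": { "sentiment": ["positive"], "quality": 0.75 }
  },
  "twitter": {
    "retweeted": { "id": "419123..." }
  }
}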

 

Tables, Datatypes & Transforms

The first line tells the processor you want to map the following columns of the 'interaction' table to fields in the JSON structure.

[interaction]

The next line tells the processor to map the path interaction.id to the interaction_id column of the table:

interaction_id = interaction.id

Skipping a couple of lines, the following tells the processor to map interaction.created_at to the created_at column. You'll notice though that we have additional data_type and transform clauses.

created_at = interaction.created_at (data_type: datetime, transform: datetime)

If you don't explicitly specify a data_type then the processor will attempt to decide the best type for itself by inspecting the data value. In the majority of cases this is perfectly ok, but in this line we ensure that the type is a datetime.

The transform clause gives you access to some useful functions. Here we are using the datetime function to cast the string value in the data to a valid datetime value.

Later on for the same table you'll see this line which uses a different transform function:

is_retweet = twitter.retweeted.id (data_type: integer, transform: exists)

Here the function will return true if the JSON object has this path present, otherwise it will return false.

 

Iterators

Now let's move down to the hashtags table mapping. You'll see this as the first line:

[hashtags :iter = list_iterator(interaction.hashtags)]

This table mapping uses an iterator to map the data from an array to rows in a table. The line specifies that any items within the interaction.hashtags array should each be mapped to one row of the hashtags table. For our example interaction, a row would be created for each of 'social' and 'marketing'.

Note that we can refer to the current item in the iterator by using the :iter variable we set in the table mapping declaration:

hashtag = :iter._value

Here _value is a reserved property representing the value of the item in the array. You can also access _path, which is the relative path of the value within the object. If we were using a different type of iterator, for example one over an array of objects, we could reference properties of the current object, such as :iter.id.

There are a number of iterators you can use to handle different types of data structure:

  • list_iterator - Maps an array of values at the given path to rows of a database table.
  • objectlist_iterator - Like list_iterator, but instead is used to iterate over an array of objects, not simple values.
  • path_iterator - Flattens all properties inside an object, and its sub-objects, to give you a complete list of properties in the structure.
  • leaf_iterator - Like path_iterator, but instead of flattening object properties it flattens any values in arrays within the structure into one complete list.
  • parallel_iterator - Given a path in the JSON object, this iterator takes all the arrays which are children and maps the items at each index to a row in the table. This is particularly useful for working with links.

The iterators are powerful and allow you to take deep JSON structures and flatten them to table rows. Please check out the documentation for each iterator for a concrete example.

As a further example, the following line specifies mapping for VEDO tags that appear in the tag_tree property of the interaction:

[tag_labels :iter = leaf_iterator(interaction.tag_tree)]

Here we are mapping all leaves under interaction.tag_tree to a row in the tag_labels table.

 

Conditions

The final feature I wanted to cover is conditions. These are really useful if you want to put data in different tables or columns depending on its data type.

Although this might sound unusual, returning to our example this is useful when dealing with tags and scores under the tag_tree path.

Under the mapping declaration for the tag_labels table, there is this line:

label = :iter._value (data_type: string, condition: is_string)

This states that a value should only be put in the table if the value is a string. You'll see a very similar line for the tag_scores table below, which does the same but insists on a float value. The result is that tags (which are text labels) will be stored in the tag_labels table, whereas scores (which are float values) will be stored in the tag_scores table.

That concludes our whirlwind tour of the features. Mapping files give you a comprehensive set of tools to map unstructured data to your relational database tables. With your mapping file created you can start pushing data to your database quickly and easily.

Summing Up...

This was quite a lengthy post, but hopefully it gave you an overview of the possibilities with the new MySQL destination. The key being that it makes it incredibly easy to push data reliably into your database. I've personally thrown away a lot of custom code I'd written to do the same job and now don't think twice about getting data into my databases.

To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed

Richard Caudle
Updated on Wednesday, 5 February, 2014 - 10:03

At DataSift we pride ourselves on the power and flexibility of our filtering engine. One feature customers have asked for though is the ability to use wildcards in text-based filter conditions. Wildcards are now available on the platform to make writing powerful queries even simpler.

Text Wildcards

Firstly, it's probably worth clarifying what a wildcard is in this context. The wildcard operator on DataSift allows you to match (when filtering and tagging) against text values where there is a range of possibilities.

This can be extremely useful when you need to cover a collection of similar terms, misspellings, and incorrect or absent letter accents.

Think of the wildcard operator as another weapon in your arsenal. Although not as powerful as a regular expression, a wildcard is more easily understood and created, and incurs a lower cost. Of course in some situations the precision of a regular expression may still be the best choice.

Wildcard Syntax

Imagine you have a situation where you'd like to write a filter for a term, but there are multiple variations of that term. This is common in many languages. For instance, imagine you'd like to filter for anything relating to printers and printing. Keywords would include:

print, prints, printer, printers, printing, printable

You could write a regular expression to cover these possibilities but, let's be honest, regular expressions, though powerful, can be a real headache.

Instead, using wildcards, you can now simply write:

interaction.content wildcard "print*"

Here the * matches any character 0 or more times and would match all the words in our list.

In fact there are two wildcard characters you can make use of:

  • * - Matches 0 or more characters
  • ? - Matches exactly one character

The ? character is useful when looking for strings of a known pattern, for instance when filtering for a word that is commonly misspelt, such as relevance:

interaction.content wildcard "relev?nce"

In fact you can give the wildcard operator a list of terms to match, such as:

interaction.content wildcard "relev?nce, relev?nt"

The wildcard operator is documented here.

Query Builder Support

You can use the wildcard operator in CSDL or the Query Builder tool. In Query Builder the option is listed alongside "contain any" and the other text operators:

Case Sensitivity

By default the wildcard operator is not case sensitive. You can though use the cs keyword to apply case sensitivity. For example to match common misspellings of Massachusetts:

interaction.content cs wildcard "Massachus*ts"

Often though authors are just as likely to not capitalise proper nouns, so I'd only recommend using this option when you are sure it is the appropriate implementation.

That's All For Now...

Hopefully you'll find the new operator will help you write your filters more easily than using a regular expression. Remember, it's just one of a growing list of text operators that make text matching precise and powerful!

To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed

Richard Caudle
Updated on Wednesday, 29 January, 2014 - 11:26

Recently I've covered the tagging and scoring features of DataSift VEDO. My post on scoring gave a top level overview and a simple example, but might have left you hungry for something a little more meaty. In this post I’ll take you through how we’ve started to build linear classifiers internally using machine learning techniques to tackle more complex classification projects. This post will explain what a linear classifier is, how it can help you and give you a method to get you started building your own.

What Is A Linear Classifier?

Until now you’re likely to have relied on boolean expressions to categorise your social data based on looking at data by eye. A linear classifier, made possible by our new scoring features, allows you to categorise data based on machine learned characteristics over much larger data sets.

A linear classifier is a machine learned method for categorising data. Machine learning is used to identify key characteristics in a training set of data and give each characteristic a weighting to reflect its influence. When the resulting classifier is run against new data, each piece of data is given a score for how likely it is to belong in each category. The category with the highest score is considered the most appropriate category for that piece of data.
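To put it more concretely (a simplified sketch of the idea): if f1, f2, ..., fn measure the presence of the learned characteristics (keywords and phrases) in a piece of content, then each category c is given a score of the form

score(c) = w1(c)·f1 + w2(c)·f2 + ... + wn(c)·fn

and the content is assigned to the category whose score is highest.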

Linear classifiers are not the most advanced or accurate method of classification, but they are a good match for high volume data sources due to their efficiency and so are perfect for social data. The accuracy of the classifier depends greatly on the definition of the categories, quality and size of the training set and effort to iteratively improve results through tuning.

For this post I will concentrate on how we built the customer service routing classifier in our library. This classifier is designed to help airlines triage incoming customer requests.

Development Environment

Before I start: we use Python for our data science development work, so to make use of our scripts you'll need a working Python environment with the scikit-learn libraries installed.

The Process

To build a classifier you’ll need to carry out the following steps:

  1. Define the categories you want to classify data into and the data points you need to consider
  2. Gather raw data for the training set
  3. Manually classify the raw data to form the training set
  4. Use machine learning to identify characteristics and generate scoring rules
  5. Test the rule set and iterate to improve the classifier’s accuracy

Let’s look at each of these in detail.

1. Define Your Categories & Targets

The first thing you need to consider is what categories you are going to classify your data into. It is essential to invest time upfront considering the categories, and to write a strong definition for each, including a few examples. The more precise and considered you can be here, the more efficient the learning process will be and the more useful your classifier will become.

Make sure your categories are a good fit for their eventual use. No categories should overlap, and together they should cover all possible interactions. So, for example, you might want to include an 'other' category as we did below.

For the airline classifier, we spent a good amount of time looking into the kind of conversations that surround airline customer services and were inspired by this Altimeter study. We wanted to demonstrate how conversations could be classified for handling by a customer services team.

The categories we finally decided on were:

  • Rant: An emotionally charged criticism that may damage brand image
    • “After tweeting negative comment about EasyJet, I have been refused boarding! My rights has been violated!!!”
  • Rave: Thanks or positive opinion about flight or services
    • “Landing during storm, saved by EasyJet pilot, thanks”
  • Urgency: Request for resolving a real-time issue, including compensation
    • “EasyJet Flight cancelled. I demand compensation now!”
  • Query: A polite or neutral question about how to contact the company, use the website, print boarding card etc.
    • “Where can I find EasyJet hand luggage dimensions?”
  • Feedback: Statement about the flight or service, relating to how it could be improved, including complaints for delays without big anger.
    • “Dear EasyJet, how about providing WiFi onboard”
  • Lead: Contact from a customer interested in purchasing a ticket or other product/service in the near future
    • “EasyJet, do you sell group tickets to Prague?”
  • Others: Anything that doesn’t fit into the categories above

 

As you might outsource the training process (explained later) to a third party or to colleagues, clear definitions are extremely important.

With your categories defined, you now need to consider what fields of your interactions should be considered. For our classifier we decided that the interaction.content target contained the relevant information.

2. Gather Data For The Training Set

To carry out machine learning you will need to feed the algorithm a set of training data which has been classified by hand. The algorithm will use this data to identify features (keywords and phrases) that influence how a piece of content is classified.

To form the training set you can extract data from the platform (by running a historic query or recording a stream) and then manually put each interaction into a category. If you choose to use our scripts, use one of our push destinations to collect the data as a JSON file, choosing the newline-delimited JSON format.

To gather raw data for our airline classifier we used the following filter:
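The exact filter isn't reproduced here; as a minimal sketch, a filter of this general shape (keywords illustrative, based on the EasyJet examples above) would collect the raw data:

interaction.type == "twitter" and
interaction.content contains_any "easyjet, easy jet"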

We ran this filter as a historic query to collect a list of 2000 interactions as an initial training set. Of course the more data you are able to manually classify, the higher quality your final classifier is likely to be.

NOTE: Remember to remove any duplicates from the dataset you extract. DataSift guarantees to deliver each interaction at least once; if there is a push failure we will try to push the data again, and you may receive duplicate interactions. If you are on a UNIX platform you can deduplicate at the command line:

sort raw.json | uniq > deduped.json

3. Manually Classify Data To Form The Training Set

Now that you have a raw set of data, the interactions need to be manually assigned categories to form the training set.

As you are likely to have thousands of data points to classify, you may want to outsource this work. This is why it is vital to have well-written definitions of your categories. We chose Airtasker to outsource the work. The advantage we found with Airtasker was that we were assigned workers we could communicate with and give feedback to.

We reformatted the raw JSON data as a CSV file to pass to our workers. The file contained the following fields:

  • interaction.id - Used to rejoin the categories back on to the original interactions
  • interaction.content - The field that the worker needs to examine
  • Category - to be completed by the worker

Again, as with the training set size, the more effort you can spend here the better the results will be. You might want to consider asking multiple people to manually classify the data and correlate the results. Even with well-written definitions, two humans may disagree on the right category.

With the results back from Airtasker we now had a raw set of interactions (as a JSON file) and a list of classified interactions (as a CSV file). These two combined formed our training set.

4. Generating A Classifier

With a training set in place the next step is to apply machine learning principles to generate rules for the linear classifier, and generate CSDL scoring rules to implement the classifier.

We implemented the algorithm in Python using the scikit-learn libraries, and the source is available here on GitHub.

At a high level the algorithm carries out these steps:

  • For each interaction in the training set, consider the target fields (in this case interaction.content)
    • Split into two sets, the first for training, the second for testing the classifier later
    • For each training interaction
      • Chunk the content into words and phrases
      • Build a list of candidate features to be considered for rules
  • Add / remove features based on domain knowledge (see below)
  • From the list of features select those with the most influence
  • Generate the classifier based on the selected features, and the interactions that match these features
  • Test the classifier against the training interactions and output results as a confusion matrix
  • Test the classifier against testing interactions put aside earlier
    • For each, logging the expected and actual categories assigned
    • Outputting overall results as a confusion matrix
  • Generate CSDL scoring rules from the classifier

The script takes in a raw JSON file of interactions (unclassified) and a CSV of classified interactions, matching the method I’ve explained. You can also specify keywords and phrases to include or exclude as an override to the automatically selected features.

See the GitHub repository for instructions on how to use the script.
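For orientation only, here is a toy illustration of the core idea in scikit-learn. This is not our actual script (see the GitHub repository for that); the training texts below are simply the category examples from earlier in this post:

# A toy sketch of training a linear classifier over tweet text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# In practice these come from the manually classified CSV joined back to the raw JSON.
texts = [
    "After tweeting negative comment about EasyJet, I have been refused boarding!",
    "Landing during storm, saved by EasyJet pilot, thanks",
    "EasyJet Flight cancelled. I demand compensation now!",
    "Where can I find EasyJet hand luggage dimensions?",
    "Dear EasyJet, how about providing WiFi onboard",
    "EasyJet, do you sell group tickets to Prague?",
]
labels = ["Rant", "Rave", "Urgency", "Query", "Feedback", "Lead"]

# Chunk the content into word and two-word-phrase features
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

clf = LinearSVC()  # the linear classifier itself
clf.fit(X, labels)

# Score a new, unseen interaction
print(clf.predict(vectorizer.transform(["My flight was cancelled, I want compensation"])))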

Domain Knowledge

The script allows you to specify keywords and phrases that must or must not be considered when generating the classifier. This allows you a level of input into the results based on human experience.

For example we specified the following words should be considered for the airline classifier as we knew they would give us strong signal:

urgent,urgently,today,refund,compensate,compensation,website,connecting,upgrade

5. Perfecting The Classifier

Your first classifier might not give you a great level of accuracy. Once you have a method working, you may need to spend considerable time iterating and improving your classifier.

You might want to extract a larger set of training data or you may wish to add or remove keywords as you learn more about the data.

The script also allows you to manipulate the parameters passed to the statistical algorithms. Refining these parameters can produce significantly different results.

Conclusion

I hope this post has given you some insight into building a machine learned classifier. It is impossible to give a foolproof, turnkey method as use cases vary so wildly.

As I said in the introduction, linear classifiers are suited to social data because of their efficiency. You may need to invest significant time perfecting your classifier; this is the nature of machine learning.

Check out our library for more examples of classifiers. We’ll be adding more linear classifiers soon!

To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed

Richard Caudle
Updated on Wednesday, 22 January, 2014 - 12:55

In a previous post I introduced you to the new DataSift library, a repository of extremely useful classifiers that will help you build great social solutions quickly. In this post, using a practical example, I'll show just how useful the classifiers are and how they can help you start tackling a real-world scenario in minutes.

Analysing Sony's PS4 Launch

On the 14th of November 2013 at 11pm Eastern Time, Sony held their launch event for the PlayStation 4. Naturally there was a large volume of conversation taking place on social platforms around this time.

I decided to dig into this conversation as I wanted to learn:

  • Which news providers were the most influential amongst the interested audience
  • What types of people engaged with the PS4 launch
  • How did each of the above change over time?

Weapons Of Choice

To carry out my analysis I would need to:

  1. Extract historical Twitter data for the period around the event
  2. Push this data to a store where I could perform my analysis
  3. Aggregate and visualise the data so I could look for insights

Although this sounds like a lot of work, with the right tools it was quick and easy:

  • DataSift Historics - provides straight forward access to historic data
  • MySQL Destination - the new MySQL destination makes it really easy to push data to a MySQL database
  • Tableau - an awesome tool that makes data visualisation a breeze. It is able to pull data directly from a MySQL database.

So these three work perfectly in harmony, but I am missing one piece. How could I make sense of the data?

Tableau is a great tool, but I needed to deliver data to it in a structure that made sense for my analysis. In this case I needed to add the following metadata to each social interaction so that I could aggregate the data appropriately:

  • The news provider whose content is being shared
  • The profession of the user who posted the interaction

Classifiers To The Rescue

Enter the DataSift Library, the hero of our piece. By using classifiers from the library I could attach the required metadata through tagging rules, taking advantage of the new VEDO features.

By using three library items my life became incredibly easy:

  1. Twitter Source - This classifier tells you what application was used to generate content, and splits these applications into buckets including 'manual' (created by a human) and 'automatic' (created by a bot or automated service). As I wanted to focus on engagement by real people, I used this classifier to exclude automated tweets.
  2. News Providers & Topics - This classifier identifies mentions of well known news providers by inspecting domains of links being shared in content. I used this classifier to tag content appropriately with the news provider being mentioned.
  3. Professions & Roles - This classifier identifies the profession of Twitter users from their profile. I used this classifier to tag content with the profession of the author.

Each of the items in the library has a hash, found on its library page, which is used to apply it to a stream.

Building My Stream

I started by writing a filter to match tweets relating to the launch event. I decided to look for interactions that mentioned PlayStation and keywords (launch, announces, announce). I looked both in the content of the post and also in the title of any content being shared through links.

Note: Links are automatically expanded and decorated with metadata by the Links augmentation.
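The filter itself isn't shown here; as a minimal sketch, the CSDL I used was shaped roughly like this (keyword lists illustrative):

(interaction.content contains "playstation"
    and interaction.content contains_any "launch, announces, announce")
or
(links.title contains "playstation"
    and links.title contains_any "launch, announces, announce")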

 

 

With my filter created I now wanted to apply metadata to the matching interactions. I used the tags keyword to import the library items I chose, each with the appropriate hash:
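For example (the hashes below are placeholders; each real hash comes from the item's library page):

tags "<twitter source hash>"
tags "<news providers & topics hash>"
tags "<professions & roles hash>"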

 

 

My complete stream definition therefore became:
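Again as a sketch, with placeholder hashes:

tags "<twitter source hash>"
tags "<news providers & topics hash>"
tags "<professions & roles hash>"

return {
    (interaction.content contains "playstation"
        and interaction.content contains_any "launch, announces, announce")
    or
    (links.title contains "playstation"
        and links.title contains_any "launch, announces, announce")
}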

 

 

Data With Meaning

The advantage of using the library definitions is that the data is given meaning through metadata, so that when it reaches the application (Tableau in this case) it is much easier to process.

When an interaction arrives it will have a tag_tree property, containing any matching tags. So for example due to the library classifiers I might receive an interaction with the following metadata under interaction.tag_tree:
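The exact namespaces depend on the library items themselves, but purely as an illustration the tag_tree metadata takes a shape along these lines (invented values):

{
  "source": ["manual"],
  "news_provider": ["The Guardian"],
  "profession": ["Journalist"]
}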

 

 

Gathering The Data

To store the historic data, I set up a MySQL destination as the delivery target for the historic query.

I created a fresh new database and used this script to create the necessary tables. I used this mapping file as my mapping definition.

I will cover the MySQL destination very soon in a separate post. You can read the technical documentation here.

I then ran a historic query for Twitter between 13th November 2013 5am GMT and 16th November 2013 5am GMT. 5am GMT being midnight Eastern Time, this gave me data for the day before, the day of, and the day after the launch event.

Finally I wrote some simple SQL views to augment the data ready for Tableau.
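As an illustration, a minimal sketch of the main view (column names assumed from the example schema) simply joins interactions to their tag labels:

CREATE VIEW tagged_interactions AS
SELECT i.*, t.label
FROM interaction i
LEFT JOIN tag_labels t ON t.interaction_id = i.interaction_id;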

 

 

I used the tagged_interactions view for my analysis in Tableau.

Create A Tableau Dashboard

With the metadata attached to each interaction and exposed through my view, analysis with Tableau was quick and painless. I used Tableau to follow how coverage and engagement changed from before until after the launch.

Firstly I created a time series chart, showing the volume of tweets and retweets during the three day period. I plotted the created_at time against the number of records.

I then created a chart for both professions and news providers. The columns I created for the tags were perfect for Tableau.

For professions, I first filtered by tweets that were created manually (thanks to the Twitter Source library item). I then plotted profession (thanks to the Professions & Roles library item) against the number of records.

Similarly for news providers, I first filtered by tweets that were created manually and then plotted provider (thanks to the News Providers & Topics library item) against the number of records.

Finally I used the time series chart as a filter for the other two charts, combining the three on a dashboard so that I could look at activity across different time periods.

The Results

Day Before Launch

The day before the launch the major news providers didn't see great engagement from their audience. The Guardian and CNET managed to generate the most engagement. Certainly both providers have a reputation for excellent technology news coverage.

The launch at this point engaged journalists, technologists, entrepreneurs and creative professionals. This is what I would have expected, as these groups follow technology and gaming trends most closely in my experience.

Day Of Launch

The day of the launch, the audience engages more widely with a variety of providers. Digital Spy takes a huge lead though, with users sharing its content most widely.

As far as professions are concerned, the audience is split similarly. The news of the launch has not yet been engaged with widely.

Day After Launch

As the launch was at 11pm, it's not that surprising that only fans would have engaged.

However, looking at the engagement after the launch, at this point major providers such as USA Today and CNN are managing to engage their audiences. As you might expect, the range of professions engaging broadens as the news of the launch is covered more in the mainstream press.

At this point I could have dug deeper. For instance I could have included further news providers that may specialise in such an event. This example though was designed to show how quickly the library can get you started.

Until Next Time...

It's difficult to convey in a blog post, but by using the tagging features of VEDO, the ready-made library definitions and the new MySQL destination, what could have been a couple of days work was complete in a couple of hours.

Not having to post-process the data by writing and running scripts, and having reusable rules I could just pull out of the library made the task so much easier. Also the MySQL destination makes the process of extracting data and visualising it in Tableau completely seamless.

I hope this post gives you some insight into how these tools can be combined effectively. Please take a look at the ever-growing library and see how it can make your life easier.

To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed
