Tencent Weibo - A Key Piece In Your Chinese Social Jigsaw
Understanding Tencent Weibo Interactions
Types of Interaction
- tencentweibo.author.followers_count: The number of followers a user has
- tencentweibo.author.following_count: The number of users the user follows
- tencentweibo.author.statuses_count: The number of posts the user has created
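These targets can be used directly in CSDL alongside content conditions. A minimal sketch (the keyword and threshold are illustrative, and I'm assuming the normalized interaction.content target applies to Tencent Weibo interactions as it does to other sources):

```csdl
// Illustrative only: influential Tencent Weibo authors mentioning a keyword
interaction.content contains "coffee"
AND tencentweibo.author.followers_count >= 1000
```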
Tencent In Action
A Helping Hand From VEDO
Putting It All Together
Over To You...
The Challenges Of East Asian Social Data
- Access to data: Almost all conversations take place on local social networks, rather than Twitter and Facebook. The ecosystem around these local networks has been less mature, and therefore gaining access has been more challenging.
- Inadequate tooling: Even if you could gain access to these sources, the vast majority of tools are heavily biased towards European languages, trained on spacing and punctuation which simply don't exist in East Asian text. Inadequate tooling leads to poor and incomplete insights.
Moving Beyond Substring Matching
Try It For Yourself
Principles Of Tokenization
- Once we have options for where word boundaries sit (Ali? Alice? Alicew?), the number of possible segmentations can grow exponentially in the worst case.
- A numerical score can help rank the possible outcomes.
- You cannot apply a simple string-searching algorithm for keywords (i.e. search for the substring まじ within the text), as it can easily introduce errors.
- The decision whether or not to tokenize can be affected by surrounding characters.
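A classic illustration of those last two points (my example, not one from the original text): a naive substring search for 京都 ("Kyoto") also matches inside 東京都 ("Tokyo Metropolis"), because the surrounding characters change the correct tokenization:

```python
text = "私は東京都に住んでいます"  # "I live in Tokyo Metropolis"

# Naive substring matching has no notion of Japanese word boundaries,
# so "Kyoto" is found inside "Tokyo Metropolis": a false positive.
print("京都" in text)  # True
```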
Approaches For Japanese Tokenization
Putting It Into Practice: Tips For Japanese CSDL
Advanced Filtering - Stemming
Working Example For Japanese Geo-Extraction
- Tokenization is an important technique to extract correct signals from East Asian languages.
- N-gram and Morphological analysis are the two main techniques available.
- DataSift has implemented a noise-tolerant morphological approach for Japanese, with extensions to handle new words accurately.
- By adding our new modifier [language(ja)] in CSDL, you can activate our Japanese tokenization engine in our distributed system.
- We can mix Japanese and other languages within a CSDL filter, enabling unified, centralized data analysis.
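As a sketch of that mixed-language idea (treat the exact placement of the [language(ja)] modifier as an assumption and check the CSDL documentation):

```csdl
// Illustrative sketch: Japanese tokenization for one clause,
// plain keyword matching for another
interaction.content contains [language(ja)] "まじ"
OR interaction.content contains "awesome"
```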
The LexisNexis Source
How Could You Use It?
- Reputation management: Spot important trends, new opportunities and potential threats, and act on them before anyone else. By monitoring news content you can proactively track negative opinions and adverse developments, and identify risks. Alongside LexisNexis you could add social data sources, so you can monitor reputation across both social networks and published media.
- Opportunity identification: By staying on top of the latest news stories, companies can anticipate customers' emerging needs and stay one step ahead of their competition. LexisNexis covers newspapers, press releases, specialist trade journals and regional publications so you can stay on top of breaking news.
- Risk monitoring: There are many factors that can impact business performance, including the state of local economies, political upheaval and legislative change. Using LexisNexis news and legal coverage, keep abreast of issues that impact your suppliers and clients, and changes in local markets that could harm your business around the globe.
An Example Filter
LexisNexis SmartIndexing Technology™
LexisNexis + VEDO
Can The LexisNexis Source Help You?
One key challenge for developers creating a solution is integrating data sources, often many of them. DataSift destinations take away this headache, especially the recently released MySQL destination.
The MySQL destination allows you to map and flatten unstructured data to your database schema, avoids the need for custom integration code, and handles real-time delivery challenges such as dropped connections so you don't have to.
Relieving Integration Challenges
The DataSift platform offers many great places to push data, but, let's face it, we often like to see data in a good old-fashioned database. Relational databases such as MySQL are still the backbone of enterprise solutions.
Receiving a stream of unstructured data, structuring it, then pushing it into a relational database can cause a number of headaches. The new MySQL destination makes the job straightforward, so that you can concentrate on getting maximum value out of your data. It provides the following features:
- Guaranteed delivery - Data delivery is buffered and caters for dropped connections and delivery failure
- Delivery control - Data delivery can be paused and resumed as you require under your control
- Data mapping - Specify precisely how you want fields (within each JSON object) to be mapped to your MySQL schema
These features combined make pushing data from DataSift into a MySQL database extremely easy.
The MySQL Destination
As with any other type of destination, the easiest way to get started is to go to the Destinations page. Choose to add a new MySQL destination to your account.
Note that the MySQL destination is only currently available to enterprise customers. Contact your sales representative or account manager if you do not see the destination listed in your account.
To set up the destination you need to enter a name, the host and port of your MySQL server, the destination database schema and authentication details.
You also need to provide a mappings file. This file tells the destination which fields within the JSON data you would like to be mapped to tables in your database schema. More details on this in a minute.
It's worth using the Test Connection button, as this will check that your MySQL server is accessible to our platform, that the database exists, that the security credentials are valid and that the mapping file is valid.
Note that you can also create the destination via our API. This process is documented here.
Mapping Data To A Schema
The basic connection details above are self-explanatory, but the mapping file definitely needs a little more explanation. There are many things to consider when mapping unstructured data to a relational set of tables.
Let me take you through an example schema and mapping file to help clarify the process. Both have been designed to work with Twitter data.
The example schema includes the following tables, which give us a structure in which to store the tweets.
- interaction - Top-level properties of each interaction / tweet. All tables below reference interactions in this table.
- hashtags - Hashtags mentioned for each interaction
- mentions - Twitter mentions for each interaction
- links - Links for each interaction
- tag_labels - VEDO tags for each interaction
- tag_scores - VEDO scores for each interaction
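As a rough sketch (a hypothetical, cut-down version of the real example schema; column types are my assumptions), the interaction table might look something like this:

```sql
-- Hypothetical, cut-down 'interaction' table. Column names follow the
-- mappings discussed later in this post; types are illustrative.
CREATE TABLE interaction (
    interaction_id VARCHAR(64) NOT NULL PRIMARY KEY,
    created_at     DATETIME,
    is_retweet     TINYINT(1),
    content        TEXT
);
```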
The example schema is quite exhaustive, but please don't be put off! You can more than likely use a subset of the fields and tables to store the data your solution needs. You might also choose to write views that transform data from these tables to fit your application.
Now's not the time to cover MySQL syntax; I'm sure that if you're reading this post you're used to creating schemas. Instead I'll move on to the mapping file, which is where the magic lies.
The mapping file allows you to specify what tables, columns and data types the raw data should be mapped to in your schema. I can't cover every possibility in one post, so for full details see our technical documentation pages. To give you a good idea though, I'll pick out some significant lines from the example mapping file.
Let's pretend we have the following interaction in JSON (I removed many fields for brevity):
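Something like this (a cut-down, illustrative payload; the field values are made up, and real interactions carry many more fields):

```json
{
  "interaction": {
    "id": "1e123abc456example",
    "created_at": "Mon, 10 Mar 2014 12:34:56 +0000",
    "content": "Loving these #social #marketing tips!",
    "hashtags": ["social", "marketing"],
    "tag_tree": {
      "priority": "high",
      "relevance": 0.9
    }
  },
  "twitter": {
    "retweeted": {
      "id": "123456789012345678"
    }
  }
}
```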
Tables, Datatypes & Transforms
The first line tells the processor that you want to map the following columns of the 'interaction' table to fields in the JSON structure.
The next line tells the processor to map the path interaction.id to the interaction_id column of the table:
interaction_id = interaction.id
Skipping a couple of lines, the following tells the processor to map interaction.created_at to the created_at column. You'll notice, though, that we have additional data_type and transform clauses.
created_at = interaction.created_at (data_type: datetime, transform: datetime)
If you don't explicitly specify a data_type, the processor will attempt to choose the best type itself by inspecting the data value. In the majority of cases this is perfectly fine, but in this line we ensure that the type is a datetime.
The transform clause gives you access to some useful functions. Here we are using the datetime function to cast the string value in the data to a valid datetime value.
Later on for the same table you'll see this line which uses a different transform function:
is_retweet = twitter.retweeted.id (data_type: integer, transform: exists)
Here the function will return true if the JSON object has this path present, otherwise it will return false.
Now let's move down to the hashtags table mapping. You'll see this as the first line:
[hashtags :iter = list_iterator(interaction.hashtags)]
This table mapping uses an iterator to map the data from an array to rows in a table. The line specifies that any items within the interaction.hashtags array should each be mapped to one row of the hashtags table. For our example interaction, a row would be created for each of 'social' and 'marketing'.
Note that we can refer to the current item in the iterator by using the :iter variable we set in the table mapping declaration:
hashtag = :iter._value
Here _value is a reserved property representing the value of the item in the array. You can also access _path, which is the relative path of the value within the object. If we were using a different type of iterator, for example one over an array of objects, we could reference properties of the current object, such as :iter.id.
There are a number of iterators you can use to handle different types of data structure:
- list_iterator - Maps an array of values at the given path to rows of a database table.
- objectlist_iterator - Like list_iterator, but instead is used to iterate over an array of objects, not simple values.
- path_iterator - Flattens all properties inside an object, and its sub-objects, to give you a complete list of properties in the structure.
- leaf_iterator - Like path_iterator, but instead of flattening object properties it flattens any values in arrays within the structure into one complete list.
- parallel_iterator - Given a path in the JSON object, this iterator takes all the arrays which are children and maps the items at each index to a row in the table. This is particularly useful for working with links.
The iterators are powerful and allow you to take deep JSON structures and flatten them to table rows. Please check out the documentation for each iterator for a concrete example.
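To make that concrete, an objectlist_iterator mapping for the mentions table might look like this (a hypothetical sketch; the path interaction.mentions and the property names are assumptions for illustration):

```
[mentions :iter = objectlist_iterator(interaction.mentions)]
mention_id  = :iter.id
screen_name = :iter.screen_name
```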
As a further example, the following line specifies mapping for VEDO tags that appear in the tag_tree property of the interaction:
[tag_labels :iter = leaf_iterator(interaction.tag_tree)]
Here we are mapping all leaves under interaction.tag_tree to a row in the tag_labels table.
The final feature I wanted to cover is conditions. These are really useful if you want to put data in different tables or columns depending on its data type.
Although this might sound unusual, returning to our example, it is useful when dealing with tags and scores under the tag_tree path.
Under the mapping declaration for the tag_labels table, there is this line:
label = :iter._value (data_type: string, condition: is_string)
This states that a value should only be put in the table if the value is a string. You'll see a very similar line for the tag_scores table below, which does the same but insists on a float value. The result is that tags (which are text labels) will be stored in the tag_labels table, whereas scores (which are float values) will be stored in the tag_scores table.
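That tag_scores mapping would look something like this (the is_float condition keyword is my assumption, mirroring is_string above):

```
[tag_scores :iter = leaf_iterator(interaction.tag_tree)]
score = :iter._value (data_type: float, condition: is_float)
```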
That concludes our whirlwind tour of the features. Mapping files give you a comprehensive set of tools to map unstructured data to your relational database tables. With your mapping file created you can start pushing data to your database quickly and easily.
This was quite a lengthy post, but hopefully it gave you an overview of what's possible with the new MySQL destination. The key point is that it makes it incredibly easy to push data reliably into your database. I've personally thrown away a lot of custom code I'd written to do the same job, and now I don't think twice about getting data into my databases.
To stay in touch with all the latest developer news please subscribe to our RSS feed at http://dev.datasift.com/blog/feed