Data Mapping

A .INI file defines which elements from the JSON data you want to store in which tables and columns in your database.

An easy way to create .INI files is to use the SQL Schema Mapper, a graphical tool built into DataSift. Whenever you create or edit a SQL-style data destination, DataSift offers you the choice of:

  • handcoding a .INI file and uploading it
  • using the graphical SQL Schema Mapper

All of the features of .INI files described on this page are available to you in the SQL Schema Mapper.The following video shows how to use SQL Schema Mapper.

What is in a .INI file?

In your .INI files you will define:

The simplest .INI file does not require any of these. It just defines mappings from the JSON to your database tables like this:

; This simple example writes to the INTERACTION table
; INI files can contain comments such as this and blank lines

[INTERACTION]
ID = interaction.id
USERNAME = interaction.author.username 
TEXT = interaction.content

This .INI file is complete and ready to run.

The first line indicates that you want to write data to the INTERACTION table in your database.

For every JSON object that DataSift pushes to your database, the second line examines the contents of the interaction.id element in the JSON and writes to the ID column in a new row of your INTERACTION table.

Similarly, the next line in the .INI file takes the content of interaction.author.username and writes it to the USERNAME column, and the final line takes the content of interaction.text and writes it to the TEXT column.

All three of these JSON elements are strings, so we write them to the database table as strings.

Data types

You can leave it to DataSift to choose the most appropriate data type when it writes to your tables, but sometimes you'll want to specify the type. For example, you might face a situation where you don't need the accuracy of a floating point number, and cast it as an integer in order to save storage space:

[INTERACTION]
FOLLOWER_RATIO = twitter.user.follower_ratio (data_type: integer)

To specify a data type, use the data_type keyword in your .INI file. Notice that the syntax requires parentheses and the data_type keyword.

Remember that if you leave DataSift to select the best data type, it might do something you don't expect. To be entirely certain that things are exactly as you want them, you can use the data_type parameter for every field that you write to your database.

Here's a complete list of the data types:

Type Connector Description
string MySQL, PostgreSQL Cast as a string, when we have a non-scalar we JSON encode the string.
integer MySQL, PostgreSQL Cast as an integer.
float MySQL, PostgreSQL Cast as a float.
datetime MySQL, PostgreSQL Create DateTime object that is then formatted correctly for each type of database (MySQL, PostgreSQL, and so on).
string-array PostgreSQL Array or nested arrays, where each value is cast as a string.
integer‑array PostgreSQL Array or nested arrays, where each value is cast as a integer.
float‑array PostgreSQL Array or nested arrays, where each value is cast as a float.

PostgreSQL allows you to store arrays within a single table row. We have provided the string-array, integer-array, and float-array types so that you can code your INI file to use the PostgreSQL-only feature:

[TWITTER]
INTERACTION_ID = interaction.id
CONTENT = twitter.text
HASHTAGS = twitter.hashtags (data_type: string-array)
MENTION_IDS = twitter.mention_ids (data_type: integer-array)

The SQL to create the PostgreSQL table might be:

CREATE TABLE TWITTER (
    INTERACTION_ID  varchar(64),
    CONTENT         text,
    HASHTAGS        text[],
    MENTION_IDS     int[]
);

Transforms

Transforms perform a range of useful actions:

Type Description
datetime Convert a date string (almost any format) or unix timestamp into a DateTime object. Each database type can then form this into their own date format at INSERT statement creation time
unixtimestamp Convert any date string into a unix timestamp
exists If the source of the mapping exists then it will return true, otherwise false.
first Return the first item in the source array
domain Extract the domain from a URL string. Otherwise return null value. Uses same technique as LinksResolution.
count Count the number of items in an array.

To specify a transform, use the transform keyword in your .INI file.

For example, suppose you want to build a table of Twitter users who either Tweet about your brand or Retweet other people's Tweets about you. You can use the exists transform to determine whether a particular interaction holds Retweet and set a flag to True (for a Retweet) or False otherwise:

[INTERACTION]
ID = interaction.id
USERNAME = interaction.author.username 
IS_RETWEET = twitter.retweeted.id (transform: exists)

If your database supports Boolean values, the data in that column will appear as True or False; if not, it will appear as 1 or 0. You can force DataSift to use 1 and 0 like this:

[INTERACTION]
ID = interaction.id
USERNAME = interaction.author.username 
IS_RETWEET = twitter.retweeted.id (data_type: integer, transform: exists)

Sometimes you will need to use transform and data_type together. For example, the created_at element in a JSON interaction is stored as a string such as this:

"Tue, 12 Nov 2011 14:17:55 +0000"

You could store that string in your database but it might be more valuable to convert it to the particular date/time format that your database uses.

[INTERACTION]
ID = interaction.id
USERNAME = interaction.author.username
CREATED = interaction.created_at (transform:datetime, data_type: datetime)

In this case, the .INI file takes the interaction.created_at string and uses the datetime transform to convert it to a datetime object; then it uses the datetime data type to write it in your database's native form. The order in which you specify these parameters does not matter because the data_type is always applied as the last step, just before we send the data to your database.

Iterators

Iterators allow you to process entire arrays of JSON strings or integers, writing one table row for each element in the array. There are several different iterators you can use, and the choice depends on the nature of the JSON objects you want to process and the kind of table structure you want to write to.

The iterator types are:

Iterator Purpose
list_iterator You specify a leaf node in the JSON tree. DataSift takes the element there, which can be an array of strings, and converts each string to one row in your database.
parallel_iterator You specify a node in the JSON tree. DataSift examines all the arrays immediately below that node and iterates through their elements. The first elements of each array are stored together in one table row, the second elements are stored in another row, and so on.
leaf_iterator You specify a starting point in the JSON. DataSift navigates to a leaf node and writes one row for each element it finds there. Useful in cases where you have a key representing a path through a JSON tree to a node that contains a value.
objectlist_iterator You specify a node in the JSON tree. DataSift examines all the objects immediately below that node and iterates through their elements.
path_iterator You specify a starting point in the JSON tree and leave the system to navigate down until it finds a scalar or array value. It generates a key-value pair where the key is the dot notation for the path through the JSON to the leaf node and the value is the scalar/array value of the leaf.

Conditions

Conditions provide a way for you to check whether data is an array or a floating point number.

Type Description
is_array Determines whether the value is an array. If so, it returns the original array. If not, it returns null.
is_float Determines whether the value is a float. If so, it returns the original value. If not, it returns null.

The primary use of conditions is to process the elements in JSON objects that are generated by DataSift's tagging and scoring functionality. Scoring produces numeric elements in the JSON whereas tagging produces arrays of strings; for example:

{
    "interaction": {
        "user": "datasift",
        "id": "12345678901234567890123456789012",
        "title": "Example",
        "content": "I like Superman, Batman, Incredible Hulk, and Iron Man.",
        "tag_tree": {
            "creator": [
                "Marvel Comics",
                "DC Comics"
            ],
            "abilities": [
                "Flight"
            ],
            "mobility": [
                "X-ray vision"
            ],
            "speed": {
                "Superman": 500.000000000000000,
                "Batman": 50.000000000000000
            }
        }
    }
}

Using conditions, you can support both types of these in a single table with columns for the path in the JSON tree, scores stored as floating point numbers, and tags stored as strings.

[INTERACTION_TAGS :iter = tree_iterator(interaction.tag_tree)]
ID = interaction_id
CREATED_AT = interaction.created_at (data_type: datetime, transform: datetime)
TAG_NAME = :iter._path
TAG_VALUE_FLOAT = :iter._value float (condition: is_float)
TAG_VALUE_ARRAY = :iter._value string-array (condition: is_array)

Working with Subscriptions

There are some elements in the interaction specifically designed to allow you to store details of your Push Subscriptions in the database.

This element: Holds this information:
subscription.id Push subscription id
subscription.hash Stream hash or Historics id
subscription.hash_type Either "stream" or "historics"

Here's a sample .INI file that writes to a table called SUBSCRIPTION:

[SUBSCRIPTION]
SUB_ID = subscription.id
SUB_HASH = subscription.hash
SUB_HASHTYPE = subscription.hash_type

This video provides a summary of configuring INI files for MySQL.