Parallel Iterator

Consider a Tweet that has multiple links. This JSON fragment shows an array of three links and another array that holds a title associated with each link.

{
    "interaction": {
        "author": {
            "username": "DataSiftDev",
            "name": "DataSift Developer",
            "id": 155505157,
...
        }
    },
    "links": {
        "created_at": [
            "Tue, 19 Nov 2013 11:22:09 +0000"
        ],
...
        "title": [
            "Big Data",
            "Cloud",
            "Data Science"
        ],
        "url": [
            "http://example.com/bigdata",
            "http://example.com/cloud",
            "http://example.com/datascience"
        ]
    },
    "twitter": {
        "created_at": "Tue, 19 Nov 2013 11:21:58 +0000",
...

You can use parallel_iterator to write this to a LINKS table in this format, with a row for each link.

ID         NAME                USERNAME     TITLE         URL
155505157  DataSift Developer  DataSiftDev  Big Data      http://example.com/bigdata
155505157  DataSift Developer  DataSiftDev  Cloud         http://example.com/cloud
155505157  DataSift Developer  DataSiftDev  Data Science  http://example.com/datascience

Notice that the ID, NAME, and USERNAME columns contain the same data in every row, while the TITLE and URL columns contain one element from each JSON array. The arrays are read in step, so each title stays paired with its URL.

Here's a .INI file that builds this table:

[LINKS :iter = parallel_iterator(links)]
ID = interaction.author.id
NAME = interaction.author.name
USERNAME = interaction.author.username
TITLE = :iter.title
URL = :iter.url

Line 1 specifies that this block in the .INI file defines an iterator that writes to the LINKS table.

The next three lines are simple, static mappings of JSON data to the database.

Line 5 uses the iterator defined in line 1. In this case, the iterator's argument is 'links', which tells DataSift to begin at the links node in the JSON tree and write each .title element it finds to the TITLE column in the LINKS table.

Similarly, line 6 tells DataSift to write each .url element it finds to the URL column in the LINKS table.
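
To make the row-building behavior concrete, here is a minimal Python sketch that emulates what the mapping above produces for the sample Tweet. It is an illustration only, not DataSift code: the function name links_rows and the trimmed-down doc dictionary are assumptions made for this example.

```python
# Hypothetical emulation of the LINKS mapping; not part of DataSift.
doc = {
    "interaction": {
        "author": {
            "username": "DataSiftDev",
            "name": "DataSift Developer",
            "id": 155505157,
        }
    },
    "links": {
        "title": ["Big Data", "Cloud", "Data Science"],
        "url": [
            "http://example.com/bigdata",
            "http://example.com/cloud",
            "http://example.com/datascience",
        ],
    },
}


def links_rows(doc):
    """Build one row per index of the parallel arrays under doc['links']."""
    author = doc["interaction"]["author"]
    links = doc["links"]
    rows = []
    # zip walks both arrays in lockstep, so title[i] stays paired with url[i].
    for title, url in zip(links["title"], links["url"]):
        rows.append(
            {
                # Static mappings: identical in every row.
                "ID": author["id"],
                "NAME": author["name"],
                "USERNAME": author["username"],
                # Iterator mappings: one array element per row.
                "TITLE": title,
                "URL": url,
            }
        )
    return rows


rows = links_rows(doc)
for row in rows:
    print(row["ID"], row["NAME"], row["USERNAME"], row["TITLE"], row["URL"])
```

The static values are repeated in every row, and the array elements advance together, which is the same shape as the LINKS table shown earlier.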

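Make sure you choose a starting point high enough in the JSON tree to cover every array you want to iterate over. For example, the arrays in this .INI definition live under two different top-level nodes, links and twitter, so the iterator starts at the root (.). It iterates over the links.url, links.normalized_url, and twitter.display_urls arrays and keeps them in sync.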

[LINKS :iter = parallel_iterator(.)]
id = interaction.id
created_at = interaction.created_at (data_type: datetime, transform: datetime)
type = interaction.type
url = :iter.links.url
normalized_url = :iter.links.normalized_url
display_url = :iter.twitter.display_urls
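
The idea of starting at the root and keeping several arrays in sync can be sketched in Python as follows. This is a hedged illustration, not DataSift's implementation: the helper names resolve and parallel_iterate, the sample values for normalized_url and display_urls, and the choice to pad shorter arrays with None are all assumptions made for this example. How DataSift itself treats arrays of unequal length is not described here.

```python
from itertools import zip_longest


def resolve(node, path):
    """Follow a dotted path such as 'links.url'; return [] if a key is missing."""
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    # Wrap a scalar so every column is handled as an array.
    return node if isinstance(node, list) else [node]


def parallel_iterate(root, columns):
    """Yield one dict per index across the arrays named in columns.

    columns maps a column name to a dotted path relative to root.
    """
    arrays = [resolve(root, path) for path in columns.values()]
    # Assumption: shorter arrays are padded with None. DataSift may differ.
    for values in zip_longest(*arrays, fillvalue=None):
        yield dict(zip(columns.keys(), values))


# Hypothetical sample data; only links.url values echo the earlier fragment.
doc = {
    "links": {
        "url": ["http://example.com/bigdata", "http://example.com/cloud"],
        "normalized_url": ["http://example.com/bigdata", "http://example.com/cloud"],
    },
    "twitter": {"display_urls": ["example.com/bigdata", "example.com/cloud"]},
}

columns = {
    "url": "links.url",
    "normalized_url": "links.normalized_url",
    "display_url": "twitter.display_urls",
}

rows = list(parallel_iterate(doc, columns))
for row in rows:
    print(row)
```

Because the paths are resolved from the root, arrays under links and twitter can be combined in one pass. Had the starting point been links, the twitter.display_urls array would have been out of reach.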