Understanding Chunks

DataSift's massively parallel architecture allows us to divide a Historics query into a collection of smaller tasks and run them simultaneously to give you the highest performance possible. Each of these smaller tasks acts on a "chunk" of data.

A chunk is defined by a unit of time. For instance, a chunk might run from midnight to the following midnight. In that case, a Historics query that begins at 4pm on one day and ends at 10am the following day spans two chunks: the first covers 8 hours of data (4pm to midnight) and the second covers 10 hours (midnight to 10am).
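The split described above can be sketched in a few lines of Python. This is only an illustration of the idea, assuming midnight-to-midnight chunks; the function name split_into_daily_chunks is hypothetical and is not part of the DataSift API, and the way DataSift actually chooses chunk boundaries may differ.

```python
from datetime import datetime, timedelta


def split_into_daily_chunks(start, end):
    """Split the range [start, end) at each midnight.

    Returns a list of (chunk_start, chunk_end) tuples.
    Illustrative only: assumes a chunk size of one calendar day.
    """
    chunks = []
    cursor = start
    while cursor < end:
        # Midnight at the start of the day after the cursor's day.
        next_midnight = datetime.combine(
            cursor.date() + timedelta(days=1), datetime.min.time()
        )
        chunk_end = min(next_midnight, end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end
    return chunks


# The example from the text: 4pm on one day to 10am the following day.
chunks = split_into_daily_chunks(
    datetime(2012, 1, 1, 16, 0), datetime(2012, 1, 2, 10, 0)
)
hours = [(c_end - c_start).total_seconds() / 3600 for c_start, c_end in chunks]
print(hours)  # [8.0, 10.0]
```

Under the same one-day assumption, a range that covers all of January 2012 produces 31 tuples, one per day.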

Let's look at another example. Suppose you write CSDL code to look for mentions of a new beverage, Data Cola:

twitter.text contains "Data Cola"

and you want to run it against data from January 1, 2012 to January 31, 2012.

You send that CSDL to the /historics/prepare endpoint and create a Historics query that begins at one second past midnight on January 1 and ends at midnight at the end of January 31. If the chunk size is one day, DataSift creates 31 chunks and processes one entire day of your query in each chunk.

The 31 tasks run in parallel, and DataSift delivers the data from each day as soon as it finds it. As a consequence, the data does not reach you in chronological order.
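If your application needs the results in time order, one option is to buffer what you receive and sort it yourself. The sketch below assumes each delivered record has been parsed into a dictionary with a created_at timestamp; that field name and the sample records are hypothetical, chosen only for illustration.

```python
from datetime import datetime

# Hypothetical records, listed in the order they happened to arrive
# from chunks that finished at different times.
received = [
    {"created_at": datetime(2012, 1, 17, 9, 30), "text": "sample C"},
    {"created_at": datetime(2012, 1, 3, 14, 5), "text": "sample A"},
    {"created_at": datetime(2012, 1, 29, 22, 45), "text": "sample D"},
    {"created_at": datetime(2012, 1, 3, 18, 0), "text": "sample B"},
]


def in_chronological_order(records):
    """Return a new list of records sorted by their timestamp."""
    return sorted(records, key=lambda record: record["created_at"])


ordered = in_chronological_order(received)
for record in ordered:
    print(record["created_at"].isoformat(), record["text"])
```

Sorting only makes sense once all the chunks you care about have been delivered, so in practice you would wait until the query reports that it is complete.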

To determine the progress of a parallel Historics query, hit the /historics/get endpoint and specify the Historics id of your query (the id that you received when you hit the /historics/prepare endpoint). You can call the /historics/get endpoint without specifying a Historics id but, if you do, it will not return information about chunks.
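As a rough sketch of the two ways of calling the endpoint, the helper below builds the request URL with and without an id. The base URL is a placeholder, and the query parameter name id, the sample id value and the function name are assumptions made for illustration; authentication is omitted. Check the API reference for the real host, parameters and credentials.

```python
from urllib.parse import urlencode

# Placeholder host: replace with the API base URL from the official reference.
API_BASE = "https://api.example.com/v1"


def historics_get_url(historics_id=None):
    """Build a URL for the /historics/get endpoint.

    With an id, the request targets a single Historics query.
    Without one, the id parameter is left out of the request entirely.
    The parameter name "id" is an assumption.
    """
    url = API_BASE + "/historics/get"
    if historics_id is not None:
        url += "?" + urlencode({"id": historics_id})
    return url


print(historics_get_url("abc123"))  # "abc123" is a made-up id
print(historics_get_url())
```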