Understanding Chunks

DataSift's massively parallel architecture allows us to divide an Historics query into a collection of smaller tasks and run them simultaneously to give you the highest performance possible. Each of these smaller tasks acts on a "chunk" of data.

In the case of Historics, a chunk is a 24-hour period measured from midnight to midnight, UTC. For example, if you create an Historics query that begins at 4pm UTC on one day and ends at 10am UTC the following day, it will run for 8 hours in one chunk and 10 hours in another.
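
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of how a time range falls across midnight-to-midnight UTC boundaries, not DataSift's actual implementation; the function name split_into_day_chunks and the calendar dates used are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def split_into_day_chunks(start, end):
    """Split [start, end) into pieces that never cross a UTC midnight.

    Hypothetical helper for illustration only.
    """
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware")
    start = start.astimezone(timezone.utc)
    end = end.astimezone(timezone.utc)

    chunks = []
    cursor = start
    while cursor < end:
        # Midnight UTC that closes the day containing `cursor`.
        next_midnight = (cursor + timedelta(days=1)).replace(
            hour=0, minute=0, second=0, microsecond=0
        )
        piece_end = min(next_midnight, end)
        chunks.append((cursor, piece_end))
        cursor = piece_end
    return chunks


# 4pm UTC on one day to 10am UTC the next (dates chosen arbitrarily).
start = datetime(2016, 1, 1, 16, 0, tzinfo=timezone.utc)
end = datetime(2016, 1, 2, 10, 0, tzinfo=timezone.utc)

hours_per_chunk = [
    (piece_end - piece_start).total_seconds() / 3600
    for piece_start, piece_end in split_into_day_chunks(start, end)
]
print(hours_per_chunk)  # [8.0, 10.0]
```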

Let's look at another example. Suppose you write CSDL code to look for mentions of a new beverage, Data Cola:

interaction.content contains "Data Cola"

and you want to run that from January 1, 2016 to January 31, 2016.

You send that CSDL to the /historics/prepare endpoint and create an Historics query that begins at one second past midnight on January 1 and ends at midnight at the end of January 31. If the chunk size is one day, DataSift will create 31 chunks and process one day of your query in each chunk.

These 31 tasks run in parallel, and DataSift delivers the data from each day as soon as it finds it. As a consequence, the data does not necessarily reach you in chronological order.
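
The effect of parallel processing on ordering can be reproduced locally. The sketch below does not call DataSift; it uses a Python thread pool and a hypothetical process_chunk function as a stand-in for the per-day tasks, to show that results are collected in completion order and that the receiving side can restore chronological order by sorting.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import date, timedelta


def process_chunk(day):
    """Hypothetical stand-in for the work done on one day's chunk."""
    return {"day": day, "matches": []}


# One task per day of January 2016, as in the example above.
days = [date(2016, 1, 1) + timedelta(days=i) for i in range(31)]

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(process_chunk, d) for d in days]
    # as_completed yields futures in completion order,
    # which is not guaranteed to match submission order.
    arrived = [f.result() for f in as_completed(futures)]

# Restore chronological order on the receiving side.
in_order = sorted(arrived, key=lambda result: result["day"])
print(len(arrived), in_order[0]["day"], in_order[-1]["day"])
```

If your application depends on chronological order, sorting by a timestamp after delivery, as in the last step, is one way to recover it.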

To determine the progress of a parallel Historics query, hit the /historics/get endpoint and specify the Historics id of your Historics query (the id that you received when you hit the /historics/prepare endpoint). You can call the /historics/get endpoint without specifying an Historics id but, if you do, it will not return information about chunks.
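
As a rough sketch of the call described above, the Python below builds the request URL with and without an Historics id and summarizes per-chunk progress from a response. Several details are assumptions rather than confirmed API facts: the base URL, the query parameter name "id", the "chunks", "status" and "progress" field names, and the sample values. Check them against the API reference before relying on them. Authentication and the HTTP request itself are left out.

```python
from urllib.parse import urlencode

# Assumed base URL; confirm against the API reference.
API_BASE = "https://api.datasift.com/v1"


def historics_get_url(historics_id=None):
    """Build the /historics/get URL, optionally for one Historics query."""
    url = API_BASE + "/historics/get"
    if historics_id is not None:
        # Parameter name "id" is an assumption.
        url += "?" + urlencode({"id": historics_id})
    return url


def summarize_chunks(response):
    """Return (status, progress) per chunk, assuming a 'chunks' list."""
    return [
        (chunk.get("status"), chunk.get("progress"))
        for chunk in response.get("chunks", [])
    ]


# Hypothetical response fragment, used only to exercise the helper.
sample_response = {
    "id": "abc123",
    "chunks": [
        {"status": "succeeded", "progress": 100},
        {"status": "running", "progress": 40},
    ],
}

print(historics_get_url("abc123"))
print(summarize_chunks(sample_response))
# A response without chunk details yields an empty summary.
print(summarize_chunks({"id": "abc123"}))
```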