DataSift's REST API allows you to work with our historical archive. Here are the steps. Notice that you're hitting a combination of endpoints from our Core, Historics, and Push APIs. If you're starting out, read our Introduction to Historics.
Chunking is an important concept that you need to understand when you're running Historics queries that last for more than a day or so.
Read Understanding Chunks to learn more.
How do I use the Historics API?
Create a filter in CSDL and call the /compile endpoint. Filters that hit the Historics API are exactly like any other kind of DataSift filter. In fact, you can take a filter that you've already run against a live DataSift stream and run it on Historics without any change.
Make a note of the hash for your CSDL filter. The /compile endpoint returns a JSON object containing the hash.
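The compile step can be sketched as below. This is a minimal illustration rather than official client code: the base URL, the csdl parameter name, and the "username:api_key" Authorization header are assumptions, and the helper names are hypothetical. The request is only built, not sent.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.datasift.com/v1"  # assumed base URL


def build_compile_request(csdl, username, api_key):
    """Build (but do not send) a POST request for the /compile endpoint."""
    # Assumption: the CSDL source is passed in a form field named "csdl".
    body = urllib.parse.urlencode({"csdl": csdl}).encode("utf-8")
    request = urllib.request.Request(API_BASE + "/compile", data=body, method="POST")
    # Assumption: credentials go in the Authorization header as "username:api_key".
    request.add_header("Authorization", f"{username}:{api_key}")
    return request


def extract_hash(response_text):
    """Pull the filter hash out of the JSON object that /compile returns."""
    payload = json.loads(response_text)
    if "hash" not in payload:
        raise ValueError("no 'hash' field in /compile response")
    return payload["hash"]
```

To send the request, pass it to urllib.request.urlopen and hand the decoded response body to extract_hash.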
It is a good idea to hit the /historics/status endpoint to determine whether the Historics archive has data coverage for the time interval you want to filter against. The endpoint also allows you to determine whether augmentations have been applied for that period. Additionally, the /preview/create and /preview/get endpoints allow you to check how much data your query will return.
Hit the /historics/prepare endpoint in the Historics API. You need to pass several parameters:
- the name you want to give to your Historics query.
- the start date and time and the end date and time, both as Unix timestamps.
- the CSDL hash that the /compile endpoint returned.
- the data sources that you want to filter against (Facebook, Tumblr, LexisNexis and so on).
The start and end dates must both be in the past, and the end date must be later than the start date.
This API call generates a Historics id which serves as a unique identifier for your Historics query. Make sure you note the Historics id.
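The date rules for /historics/prepare are easy to check before you make the call. The sketch below validates the timestamps and assembles the parameters; the field names (name, start, end, hash, sources) and the comma-separated sources format are assumptions, and build_prepare_params is a hypothetical helper.

```python
import time


def build_prepare_params(name, start, end, csdl_hash, sources, now=None):
    """Validate the time window and return parameters for /historics/prepare.

    start and end are Unix timestamps; sources is a list of source names.
    """
    now = int(time.time()) if now is None else int(now)
    if end <= start:
        raise ValueError("the end date must be later than the start date")
    # start < end is already guaranteed, so checking end covers both dates.
    if end >= now:
        raise ValueError("the start and end dates must both be in the past")
    if not sources:
        raise ValueError("at least one data source is required")
    return {
        "name": name,
        "start": int(start),
        "end": int(end),
        "hash": csdl_hash,
        # Assumption: multiple sources are sent as one comma-separated string.
        "sources": ",".join(sources),
    }
```

The optional now argument exists only to make the check testable with a fixed clock; in normal use, leave it unset.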
Hit the /push/create endpoint in the Push API, passing it the Historics id that you generated with /historics/prepare.
This API call generates a Subscription and returns an id which serves as a unique identifier for that Subscription. Make sure you keep a note of the subscription id.
One of the parameters is called output_params. Here you can set the maximum amount of data that you want to receive in each POST request, and the minimum time between POST requests. Think about these settings; if you choose a small value for the data size and a long time interval, and then run a high-volume stream, you could lose data. We recommend that you test your streams in DataSift's UI first, to get an idea of the data volume that you can expect.
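A rough way to reason about these settings: if each POST carries at most a fixed number of bytes and POSTs are at least a fixed number of seconds apart, delivery cannot exceed the size divided by the interval. The sketch below applies that arithmetic. The function names are hypothetical, and the calculation ignores request latency and any buffering DataSift performs, so treat it as a sanity check only.

```python
def max_delivery_rate(max_size_bytes, min_interval_seconds):
    """Upper bound on delivery throughput in bytes per second."""
    if max_size_bytes <= 0 or min_interval_seconds <= 0:
        raise ValueError("size and interval must both be positive")
    return max_size_bytes / min_interval_seconds


def can_keep_up(max_size_bytes, min_interval_seconds, expected_bytes_per_second):
    """Return True if the delivery settings can carry the expected volume."""
    rate = max_delivery_rate(max_size_bytes, min_interval_seconds)
    return rate >= expected_bytes_per_second
```

For example, a 100 KB limit with 60 seconds between requests carries under 2 KB per second, so a stream that produces 50 KB per second would fall behind; the expected volume figure is something you would measure yourself, for instance by testing the stream in the UI.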
To start the Historics query running, hit the /historics/start endpoint. DataSift runs your CSDL filter against its Historics data between the start date and the end date that you supplied to the /historics/prepare endpoint.
If you need to stop a Historics query before it reaches the end timestamp that you specified in your call to the /historics/prepare endpoint, simply hit /historics/stop. Note that it is not possible to restart a job.
At any time, you can request simple statistics on any Historics query. Just call the /historics/get endpoint. The statistics include information about the status and progress updates for the whole job as well as for separate chunks. If you set with_estimate=1 in your call to /historics/get, the statistics include your estimated job completion time.
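Reading the statistics might look like the sketch below. The response field names used here (status, progress, chunks, estimated_completion) and the id parameter are assumptions about the JSON shape, not confirmed by this page, so the code reads every field defensively. Both helper names are hypothetical.

```python
import json


def build_get_params(historics_id, with_estimate=True):
    """Return query parameters for /historics/get."""
    # Assumption: the Historics id is passed in a parameter named "id".
    params = {"id": historics_id}
    if with_estimate:
        params["with_estimate"] = 1
    return params


def summarise_historics(response_text):
    """Reduce a /historics/get response to job-level and per-chunk progress."""
    job = json.loads(response_text)
    chunks = job.get("chunks", [])
    return {
        "status": job.get("status"),
        "progress": job.get("progress"),
        # One entry per chunk, in the order the response lists them.
        "chunk_progress": [chunk.get("progress") for chunk in chunks],
        # Expected only when with_estimate=1 was requested; None otherwise.
        "estimated_completion": job.get("estimated_completion"),
    }
```

Missing fields come back as None instead of raising an error, which keeps a polling loop running if the response shape differs from what is assumed here.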
At any time, you can stop and delete a Historics query by calling /historics/delete. If you attempt to start that query again using the same Historics id, you will receive an error message. The call does not cause DataSift to delete the original CSDL code.
If you allow the Historics query to run until it finishes, DataSift will automatically shut down your Subscription once it has delivered all the data.
If you hit /historics/stop or /historics/delete, DataSift will change the Push status to "finishing", deliver any data that is still in the internal buffer, shut down your subscription, and write a message to the log indicating that delivery is complete.