- What are Historics queries?
- How can I find out what data you have for the time interval I want to look at?
- How do I set my start and end dates and times?
- When will you run my Historics query?
- Can I make my job start sooner?
- How long will my Historics query take to run?
- Can I run my jobs in parallel?
- How long does it take to filter a week of Historics data?
- The estimate tells me my query won't be completed until tomorrow. Why is that?
- Can I query Historics for more than one month?
- What data sources can I filter for in Historics?
- How do I recover my data if a Historics query fails while it is running?
- I ran a Historics query but no data was delivered. What do I do now?
- Will I be notified when a Historics query is complete?
The DataSift Historics archive is a large body of content gathered from a variety of websites. Historics is useful when you want to turn the clock back and filter data from the past. It uses the same CSDL language that we use for live streaming, typically runs 100 times faster than live streaming, and offers 100 percent coverage or a 10 percent sample.
Before you run a full Historics job you can use our Historics Preview feature, which uses a 1 percent sample of the archive to estimate how much it will cost to run the job.
When you set up a Historics query via the /historics/prepare API endpoint, you pass the start and end date/time as parameters.
- The start and end parameters are Unix timestamps. You can use an online converter to create them.
- You can define your start and end times to an accuracy of one second.
- The start time is inclusive and the end time is exclusive. For example, if you want your query to cover 10:07:05 to 11:15:55 inclusive, find the Unix timestamp for 11:15:55 on the day in question and add one second to it.
- The maximum duration of a Historics query is one month. If you need to examine a longer period, run two or more queries and combine the results.
- The end time must be at least one hour in the past.
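As a sketch, the inclusive-start/exclusive-end rule above can be applied like this (the `start` and `end` parameter names come from /historics/prepare; the dates are just illustrative):

```python
import calendar
from datetime import datetime

def historics_interval(start_dt, end_dt_inclusive):
    """Return (start, end) Unix timestamps for /historics/prepare.

    The start time is inclusive and the end time is exclusive, so we
    add one second to the last moment we want included.
    """
    start = calendar.timegm(start_dt.utctimetuple())
    end = calendar.timegm(end_dt_inclusive.utctimetuple()) + 1
    return start, end

# Query from 10:07:05 to 11:15:55 (inclusive) on an example day, in UTC.
start, end = historics_interval(
    datetime(2013, 3, 1, 10, 7, 5),
    datetime(2013, 3, 1, 11, 15, 55),
)
```

Because timestamps are defined to one-second accuracy, the window above spans exactly 4,131 seconds: the 4,130 seconds between the two clock times, plus the one-second adjustment for the exclusive end.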
We do everything we can to give everyone the same chance of seeing their Historics queries run as soon as possible. We try to kick off every query within 2,000 seconds, but we cannot make guarantees because performance depends on the load on our systems. Unless there is a serious problem, your job should start within 24 hours. Visit datasift.com/status to check the health of the DataSift platform.
You can hit the /historics/get endpoint at any time to check the status of one of your Historics queries.
No, but we promise to start each and every Historics query that you submit as soon as we can. Currently we are working on an API endpoint that will return an estimate of the start time.
This depends on the load on DataSift when you run your query and on the duration and complexity of the query. Once a query has started, hit the /historics/get endpoint and set the with_estimate parameter to 1. DataSift will respond with our estimated time to completion for your query.
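A minimal sketch of building such a status request follows. The `with_estimate` parameter comes from the text above; the `id` parameter name, the query-string authentication, and the credentials are assumptions, so check the API reference for the scheme your account uses:

```python
import urllib.parse

# Hypothetical credentials -- substitute your own DataSift username and API key.
USERNAME = "your_username"
API_KEY = "your_api_key"

def get_status_url(historics_id, with_estimate=True):
    """Build a /historics/get URL asking for an estimated completion time.

    `with_estimate` is documented above; `id`, `username`, and `api_key`
    as query-string parameters are assumptions about the request format.
    """
    params = {
        "username": USERNAME,
        "api_key": API_KEY,
        "id": historics_id,          # the job id returned by /historics/prepare
        "with_estimate": 1 if with_estimate else 0,
    }
    return "https://api.datasift.com/historics/get?" + urllib.parse.urlencode(params)

# "abc123" is a placeholder job id for illustration only.
url = get_status_url("abc123")
```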
It also depends on the timeframe and sample size of your Historics query. The timeframe of the query is the duration between the start date and time, and the end date and time of the query. The sample size of the output data can be either 100 percent or 10 percent of all the available data.
When you create a Historics query, it needs to access our data archive in order to retrieve output data for a selected timeframe. Our data archive could become very busy when multiple Historics queries are running at the same time. If the data archive is running over capacity, your query will be queued; that is, it will have to wait for access until other queries accessing the data archive have been executed. Although the queuing process takes a little time, keep in mind that Historics queries, once they are running, retrieve data 30 times faster than a real-time filter.
Once your Historics query has access to our data archive, it then depends on the timeframe of your query and the sample size of the output data. A Historics query with a shorter timeframe and a sample size of 10 percent is likely to execute more quickly than a query with a longer timeframe and a sample size of 100 percent.
When your job starts it is probably running in parallel with jobs from other users. This brings many efficiency advantages.
If you run two or more jobs that examine the same time period in our archive, they might run in parallel; our job scheduler uses a set of real-time algorithms to ensure jobs run as quickly as possible.
If you submit Historics queries from two separate accounts they can run in parallel.
This is a very difficult question to answer accurately because it varies depending on the load. We'll continue to add hardware to keep the processing time as short as possible. Under normal load, a Historics query that examines one week of our archive might take six hours to run. If no other jobs are running at the same time we would expect it to take no less than 75 minutes.
Possibly because your Historics query examines many days of data, or because heavy load on the system is causing temporary delays.
With a single Historics query you can look at up to one month of our archived data. To look at a longer time period, use two or more Historics queries and concatenate the results.
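One way to split a longer period into query-sized pieces is sketched below. The helper is hypothetical; it uses a 28-day window as a conservative stand-in for the one-month limit, producing back-to-back windows whose results you can concatenate:

```python
from datetime import datetime, timedelta

# Four weeks: a conservative stand-in for the documented one-month maximum.
MAX_SPAN = timedelta(days=28)

def split_period(start, end):
    """Split [start, end) into consecutive windows no longer than MAX_SPAN.

    Each window can be submitted as its own Historics query; because each
    window starts exactly where the previous one ends, concatenating the
    results covers the full period with no gaps or overlaps.
    """
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + MAX_SPAN, end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows

# Ten weeks of data becomes three back-to-back queries (28 + 28 + 14 days).
windows = split_period(datetime(2013, 1, 1), datetime(2013, 3, 12))
```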
Consult our Historics Archive Schema for more details.
First of all, if a Historics query fails before it runs to completion, any data collected so far is safe. Either it's already been delivered to your data destination or it's still in our Push buffer and it will be sent to you automatically. We divide your Historics query into one or more "chunks" and we run them in parallel to increase performance. You can hit /historics/get to discover which chunks have finished and which have not.
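To find the unfinished chunks, you can scan the per-chunk status in the /historics/get response. The response shape sketched here (a `chunks` list whose entries carry a `status` field) is an assumption, not taken from the API reference, so verify the actual field names:

```python
def incomplete_chunks(response):
    """Return the chunks of a Historics job that have not finished.

    Assumes a response shaped like {"chunks": [{"status": ...}, ...]};
    both field names and the "succeeded" status value are assumptions.
    """
    return [chunk for chunk in response.get("chunks", [])
            if chunk.get("status") != "succeeded"]

# Sample response for illustration only.
sample = {"chunks": [{"status": "succeeded"}, {"status": "running"}]}
pending = incomplete_chunks(sample)
```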
In the event of a problem we retry delivery several times, so outright failures are extremely rare. If you do experience one, you can create new Historics jobs to fill in the gaps. You might need to deduplicate the data. You will not be charged any DPUs for chunks that did not run to completion, because DPUs are calculated after successful delivery; however, you will be charged license fees, where applicable, for any duplicate data, because DataSift applies license costs immediately.
The problem might be with your Push destination rather than with the Historics query. In cases where you're expecting to receive data but none arrives, hit the /push/log endpoint to check the error log.
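Once you have the log entries, a small filter can surface the failures. The entry shape used here (`success` flag plus `message` text) is a guess at the /push/log format, so check the endpoint's reference documentation for the actual fields:

```python
def delivery_errors(log_entries):
    """Return the messages of /push/log entries that look like failures.

    Assumes each entry is shaped like {"success": bool, "message": str};
    these field names are assumptions, not taken from the API docs.
    """
    return [entry["message"] for entry in log_entries
            if not entry.get("success", True)]

# Sample log entries for illustration only.
log = [
    {"success": True, "message": "delivered 500 interactions"},
    {"success": False, "message": "connection refused by destination"},
]
errors = delivery_errors(log)
```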
Yes. By default, you will receive an email when a Historics job completes. Check your notification settings.