Pull

Updated on Monday, 16 June, 2014 - 17:57

The Pull Connector was built for customers who are behind a firewall and cannot open public endpoints, but who still want to receive data through the Push delivery mechanism. It provides the following benefits:
 

  • No need to modify your firewall or network configuration.

    With Pull, there is no need to set up public endpoints. It simplifies firewall and network management on your side.
     
  • Familiar API.

    Pull uses our standard REST interface. If you already know it, you know how to set up and use Pull.
     
  • Data collection and processing at your own pace.

    Your data is stored for an hour in a Push queue. You can collect it as often as you want and you can request as much of it as you want.
     
  • Rewind, if necessary.

    If you need to request data again, you can go back in time for up to an hour.
     
  • Built-in Push resiliency.

    Because the Pull Connector is based on the same resilient platform used by other Push connectors, you still get all of the stability of the Push platform with the added flexibility of data collection.

 

Configuring Pull delivery

  1. To enable delivery, you need to define a stream or a Historics query. Both return details required for a Push subscription: a successful stream definition returns a hash, and a Historics query returns an id. You need one of them (but not both) to set the value of the hash or historic_id parameter in a call to /push/create. To obtain that information, call /push/get or /historics/get, or use the DataSift dashboard.
     
  2. Once you have the stream hash or the Historics id, you can supply that information to /push/create. In the example below we are making that call using curl, but you are free to use any programming language or tool.


    For more information, read the step-by-step guide to the API to learn how to use Push with DataSift's APIs.
     
  3. When a call to /push/create is successful, you will receive a response that contains a Push subscription id. You will need that information to make successful calls to all other Push API endpoints (/push/delete, /push/stop, and others). You can retrieve the list of your subscription ids with a call to /push/get.
     
  4. You should now check that the Pull connector has some data for you:


    Note the trick used to limit the number of interactions. By setting size to 1 byte you are telling DataSift to deliver at most one byte of data, which DataSift interprets as at most one byte, extended to the nearest interaction boundary. If at least one interaction is available, DataSift will deliver it in the format of your choice (see the output_params.format parameter later on this page). The response looks like this:


     
    If no interactions are waiting in the queue but the Push subscription exists, DataSift will respond with the HTTP 204 code:
     
  5. Please wait for a minimum of one second before you make a call to /pull to collect data, and for at least 500 milliseconds between two consecutive calls to /pull using the same subscription ID. That means you can make no more than two requests per second for each subscription ID.

    If you don't receive any interactions, the stream may have no data in it or there may be a problem with your server's configuration. For more information, call /push/log and check the value of the success field. If it is set to failure, check the value of the message field for clues. Also call /push/get and see whether the response includes information about DataSift retrying to deliver data to your data delivery destination. When the status field is set to retrying, you should verify that your server is receiving data.

    Please watch your usage and add funds to your account when it is running low. Also, stop any subscriptions that are no longer needed; otherwise you will be charged for their usage. There is no need to delete them. You can have as many stopped subscriptions as you like without paying for them. Remember that any subscriptions that were paused automatically due to insufficient funds will resume when you add funds to your account.
     
  6. To stop delivery, call /push/stop. To remove your subscription completely, call /push/delete.
     
  7. Familiarize yourself with the output parameters.
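
The /push/create call described in step 2 can be sketched in Python as follows. This is a minimal sketch, not the page's original example. The hash, historic_id, and output_params.format parameters come from this page; the base URL, the "username:api_key" Authorization header, the name parameter, and the output_type value "pull" are assumptions that you should confirm against the API reference. The function only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed base URL; confirm it against the API reference for your account.
API_BASE = "https://api.datasift.com/v1"


def build_push_create(username, api_key, name, stream_hash=None,
                      historic_id=None, output_format="json_new_line"):
    """Build (but do not send) a /push/create request for Pull delivery."""
    # Either a stream hash or a Historics id is required, but not both.
    if (stream_hash is None) == (historic_id is None):
        raise ValueError("supply exactly one of stream_hash or historic_id")
    params = {
        "name": name,                 # assumed parameter name
        "output_type": "pull",        # assumed connector identifier
        "output_params.format": output_format,
    }
    if stream_hash is not None:
        params["hash"] = stream_hash
    else:
        params["historic_id"] = historic_id
    return Request(
        API_BASE + "/push/create",
        data=urlencode(params).encode("ascii"),
        # Assumed authentication scheme: "username:api_key".
        headers={"Authorization": f"{username}:{api_key}"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send the request; a successful response contains the Push subscription id used in the later steps.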

 

Output parameters (Pull setup)

Parameter: output_params.format (optional, default = json_new_line)

Description: The output format for your data:
  • json_meta - The payload is a full JSON document. It contains metadata and an "interactions" property that holds an array of interactions.
  • json_array - The payload is a full JSON document that contains just an array of interactions.
  • json_new_line - The current default format, where the payload is NOT a full JSON document. Each interaction is flattened and separated by a line break.

If you set this parameter to json_meta, your output consists of JSON metadata followed by a JSON array of interactions (wrapped in square brackets and separated by commas).

Take a look at our Sample Output for File-Based Connectors page.

If you select json_array, DataSift omits the metadata and sends just the array of interactions.

If you omit this parameter or select json_new_line, DataSift omits the metadata and sends each interaction as a single JSON object.
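
To illustrate how the three formats differ on the receiving side, here is a minimal Python sketch that turns one response body into a list of interactions. The function name parse_payload is hypothetical, and the sketch relies only on the format descriptions above, not on any real payload.

```python
import json


def parse_payload(body, fmt="json_new_line"):
    """Return the list of interactions contained in one response body."""
    if fmt == "json_meta":
        # Full JSON document: metadata plus an "interactions" array.
        return json.loads(body)["interactions"]
    if fmt == "json_array":
        # Full JSON document that is just an array of interactions.
        return json.loads(body)
    if fmt == "json_new_line":
        # One JSON object per line. Split on "\n" only, because
        # str.splitlines() also splits on characters that may legally
        # appear unescaped inside JSON strings.
        return [json.loads(line) for line in body.split("\n") if line.strip()]
    raise ValueError(f"unknown output format: {fmt!r}")
```

The json_new_line branch can also be processed one line at a time, which avoids holding a large batch in memory as a single parsed document.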

 

 

Collecting data using Pull

When you want to collect the next batch of interactions, you need to call the /pull endpoint. If the subscription whose id you provided exists, the DataSift platform responds with one of two HTTP status codes:

  • 200 OK -- the response includes at least one interaction.
  • 204 No Content -- the subscription exists, but the queues are empty. You need to wait 500 milliseconds before you make another call using the same subscription id. (There is a limit of two requests per second per subscription ID).
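
The handling of these two status codes can be sketched as a small collection loop. This is an illustrative sketch only: fetch is a hypothetical caller-supplied function that performs one /pull request and returns a (status_code, body) pair, and handle is whatever your application does with a batch.

```python
import time


def collect(fetch, handle, max_calls, pause=0.5, sleep=time.sleep):
    """Call fetch() up to max_calls times, passing each 200 body to handle().

    Returns the number of batches that contained interactions.
    """
    batches = 0
    for attempt in range(max_calls):
        if attempt:
            # Wait 500 ms between consecutive calls for one subscription.
            sleep(pause)
        status, body = fetch()
        if status == 200:
            handle(body)        # at least one interaction was delivered
            batches += 1
        elif status != 204:     # 204 means the queue is currently empty
            raise RuntimeError(f"unexpected HTTP status {status}")
    return batches
```

The sleep argument is injectable so that the loop can be exercised in tests without real delays.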

 

When all goes well and you want to collect another batch of interactions, simply call /pull again. If, for some reason, you want to retrieve the last batch of interactions again, you need to use the cursor parameter in your call to /pull. The value of the cursor parameter can be found in the response headers, so you should implement a mechanism for capturing those headers. A 200 OK response returns two values:

  • X-DataSift-Cursor-Current -- the pointer to the beginning of the batch of interactions you requested.
  • X-DataSift-Cursor-Next -- the pointer to the beginning of the next batch of interactions.
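
One possible mechanism for capturing these headers is sketched below. The function name extract_cursors is hypothetical; it accepts any mapping of header names to values, such as the headers attribute of a urllib response.

```python
def extract_cursors(headers):
    """Return (current, next) cursor values from the headers of a response.

    HTTP header names are case-insensitive, so keys are lower-cased before
    the lookup. A missing header yields None.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    return (lowered.get("x-datasift-cursor-current"),
            lowered.get("x-datasift-cursor-next"))
```

Storing the current value alongside each batch you process lets you request that batch again later by passing the value as the cursor parameter.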
     

There is a limit of two requests per second made with the same subscription ID.
 

To request a batch of interactions that you have a cursor for:

 

Output parameters (Pull data collection)

Parameter: cursor (optional)
Description: A pointer into the Push queue associated with the current Push subscription.

Parameter: size (optional, default = 50MB)
Description: The maximum amount of data that DataSift will send in a single batch. Can be any value from 1 byte through 52,428,800 bytes (50MB). DataSift may deliver slightly more data if the last interaction's boundary falls outside of the given value of the size parameter. If you omit this parameter, the default is 50MB.
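
As an illustration of how these two parameters combine, the sketch below builds a /pull URL and checks size against the documented range before any request is made. The base URL and the name of the subscription parameter (id) are assumptions, and build_pull_url is a hypothetical helper.

```python
from urllib.parse import urlencode

API_BASE = "https://api.datasift.com/v1"   # assumed base URL
MAX_SIZE = 52_428_800                      # 50MB, the documented maximum


def build_pull_url(subscription_id, size=None, cursor=None):
    """Return a /pull URL; size and cursor are added only when given."""
    if size is not None and not 1 <= size <= MAX_SIZE:
        raise ValueError(f"size must be between 1 and {MAX_SIZE} bytes")
    params = {"id": subscription_id}       # assumed parameter name
    if size is not None:
        params["size"] = size
    if cursor is not None:
        params["cursor"] = cursor
    return f"{API_BASE}/pull?{urlencode(params)}"
```

Calling build_pull_url with size=1 reproduces the probe described earlier on this page, where a one-byte limit yields a single interaction.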

 

Errors

When you use a cursor that points to a batch of interactions that no longer exist in the queue, you will receive error 400 Bad Request:
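
Client code may want to tell this case apart from the normal 200 and 204 responses. The sketch below is one possible approach, with hypothetical names throughout. It treats any 400 received on a request that carried a cursor as an expired cursor, which is a simplification: a 400 could have other causes that this page does not describe.

```python
class CursorExpired(Exception):
    """The requested cursor points to data that has left the queue."""


def check_pull_status(status, cursor=None):
    """Return True if a batch was delivered, False if the queue was empty."""
    if status == 200:
        return True
    if status == 204:
        return False
    if status == 400 and cursor is not None:
        # Simplifying assumption: a 400 on a cursor request means the
        # batch is no longer available.
        raise CursorExpired(f"cursor {cursor!r} is no longer available")
    raise RuntimeError(f"unexpected HTTP status {status}")
```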

 

 

Notes

  • Twitter sends delete messages which identify Tweets that have been deleted. Under your licensing terms, you must process these delete messages and delete the corresponding Tweets from your storage.
     
  • Cursors may point to data that is no longer stored in the subscription queue.

Limitations

1. You are responsible for collecting data from the Push queues in a timely manner. If you lose data, you can request it again as long as it is still available in the Push queue.

2. There is a limit of two requests per second made with the same subscription ID.
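
A client that polls several subscriptions can enforce the second limitation locally. The class below is a minimal sketch with hypothetical names: it remembers the time of the last call for each subscription ID and sleeps only when the next call would come less than 500 milliseconds later. The clock and sleep functions are injectable so that the behavior can be tested without real delays.

```python
import time


class PullThrottle:
    """Enforce a minimum interval between calls for each subscription ID."""

    def __init__(self, min_interval=0.5, clock=time.monotonic,
                 sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last_call = {}    # subscription ID -> time of last call

    def wait(self, subscription_id):
        """Block until a call for subscription_id is allowed, then record it."""
        now = self.clock()
        last = self._last_call.get(subscription_id)
        if last is not None:
            remaining = self.min_interval - (now - last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last_call[subscription_id] = now
```

Calling wait with the subscription ID immediately before each /pull request keeps that subscription at or below two requests per second, while other subscriptions are not delayed.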