HTTP

Updated on Tuesday, 5 August, 2014 - 14:17

As an alternative to HTTP Streaming and WebSockets, which require you to maintain a long-lived socket connection with us, we offer a more robust and efficient way to deliver data to your servers using HTTP POST or PUT methods. In this scenario, DataSift acts like a user uploading a file to an HTTP server via a web browser. The interactions you filter for will be delivered in batches, in JSON format. You will need to set up your own HTTP server and write code to handle the uploads. It is your responsibility to ensure that your HTTP server can handle the volume of data DataSift sends.

Here are some guidelines to help you write code to use HTTP destinations successfully with Push. Read the step-by-step guide to the API to learn how to work with DataSift's APIs.

Configuring HTTP for Push delivery

When you choose to use your own HTTP server to receive interactions from DataSift you need to write code to handle communication with us properly. You are free to choose any technology and software you like, provided you follow the instructions below. For your convenience, we have also provided source code of a simple non-authenticating, non-SSL HTTP server that you can run on your side. It is designed to help you test this connector and you are free to use it and extend it, but it is not supported.

Authentication

The DataSift HTTP connector currently offers one HTTP authentication method: basic authentication. The alternative is to use no authentication. When you combine basic authentication with SSL (see the appropriate section later on this page) and a firewall that only allows traffic to and from the range of IP addresses managed by DataSift, you end up with a very secure communication channel.

SSL Support

For extra security, DataSift can deliver interactions over a secure SSL connection. You need to turn it on and let DataSift know if you want us to verify the validity of your SSL certificate. See the list of the output parameters.

POST or PUT

DataSift supports two HTTP methods for data delivery: POST and PUT. You can set the delivery method in the parameters of a /push/create call for the HTTP connector.

HTTP headers

Requests made by DataSift's HTTP connector include additional headers. They contain useful information about the data sent to your HTTP server. You can use them to create unique filenames, database rows/tables, or content handlers.

Element                 Content
X-Datasift-Hash         For a Historics query, this contains the Historics id; for a stream, this contains the stream hash.
X-Datasift-Hash-Type    Either "historic" or "stream".
X-Datasift-Id           The subscription id for this query.
Content-Encoding        Set to "gzip" if the content is in GZIP format; this header is absent if the content is not compressed.
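The headers above can be used, for example, to derive a unique filename for each delivery. Here is a minimal sketch; the helper name and filename layout are illustrative, but the header names are the ones listed above:

```python
# Hypothetical helper: build a unique filename from the Push delivery headers.
import time

def filename_for_delivery(headers):
    """Derive a name like DataSift-<subscription_id>-<timestamp>.json[.gz]."""
    subscription_id = headers.get("X-Datasift-Id", "unknown")
    # Deliveries with Content-Encoding: gzip arrive compressed.
    suffix = ".gz" if headers.get("Content-Encoding") == "gzip" else ".json"
    return "DataSift-%s-%d%s" % (subscription_id, int(time.time()), suffix)
```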

GZIP compression

The default data delivery format used by DataSift is uncompressed plain-text JSON. If your server cannot process large amounts of data, or if you do not have enough bandwidth, consider using compression. DataSift is happy to deliver compressed data; all it takes is an extra parameter in a /push/create call. Remember to store and uncompress the data you receive on your side.

Responding quickly

When receiving a batch of data, your server must respond with a success message within 10 seconds. Otherwise, the call will time out, the delivery will be considered a failure, and it will be reattempted. Please make sure your code can process and store the data fast enough.
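One common way to meet the 10-second deadline is to acknowledge the delivery immediately and do the storage work in the background. A minimal sketch, assuming you wire `handle_delivery` into your own HTTP framework (the names here are illustrative):

```python
# Decouple the fast HTTP response from slower storage work with a queue.
import queue
import threading

deliveries = queue.Queue()

def handle_delivery(body):
    """Called from the HTTP handler: enqueue the raw body and return at once."""
    deliveries.put(body)
    return 200  # respond with a success status right away

def storage_worker():
    """Drain the queue in the background."""
    while True:
        body = deliveries.get()
        # ... write `body` to disk or a database here ...
        deliveries.task_done()

threading.Thread(target=storage_worker, daemon=True).start()
```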

Handling POST requests

DataSift needs to make sure that the HTTP server to which it will try to send the data can accept the data. Your server must pass a simple test:

  1. The first thing that DataSift does with HTTP POST is to send an empty JSON object to your URL:

        {}

     The empty JSON object string will be sent in the body of the request. Your server-side code must recognize it and react accordingly (see the next step).

  2. Your server must send back a success message with a status code of 200 to 299, otherwise DataSift will issue an error message and will not send you data.


There is no need to call /push/create repeatedly; a call to /push/validate sends the empty JSON object and checks that your server sends the success message.

PUT

If you use PUT, you must send back a message with a status code of 200 to 299, but DataSift does not check the content of the message.

Testing Push with a test HTTP server

To help you test the HTTP connector we are giving you some code to play with. The following Python script implements a test HTTP server that you can use as a starting point for your own servers. It performs no authentication or SSL, but it does support compressed data delivery. Whatever data it receives will be written to files in the local directory. The files will use the DataSift-<subscription_id>-<time>.json filename pattern for uncompressed data and DataSift-<subscription_id>-<time>.gz for compressed data. You need Python 2.7 or later and a local installation of the Tornado HTTP server. Once you have both pieces of software installed, run the server:
 
python ./connector-http-server.py

Configuring Push for HTTP delivery

  1. To enable delivery, you first need to define a stream or a Historics query; both return details required for a Push subscription. A successful stream definition returns a hash; a Historics query returns an id. You will need one of them (but not both) to set the value of the hash or historic_id parameter in a call to /push/create. You can obtain that information with a call to /push/get or /historics/get, or from the DataSift dashboard.
     
  2. Once you have the stream hash or the Historics id, supply that information in a call to /push/create. You can make that call with curl or any other programming language or tool.
     
  3. For more information, read the step-by-step guide to the API to learn how to use Push with DataSift's APIs.
     
  4. When a call to /push/create is successful, you will receive a response that contains a Push subscription id. You will need that information to make successful calls to all other Push API endpoints (/push/delete, /push/stop, and others). You can retrieve the list of your subscription ids with a call to /push/get.
     
  5. You should now check that the data is being delivered to your server. If you are using your own custom solution, you will know how to do it, but if you are using our test server, you need to log in to the machine running our test HTTP server and list the contents of /tmp:

    ls /tmp/DataSift*

    When DataSift is able to connect and deliver interactions to this directory, the test HTTP server will use filenames that follow the patterns described in the "Testing Push with a test HTTP server" section earlier on this page. Please remember that the earliest time you can expect the first data delivery is one second after the period of time specified in the output_params.delivery_frequency parameter. If there is a longer delay, this might be because the stream has no data in it or because there is a problem with your server's configuration. In the first case, preview your stream using the DataSift web console; in the second, make a call to /push/log to see if it has any additional information.

    Please make sure that you watch your usage and add funds to your account when it runs low. Also, stop any subscriptions that are no longer needed, otherwise you will be charged for their usage. There is no need to delete them; you can have as many stopped subscriptions as you like without paying for them. Remember that any subscriptions that were paused automatically due to insufficient funds will resume when you add funds to your account.
     
  6. To stop delivery, call /push/stop. To remove your subscription completely, call /push/delete.
     
  7. Familiarize yourself with the output parameters (for example, the host name) you'll need to know when you send data to an HTTP server.
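The /push/create call from step 2 can be built in any language. A sketch of the request body in Python follows; the output_params.* names come from the table later on this page, while the `name` and `output_type` parameters and all the placeholder values are assumptions you should replace with your own:

```python
# Build the form-encoded body for a hypothetical /push/create call
# using the HTTP connector. Values are placeholders.
from urllib.parse import urlencode

params = {
    "name": "my-http-subscription",            # a label of your choosing
    "hash": "YOUR_STREAM_HASH",                # or historic_id for a Historics query
    "output_type": "http",
    "output_params.method": "post",
    "output_params.url": "http://www.fromdatasift.com/destination/",
    "output_params.delivery_frequency": "60",
    "output_params.max_size": "10485760",
    "output_params.auth.type": "basic",
    "output_params.auth.username": "user",
    "output_params.auth.password": "pass",
    "output_params.compression": "none",
    "output_params.verify_ssl": "false",
}
body = urlencode(params)  # POST this body to the /push/create endpoint
```

Note that urlencode percent-encodes the destination URL for you, which satisfies the URL-encoding requirement described under output_params.url.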

Notes

Twitter sends delete messages which identify Tweets that have been deleted. Under your licensing terms, you must process these delete messages and delete the corresponding Tweets from your storage.

Output parameters

DataSift sends JSON data once it becomes available, according to the "delivery_frequency" interval you configure in the output parameters. Data is bundled into batches of up to "max_size" bytes. Your server must respond with a success message upon each delivery.

Parameter: Description:
output_params.format
optional
default = json_meta
The output format for your data:
  • json_meta - The current default format, where each payload contains a full JSON document. It contains metadata and an "interactions" property that has an array of interactions.
  • json_array - The payload is a full JSON document, but just has an array of interactions.
  • json_new_line - The payload is NOT a full JSON document. Each interaction is flattened and separated by a line break.

If you omit this parameter or set it to json_meta, your output consists of JSON metadata followed by a JSON array of interactions (wrapped in square brackets and separated by commas).

Take a look at our Sample Output for File-Based Connectors page.

If you select json_array, DataSift omits the metadata and sends just the array of interactions.

If you select json_new_line, DataSift omits the metadata and sends each interaction as a single JSON object.
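The three formats described above can be handled with one small parsing helper. This is a sketch based on the descriptions on this page; the function name is illustrative:

```python
# Extract the list of interactions from a delivery body in any of the
# three documented output formats.
import json

def parse_payload(raw, fmt):
    if fmt == "json_meta":
        # Full JSON document: metadata plus an "interactions" array.
        return json.loads(raw)["interactions"]
    if fmt == "json_array":
        # Full JSON document: a bare array of interactions.
        return json.loads(raw)
    if fmt == "json_new_line":
        # Not a full JSON document: one JSON object per line.
        return [json.loads(line) for line in raw.splitlines() if line.strip()]
    raise ValueError("unknown format: %s" % fmt)
```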

output_params.method
required
The verb that you want DataSift to use with the HTTP request:
  • POST
  • PUT
output_params.url
required

Any valid URL that you want DataSift to deliver to; for example:

    http://www.fromdatasift.com/destination/

For POST requests:

DataSift uses the URL that you specify.

For PUT requests:

DataSift appends a filename to the URL.

For example, suppose that you supply this URL:
     http://www.fromdatasift.com/destination/

Internally we append the filename in this format:  DataSift-<subscription_id>-<time>.json

When you hit the /push/create endpoint for the first time, we make a PUT request to http://www.fromdatasift.com/destination/DataSift-verify-31546216.json (no subscription id exists yet, so we use "verify" in the file name; 31546216 is the time of the test).

When DataSift has data ready for delivery using a PUT request, it sends it to http://www.fromdatasift.com/destination/DataSift-abcdefghij1234579-31546216.json (where the abcdefghij1234579 is the subscription id and 31546216 is the time of delivery).

Make sure that the URL is properly percent-encoded, otherwise your /push/create request will fail.

output_params.delivery_frequency
required

The minimum number of seconds you want DataSift to wait before sending data again.

output_params.max_size
required
The maximum amount of data that DataSift will send in a single batch:

  • 102400 (100KB)
  • 256000 (250KB)
  • 512000 (500KB)
  • 1048576 (1MB)
  • 2097152 (2MB)
  • 5242880 (5MB)
  • 10485760 (10MB)
  • 20971520 (20MB)
output_params.auth.type
required

The authentication that you want DataSift to use when connecting to output_params.url:

  • basic
  • none

If you choose basic authentication, you must supply output_params.auth.username and output_params.auth.password.

If you specify "none" for authentication, or if you do not include this parameter, DataSift does not check for a username or password.

output_params.verify_ssl
required

Specify whether or not you want DataSift to verify your SSL certificate, checking that it originates from a legitimate Certificate Authority. Can be:

  • true
  • false
output_params.compression
required

The compression setting that you want DataSift to use:

  • none
  • zlib
  • gzip

If you set this parameter to zlib, DataSift compresses the data using the ZLIB compression standard with the compression level 6. We also add an additional entry to the header if you choose zlib or gzip:

    Content-Encoding: gzip

output_params.auth.username
required
if output_params.auth.type = basic
The username for authentication.

output_params.auth.password
required
if output_params.auth.type = basic
The password for authentication.

Data format delivered: 

JSON document containing an array of JSON objects, each representing one DataSift interaction. Here's an example of the output.

Storage type: 

For each delivery, DataSift sends all the data that is available. It is up to you to configure and manage the HTTP server to handle the storage.

Limitations: 

Take care when you set the max_size and delivery_frequency output parameters. If the stream your Push subscription is based on generates data faster than your settings permit delivery, the buffer will fill up until it reaches the point where data may be discarded.