Push: General Notes

Buffering

Our internal buffers hold up to one hour of data for you. For example, if your systems are temporarily out of service and you want to ensure that none of your data is lost, you can pause delivery. If you allow more than an hour to pass without resuming delivery, your subscription will fail.

Take a look at our status codes to see the different states that the internal buffers can be in.

Being responsive

If you're using a storage system such as Amazon DynamoDB, DataSift handles the communication for you. If you're using a connector such as HTTP POST, DataSift gives you 10 seconds to acknowledge each delivery. Make sure your code respects that limit (see the sketch under 'Creating your destination' below).

Handling subscriptions

There is a limit of 200 active, retrying, or paused push subscriptions per account.

Creating your destination

Make sure that you have correctly configured the destination. For example, if you want to use HTTP POST to deliver data, you will need to create an endpoint that can accept POST requests and correctly handle the data when it reaches you.
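
As a minimal sketch, a destination for the HTTP POST connector only needs to accept POST requests and acknowledge them quickly. The use of Flask, the endpoint path, and the port below are illustrative assumptions, not DataSift requirements; the point is to queue the payload first and respond well inside the 10-second limit.

    import queue
    import threading

    from flask import Flask, request

    app = Flask(__name__)
    work = queue.Queue()   # hand deliveries off to a background worker

    def worker():
        while True:
            payload = work.get()
            # ... parse and store the interactions from the payload here ...
            work.task_done()

    threading.Thread(target=worker, daemon=True).start()

    @app.route("/datasift", methods=["POST"])
    def receive():
        work.put(request.get_json(force=True))   # queue the payload first...
        return "", 200                           # ...then acknowledge immediately

    if __name__ == "__main__":
        app.run(port=8080)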

Pinging

When you create a subscription, the push/create endpoint automatically sends a ping to your destination to verify that it is reachable and correctly configured before any data is delivered.

You can also ping manually by hitting the push/validate endpoint.
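
A manual ping might look like the following sketch, which uses Python's requests library. The base URL, API version, Authorization header format, and output parameter names are assumptions to check against your account details and the connector documentation.

    import requests

    resp = requests.post(
        "https://api.datasift.com/v1/push/validate",
        headers={"Authorization": "your-username:your-api-key"},
        data={
            "output_type": "http",
            "output_params.url": "http://example.com/datasift",
        },
    )
    print(resp.status_code, resp.text)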

Setting the output parameters

Each connector takes its own set of output parameters. You supply them in calls to the push/create, push/validate, and push/update endpoints to control settings such as the minimum interval that DataSift waits before delivering the next package of data and the maximum size of each package (or batch).

Of course, if the throughput of incoming data is low, your packages might be smaller or arrive less frequently than you expect.

Conversely, if you set the package size to 1MB and the minimum delivery interval to 30 seconds, you are limiting yourself to 2MB per minute. That might be sufficient for your stream, but if there is a surge in data volume for any reason, data could back up in the queue, and DataSift might deliver more frequently than your delivery_frequency specifies. We recommend setting the max_size limit high enough that a queued backlog can be delivered before it expires.
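
To make that worked example concrete, a push/create call with those settings might look like the sketch below. The base URL, header format, and exact parameter spellings are assumptions to verify against the connector documentation; delivery_frequency and max_size are the settings discussed above.

    import requests

    resp = requests.post(
        "https://api.datasift.com/v1/push/create",
        headers={"Authorization": "your-username:your-api-key"},
        data={
            "name": "example-subscription",
            "hash": "your-stream-hash",               # the stream to deliver
            "output_type": "http",
            "output_params.url": "http://example.com/datasift",
            "output_params.delivery_frequency": 30,   # wait at least 30 seconds between packages
            "output_params.max_size": 1048576,        # 1MB per package (assuming bytes)
        },
    )
    print(resp.text)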

We recommend that you test your streams as you are developing them, to get an idea of the typical volume. But be aware that any topic you choose to filter against might unexpectedly turn into a trending topic with high volume.

Looking at the message logs

The push/log endpoint allows you to see DataSift's message and error logs. Make sure you keep an eye on those so that you can detect problems early and respond quickly to avoid data loss.
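
A simple polling sketch might look like this; the id parameter and the log_entries field in the response are assumptions to verify against the endpoint documentation.

    import requests

    resp = requests.get(
        "https://api.datasift.com/v1/push/log",
        headers={"Authorization": "your-username:your-api-key"},
        params={"id": "your-subscription-id"},
    )
    for entry in resp.json().get("log_entries", []):
        print(entry)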

Backing off

Push has a generic 'back-off' process that it uses when delivery fails. We keep your data safe by buffering it. In the event of a problem, we'll try to resend up to five times immediately. Then, we'll retry every minute for 60 minutes. Once the connection is restored, we'll revert to the original delivery frequency. To determine why a failure occurred, hit the push/log endpoint.

If we can't connect to your data destination after trying for one hour, we place your subscription into a non-recoverable failed state.

Monitoring status

Make sure you monitor the status of a Push subscription using the push/get endpoint. If you're using one of our client libraries, it does this for you automatically.
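
A status check might look like the following sketch; the status values tested for are illustrative, so compare them against our documented status codes.

    import requests

    resp = requests.get(
        "https://api.datasift.com/v1/push/get",
        headers={"Authorization": "your-username:your-api-key"},
        params={"id": "your-subscription-id"},
    )
    status = resp.json().get("status")
    if status in ("retrying", "failed"):
        print("Delivery trouble; check push/log. Status:", status)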

Pausing and resuming

During outages you can deliberately pause data delivery for up to one hour.
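
Assuming the push/pause and push/resume endpoints, a pause-and-resume cycle might look like this sketch; remember the one-hour buffer described above and resume before it elapses.

    import requests

    AUTH = {"Authorization": "your-username:your-api-key"}
    BASE = "https://api.datasift.com/v1"

    requests.post(BASE + "/push/pause", headers=AUTH,
                  data={"id": "your-subscription-id"})
    # ... bring your systems back up within the hour ...
    requests.post(BASE + "/push/resume", headers=AUTH,
                  data={"id": "your-subscription-id"})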

Handling delete and status messages

Besides the typical data it handles, Push also carries other objects such as status messages and delete messages. For example, Twitter delete messages identify Tweets that have been deleted. Under the terms of your license, you are obliged to process them: if you receive and store a Tweet and subsequently receive a delete message for that Tweet, you must delete it from your storage.

Where possible, we handle the deletes for you. For example, if you are using Push to store data in a DynamoDB database, we can locate the Tweet and delete it for you. However, in cases where we cannot access the data (if you are using FTP or HTTP, for example), you must handle the delete messages yourself.
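
Where you do handle deletes yourself, the logic might look like the sketch below. The nesting of the payload and the deleted flag are assumptions about the output format, and the stored dictionary stands in for whatever persistence layer you use.

    stored = {}   # stands in for your real storage layer

    def handle(interaction):
        inner = interaction.get("interaction", {})
        if inner.get("deleted"):                  # the flag name is an assumption
            stored.pop(inner.get("id"), None)     # honour the delete message
        else:
            stored[inner.get("id")] = interaction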

Deduplication

In some situations you might receive duplicate data. For example, when we send data to you, we keep it safely buffered until you acknowledge receipt. If, for any reason, your client does not send an acknowledgement, we will resend the data to make sure it reaches you.

In cases where deduplication is important for your application, we recommend that you use the interaction_id element in the output to identify duplicates and write scripts to eliminate them.
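
A minimal in-memory sketch might look like this; where interaction_id appears in your output format is an assumption to verify, and a real system would persist the seen set (in a database, for example) rather than keep it in memory.

    seen = set()   # in production, persist this between runs

    def accept(interaction):
        key = interaction.get("interaction_id")   # location of the id is an assumption
        if key in seen:
            return False    # duplicate delivery: drop it
        seen.add(key)
        return True         # first sighting: process it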