Amazon AWS S3 is scalable storage as a service. Typically, you might use it when you do not know how much storage you are going to need but suspect you may need a lot. Just like Amazon AWS DynamoDB, it saves you time and money because it is hosted in the Amazon AWS cloud and you pay only for the resources you actually use. Unlike the Amazon AWS DynamoDB connector, the Amazon AWS S3 connector saves multiple interactions in JSON format in plain text files.

Configuring Amazon AWS S3 for Push delivery

To use Amazon AWS S3 with Push delivery, follow the instructions below, skipping any steps you have already completed. It does not matter which operating system you use as long as you can connect to the internet:

  1. Create a new S3 bucket. (You need to have an Amazon AWS account.)

    There are two ways of doing this: programmatically (see the sketch after this list) or via a web browser. In the examples on this page we will use an S3 bucket called datasift-s3.
     
  2. Create a new folder inside the datasift-s3 bucket.

    There are two ways of doing this, too: programmatically or via a web browser. In the examples on this page we will use a folder called interactions.
     
  3. You are now ready to set up the Amazon AWS S3 connector.
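
For the programmatic route, here is a minimal sketch using the AWS CLI (an assumption: the aws command is installed and configured with your own credentials; note that S3 has no real folders, so the "folder" is simply a key prefix, created here as an empty object):

    # Create the bucket (bucket names are global, so yours must be unique)
    aws s3 mb s3://datasift-s3

    # Create an empty "interactions/" key, shown as a folder in the S3 console
    aws s3api put-object --bucket datasift-s3 --key interactions/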

 

Configuring Push for Amazon AWS S3 delivery

  1. To enable delivery, you will need to define a stream or a Historics query. Both return important details required for a Push subscription: a successful stream definition returns a hash, while a Historics query returns an id. You will need one (but not both) to set the value of the hash or historic_id parameter in a call to /push/create. To obtain that information, make a call to /push/get or /historics/get, or use the DataSift dashboard.
     
  2. Once you have the stream hash or the Historics id, you can give that information to /push/create. In the example after this list we make that call using curl, but you are free to use any programming language or tool.
     
  3. For more information, read the step-by-step guide to the API to learn how to use Push with DataSift's APIs.
     
  4. When a call to /push/create is successful, you will receive a response that contains a Push subscription id. You will need that id to make successful calls to all other Push API endpoints (/push/delete, /push/stop, and others). You can retrieve the list of your subscription ids with a call to /push/get.
     
  5. You should now check that the data is being delivered to your Amazon AWS S3 bucket/folder. Log in to your AWS account and examine the contents of the datasift-s3/interactions folder. When DataSift is able to connect and deliver interactions to this directory, it uses filenames that follow the pattern described in the output_params.file_prefix output parameter definition later on this page.

    Please remember that the earliest time you can expect the first data delivery is one second after the period of time specified in the output_params.delivery_frequency parameter. If there is a longer delay, either the stream has no data in it or there is a problem with your server's configuration. In the first case, preview your stream using the DataSift web console; in the second, make a call to /push/log to see if there are any clues in there.

    Please make sure that you watch your usage and add funds to your account when it runs low. Also, stop any subscriptions that are no longer needed; otherwise you will be charged for their usage. There is no need to delete them: you can have as many stopped subscriptions as you like without paying for them. Remember that any subscriptions that were paused automatically due to insufficient funds will resume when you add funds to your account.
     
  6. To stop delivery, call /push/stop. To remove your subscription completely, call /push/delete.
     
  7. Familiarize yourself with the output parameters (for example, the bucket name) you'll need to know when you send data to an Amazon AWS S3 bucket.
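
Here is a minimal sketch of the /push/create call using curl. The output_type value, the endpoint version prefix, and the Auth header format are assumptions to check against the API reference; replace the placeholder credentials, hash, and subscription id with your own:

    # Create a Push subscription that delivers to the datasift-s3 bucket.
    # --data-urlencode ensures the S3 keys are properly encoded, as required.
    curl -X POST 'https://api.datasift.com/v1/push/create' \
        -H 'Auth: YOUR_USERNAME:YOUR_API_KEY' \
        -d 'name=My S3 subscription' \
        -d 'hash=YOUR_STREAM_HASH' \
        -d 'output_type=s3' \
        -d 'output_params.bucket=datasift-s3' \
        -d 'output_params.directory=interactions' \
        -d 'output_params.acl=private' \
        -d 'output_params.delivery_frequency=60' \
        -d 'output_params.max_size=10485760' \
        -d 'output_params.format=json_meta' \
        --data-urlencode 'output_params.auth.access_key=YOUR_ACCESS_KEY' \
        --data-urlencode 'output_params.auth.secret_key=YOUR_SECRET_KEY'

    # Stop delivery, or remove the subscription completely, using its id
    curl -X POST 'https://api.datasift.com/v1/push/stop' \
        -H 'Auth: YOUR_USERNAME:YOUR_API_KEY' -d 'id=YOUR_SUBSCRIPTION_ID'
    curl -X POST 'https://api.datasift.com/v1/push/delete' \
        -H 'Auth: YOUR_USERNAME:YOUR_API_KEY' -d 'id=YOUR_SUBSCRIPTION_ID'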

 

IAM permissions

IAM permissions provide a way for an Amazon Web Services master user to delegate file and directory permissions to other users. For instance, you might want to allow a user access only to the directory where they push their DataSift data. Here's an example policy that you can modify and submit to AWS (the bucket name bucket_name_here, the directory my_directory, and the exact actions shown are placeholders to adapt to your own setup):
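
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowLocatingTheBucket",
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation"],
                "Resource": "arn:aws:s3:::bucket_name_here"
            },
            {
                "Sid": "AllowListingMyDirectory",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": "arn:aws:s3:::bucket_name_here",
                "Condition": {"StringLike": {"s3:prefix": ["my_directory/*"]}}
            },
            {
                "Sid": "AllowWritingAndCopyingFiles",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": "arn:aws:s3:::bucket_name_here/my_directory/*"
            },
            {
                "Sid": "AllowDeletingFiles",
                "Effect": "Allow",
                "Action": ["s3:DeleteObject"],
                "Resource": "arn:aws:s3:::bucket_name_here/my_directory/*"
            }
        ]
    }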

 

The first two sections allow the user to list files in the "my_directory" directory inside the "bucket_name_here" bucket. The next two sections allow the user to put/move/delete files inside the "my_directory" directory inside the "bucket_name_here" bucket.

In the user tab of the security credentials page in the Amazon Web Services web console, you can create a new user:

  1. Click Create New Users.
     
  2. Enter a username such as "restricted_credentials".
     
  3. Click Download credentials (this will give you a file containing the access_key and secret_key that you put into your DataSift destination).
     
  4. Back on the list of users check the box next to "restricted_credentials".
     
  5. Choose the Permissions tab at the bottom of the screen.
     
  6. Click Attach User Policy.
     
  7. Choose Custom Policy then Select.
     
  8. Type a Policy Name such as "restricted_policy".
     
  9. Paste the JSON shown above into the Policy Document box.

You're done! Now you can use the access_key and secret_key from the downloaded file, and DataSift will push to S3 with the restricted, more secure credentials.
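
If you prefer to script this instead, a minimal sketch with the AWS CLI (an assumption: the aws command is configured with your master credentials, and the policy JSON above is saved as restricted_policy.json) achieves the same result:

    # Create the restricted user and attach the custom policy
    aws iam create-user --user-name restricted_credentials
    aws iam put-user-policy --user-name restricted_credentials \
        --policy-name restricted_policy \
        --policy-document file://restricted_policy.json

    # Generate the access_key and secret_key for your DataSift destination
    aws iam create-access-key --user-name restricted_credentials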

 

Notes

Twitter sends delete messages which identify Tweets that have been deleted. Under your licensing terms, you must process these delete messages and delete the corresponding Tweets from your storage.
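
As an illustration only, a cleanup pass over files delivered in json_meta format might start like this sketch, which assumes delete notices arrive as interactions carrying a deleted flag and that jq is installed (check the delete-message documentation for the exact payload shape):

    # Print the ids of interactions flagged as deleted in each delivered file,
    # so the corresponding Tweets can be removed from downstream storage
    for f in *.json; do
        jq -r '.interactions[] | select(.deleted == true) | .interaction.id' "$f"
    done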
 

Output parameters

output_params.format
optional
default = json_meta
The output format for your data:
  • json_meta - The current default format, where each payload contains a full JSON document. It contains metadata and an "interactions" property that has an array of interactions.
  • json_array - The payload is a full JSON document, but just has an array of interactions.
  • json_new_line - The payload is NOT a full JSON document. Each interaction is flattened and separated by a line break.

If you omit this parameter or set it to json_meta, your output consists of JSON metadata followed by a JSON array of interactions (wrapped in square brackets and separated by commas).

Take a look at our Sample Output for File-Based Connectors page.

If you select json_array, DataSift omits the metadata and sends just the array of interactions.

If you select json_new_line, DataSift omits the metadata and sends each interaction as a single JSON object.
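
For illustration, the three formats have roughly these shapes, where {...} stands for a single interaction object (the metadata field names in json_meta are indicative only; see the Sample Output page for the exact fields):

    json_meta:      {"count": 2, "hash": "...", "interactions": [{...}, {...}]}
    json_array:     [{...}, {...}]
    json_new_line:  {...}
                    {...}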

output_params.auth.access_key
required
The access key for the S3 account that DataSift will send to.

Make sure that this value is properly encoded, otherwise your /push/create request will fail.

Please create custom credentials to ensure that access to your Amazon S3 account is restricted.
output_params.auth.secret_key
required
The secret key for the S3 account that DataSift will send to.

Make sure that this value is properly encoded, otherwise your /push/create request will fail.

Please create custom credentials to ensure that access to your Amazon S3 account is restricted.
output_params.delivery_frequency
required

The minimum number of seconds you want DataSift to wait before sending data again.


output_params.max_size
required

 

The maximum amount of data that DataSift will send in a single batch:

  • 102400 (100KB)
  • 256000 (250KB)
  • 512000 (500KB)
  • 1048576 (1MB)
  • 2097152 (2MB)
  • 5242880 (5MB)
  • 10485760 (10MB)
  • 20971520 (20MB)
output_params.bucket
required
The bucket within that account into which DataSift will deposit the file.
output_params.directory
optional

An optional directory name in the bucket.

The directory must already exist.

output_params.acl
required

The access level of the file after it is uploaded to S3:

  • private (Owner-only read/write)
  • public-read (Owner read/write, public read)
  • public-read-write (Public read/write)
  • authenticated-read (Owner read/write, authenticated read)
  • bucket-owner-read (Bucket owner read)
  • bucket-owner-full-control (Bucket owner full control)
output_params.file_prefix
optional
default = DataSift

An optional prefix to the filename. Each time DataSift delivers a file, it constructs a name in this format:

    file_prefix + subscription id + timestamp.json
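
For example, with the default prefix and a Unix timestamp, a delivered file might be named something like DataSift-<subscription id>-1372852371.json (the separators and timestamp here are illustrative).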

 

Data format delivered: 

JSON document.

Storage type: 

For each delivery, DataSift sends one file containing all the available interactions.

Limitations: 

Take care when you set the max_size and delivery_frequency output parameters. If your stream generates data faster than you permit delivery, the buffer will fill up until it reaches the point where data may be discarded.
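
For example, with max_size set to 1048576 (1MB) and delivery_frequency set to 60 seconds, DataSift can deliver at most roughly 17KB of data per second on average; a stream that sustains a higher rate than that will eventually overflow the buffer.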