Amazon S3

We have also set up an Amazon S3 bucket to which we will push files comprising batches of compliance events. This bucket will also contain a small number of historical interactions.

Bucket structure

Each source in our public social firehoses has its own directory at the top level. Within each of these top-level directories, we have a directory for each year (yyyy), month (mm) and then day (dd). For example:

Disqus:

    2018
        04
            29
            30
        05
            01
            02

Wordpress:

    2018
        04
            29
            30
        05
            01
            02

This structure makes it easy to process data from any point in time. The day refers to the UTC date on which we push the data into S3. This is not the date of the interaction, so you may need to process files on either side of a day boundary.
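
As a sketch of how the layout maps to object key prefixes, the snippet below builds the prefixes to scan when you want every event for a given UTC date, including the neighbouring days to cover that boundary. The helper name prefixes_for is ours, purely for illustration; the lowercase source segment follows the example path shown further down this page.

    from datetime import date, timedelta

    def prefixes_for(source, day):
        # Directories reflect the UTC date a batch was pushed, not the date of
        # the interaction, so scan the surrounding days as well.
        for offset in (-1, 0, 1):
            d = day + timedelta(days=offset)
            yield f"{source}/{d:%Y}/{d:%m}/{d:%d}/"

    print(list(prefixes_for("wordpress", date(2018, 5, 25))))
    # ['wordpress/2018/05/24/', 'wordpress/2018/05/25/', 'wordpress/2018/05/26/']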

Within each day directory, we push batches of newline-delimited JSON files. This is the same technique we use for our Push delivery system. We will deliver multiple files each day. A file will not be modified after it has been uploaded to the S3 bucket, which allows customers to read existing files while we push new compliance messages to the feed in close to real time.
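
For example, once a batch file has been downloaded it can be processed one line at a time; the local filename below is a placeholder and the handling of each event is left as a comment:

    import json

    # Each batch file contains one JSON compliance event per line.
    with open("deletes-12345678.json") as batch:  # placeholder local filename
        for line in batch:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            # ...apply the compliance event to your stored copy of the data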

An example Wordpress compliance file for 25th May 2018 looks like this:

/wordpress/2018/05/25/deletes-12345678.json

The content of the file is newline-delimited JSON, with one delete event per line. For self-hosted Wordpress blogs, there might be a few extra content types, such as jetpack-portfolio or jetpack-testimonial (see the Jetpack documentation); delete events for these content types all share the same structure.
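
As an illustrative sketch only, the content of such a file might then look like the following, where the field names (action, content_type, id) are placeholders rather than the exact schema:

    {"action": "delete", "content_type": "post", "id": "1234567890"}
    {"action": "delete", "content_type": "jetpack-portfolio", "id": "1234567891"}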

The bucket data is available on request by contacting [email protected]. You can use the AWS SDKs to read this data, or consume it manually using a UI tool.
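
For example, using the AWS SDK for Python (boto3), you could list and read one day of Wordpress files like this; the bucket name is a placeholder for the one shared with you:

    import boto3

    s3 = boto3.client("s3")
    bucket = "example-compliance-bucket"   # placeholder: use the bucket shared with you
    prefix = "wordpress/2018/05/25/"       # one UTC day of Wordpress compliance files

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
            for line in body.iter_lines():  # one JSON event per line
                if line:
                    print(obj["Key"], line.decode("utf-8"))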