This connector is available only to customers on the Enterprise Edition.

 

Google Cloud Platform is an Infrastructure as a Service (IaaS) offering comprising multiple services for hosting applications, storing data, and computing. Google BigQuery is the Google Cloud Platform service for querying billions of rows of data within tables.

Google BigQuery is a web service with a REST API. When BigQuery is used as a data destination, streams of interactions are queued and sent to BigQuery tables at 90-second intervals. Each interaction becomes a new table row.

 

Creating a BigQuery project

To use Google BigQuery with Push delivery, follow the instructions below, skipping any steps you may already have completed. It does not matter which operating system you use as long as you can connect to the internet:

  1. Go to the Google Developers Console and log in using existing credentials or create a new account.
     
  2. Create a new Google API project or use an existing one.

    Projects are the top-level container for all Google Cloud Platform services. The console opens on a pre-configured default API Project.
     
  3. To create a new project, click the API Project link, which returns you to the top-level menu.
     
  4. Click the CREATE PROJECT button.
     
  5. When using a new account, or an existing account with Google Cloud services for the first time, SMS verification may be required. Click Continue and follow the instructions to receive a verification code on a mobile phone, then enter the code into the web page.
     
    When SMS verification is complete, enter a Project name. A Project ID is generated automatically; click the refresh button in the Project ID field to generate a different one. Click Create to continue.
     
    From the project overview page, make note of the Project ID and Project Number. Both are required when configuring the DataSift platform to use BigQuery as a destination.
 

Enabling billing

You can store a small amount of data in BigQuery at no charge. However, even this free usage is not available until billing information has been entered.

  1. Complete billing information by going to the project summary and clicking Settings.
     
  2. Complete the billing information form. When it is complete, the Enable billing button changes to Disable billing.

For guidance on how to add your Google Account to your company's business account, consult the Google developer documentation.

 

Retrieving authentication details

DataSift requires authentication credentials to send data to a BigQuery table. Authentication uses a public/private key pair.

  1. To generate the credentials and public/private keys, click APIs & auth then Credentials from the project summary.
     
  2. Click the CREATE NEW CLIENT ID button.
     
  3. Select Service account and click Create client ID.

    The new public and private keys are created. The private key is automatically downloaded with a .p12 filename extension.
     
  4. The private key password is also displayed. Make a note of the password as it is not shown again. When ready, click Okay, got it.
     
  5. The new credentials are displayed. Make a note of the Client ID and Email address, or download the JSON object.
     
  6. The following credentials are required to configure BigQuery as a data destination:

    Client ID
    Email address
    p12 key file

 

Configuring datasets

  1. From the project summary page, click the BigQuery link. A new browser window opens with the BigQuery cloud.
     
  2. The BigQuery console opens with example datasets. Configure a new dataset for the DataSift platform to use when sending the stream of interactions. The stream automatically creates a new table in the dataset.
     
  3. Click the project menu and select Create new dataset.
     
  4. Enter a Dataset ID using only letters, numbers, and underscores.

 

Configuring the DataSift web application for BigQuery 

The dataset and credential information created in the previous section are used to configure the BigQuery destination in the DataSift platform. This section looks at the web application configuration.

 

Configuring a BigQuery Destination

  1. Open the DataSift platform web application by logging in at datasift.com. From the Data Destinations page, click the '+' symbol on the Google BigQuery tile.
     
  2. Complete the New Google BigQuery Destination form with the following information:
    Label: This name is used only in the web application. You can define multiple BigQuery destinations that point to different projects, datasets, and tables; use the label to distinguish them.
    Table ID: The name of a table, which is created automatically in the chosen dataset.
    Dataset ID: The dataset in which to create the table. This dataset must already exist in BigQuery.
    Project ID: The numeric project number noted earlier.
    Client ID: The client ID generated when the service account credentials were created.
    Service account email address: The email address generated when the service account credentials were created.
    P12 Key file: The private key file that was automatically downloaded when the service account credentials were created.

The new destination appears in My Destinations. Notice how multiple instances of BigQuery destinations are referenced by their label.

 

Configuring stream tasks using BigQuery

  1. From the Streams page, create or select a stream to use with the Google BigQuery destination.

    A task is created using a live stream recording or historic data. In this example, a live recording is used. Select Record Stream from the stream summary.
     
  2. Configure start & end times and give the recording a name. Click Continue.
     
  3. Select a destination. The new BigQuery destination is available for selection. Click Continue.
     
  4. Check the details and confirm by clicking Start Task.
     
  5. The stream sends interactions to Google BigQuery which automatically creates a table using the name provided in the destination configuration. In this example, the Starbucks_Table1 table has been created and the schema is displayed.
     
  6. On the Table Details page, click the Details button to see the size of the table in bytes and rows.
     
  7. If an end date and time was not specified when configuring the task, remember to Pause or Stop the recording task in the DataSift web application when enough records have been received.

 

Configuring Push for BigQuery delivery

  1. To enable delivery, you will need to define a stream or a Historics query. Both return details required for a Push subscription: a successful stream definition returns a hash, and a Historics query returns an id. You will need one of them (but not both) to set the value of the hash or historic_id parameter in a call to /push/create.
     
  2. Once you have the stream hash or the Historics id, you can give that information to /push/create. You can make the call with curl or any other programming language or tool; a sketch of the call in Python appears at the end of this step.

    Make sure that the P12 key file is Base64-encoded and then URL-encoded to remove URL-unsafe characters. This can be done with a short Python script along the following lines (the key file name is an example):
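
        # A minimal sketch, assuming Python 3; the key file name is an example.
        # Read the raw .p12 key, Base64-encode it, then URL-encode the result
        # so it is safe to pass as an API parameter value.
        import base64
        import urllib.parse

        with open("credentials.p12", "rb") as key_file:
            raw_key = key_file.read()

        b64_key = base64.b64encode(raw_key).decode("ascii")
        url_safe_key = urllib.parse.quote_plus(b64_key)

        print(url_safe_key)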

    For more information, read the step-by-step guide to the API.
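
    As a sketch only, the same /push/create call could be scripted in Python with the requests library. The API version prefix, the output_type value, and the Authorization header format below are assumptions to verify against the DataSift API documentation; all credential values are placeholders.

        # Hypothetical sketch of a /push/create call for the BigQuery connector.
        import requests

        params = {
            "name": "My BigQuery subscription",                  # label for this subscription
            "hash": "YOUR_STREAM_HASH",                          # or use historic_id for a Historics query
            "output_type": "bigquery",                           # assumed connector name
            "output_params.auth.client_id": "YOUR_CLIENT_ID",
            "output_params.auth.service_account": "YOUR_SERVICE_ACCOUNT_EMAIL",
            "output_params.auth.key_file": "ENCODED_P12_KEY",    # output of the encoding script above
            "output_params.project_id": "YOUR_PROJECT_NUMBER",
            "output_params.dataset_id": "your_dataset",
            "output_params.table_id": "interactions",
        }

        response = requests.post(
            "https://api.datasift.com/v1/push/create",           # assumed endpoint URL
            data=params,
            headers={"Authorization": "your_username:your_api_key"},   # assumed auth format
        )
        print(response.json())   # a successful response contains the Push subscription id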
     
  3. When a call to /push/create is successful, you will receive a response that contains a Push subscription id. You will need that id to make calls to the other Push API endpoints (/push/delete, /push/stop, and others). You can retrieve the list of your subscription ids with a call to /push/get.
     
  4. You should now check that the data is being delivered to your Google BigQuery dataset. Log in to your Google API account and switch to the correct project.

    Once you are in the correct project, examine the table. When you click the table name (interactions in this example), you will see a query field and a long list of fields you can filter on.

    Note that you can only use leaf fields for filtering. A leaf field is a field that has no children; for example, interaction.content is a leaf field but interaction is not.
     
  5. When DataSift is able to connect and deliver interactions to your Google BigQuery dataset, it will populate the table listed as the value of the output_params.table_id output parameter. Each interaction is stored in a separate row.

    Please remember that there may be a delay before you receive data. If the delay is unexpectedly long, the stream may contain no data or there may be a problem with your configuration. In the first case, preview your stream using the DataSift web console; in the second, make a call to /push/log to look for clues.

    Please make sure that you watch your usage and add funds to your account when your balance runs low. Also, stop any subscriptions that are no longer needed; otherwise you will be charged for their usage. Remember that any subscriptions paused due to insufficient funds will resume when you add funds to your account. Keep in mind that Google bills you separately for your BigQuery usage.
     
  6. To stop delivery, call /push/stop. To remove your subscription completely, call /push/delete.
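
    As a sketch only, assuming the same placeholder credentials and endpoint conventions as the /push/create example above (the id parameter name is also an assumption to verify against the DataSift API documentation):

        # Hypothetical sketch: stop, then delete, a Push subscription.
        import requests

        headers = {"Authorization": "your_username:your_api_key"}
        subscription = {"id": "YOUR_SUBSCRIPTION_ID"}    # returned by /push/create

        requests.post("https://api.datasift.com/v1/push/stop", data=subscription, headers=headers)
        requests.post("https://api.datasift.com/v1/push/delete", data=subscription, headers=headers)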
     
  7. Familiarize yourself with the output parameters (for example, the client_id) you'll need when you send data to a Google BigQuery dataset.
     

Notes

  • Twitter sends delete messages which identify Tweets that have been deleted. Under your licensing terms, you must process these delete messages. Learn more...
     
  • We buffer your data inside DataSift and periodically write it into your Google BigQuery dataset. We control the throughput and our default throughput rate is 19.5GB per day in chunks of up to 20MB delivered every 90 seconds.
     
  • Google will charge you for BigQuery storage and usage. Consult Google BigQuery documentation for pricing information.
     

Output parameters

output_params.auth.client_id
required

Your Google API Access Client ID for the project you want to push interactions into. You can find it in the API Access tab on your Google API account page for the project you are working with. It starts with a unique project id and ends with .apps.googleusercontent.com.

output_params.auth.service_account
required

The email address created for the service account for the project you want to push interactions into. You can find it in the API Access tab on your Google API account page for the project you are working with. It starts with the numeric id and ends with @developer.gserviceaccount.com.

output_params.auth.key_file
required

The Base64-encoded, URL-encoded contents of a PKCS #12 (.p12) key file generated using the Google API Console.

output_params.project_id
required

The numeric project ID generated for you by Google. You can find it in the Overview tab for your project in the Google API Console. Look for Project Number.

output_params.dataset_id
required

The dataset id that you want to use to store your tables. The dataset must exist before you create a Push subscription to the Google BigQuery connector. Its name may only contain letters, numbers, and underscores.

output_params.table_id
optional

A unique name of the table you want to use to store interactions. If the table does not exist, DataSift will create it for you. If the table does exist but has an incorrect schema, the DataSift API will respond with an error message. The table name may only contain letters, digits, and underscores and must not start with a digit or an underscore.

This parameter is optional but if you omit it you must supply output_params.schema.

output_params.schema
optional

The full mapping file as a string. You can use base64 encoding if you want to.

This parameter is optional but if you omit it you must supply output_params.table_id.

 

Creating a .INI schema file

Optionally, the Google BigQuery Push connector allows you to create a .INI file that defines the mapping from the JSON elements in interactions coming from DataSift to the tables and columns that you plan to use in your database.

To use a .INI file, specify the output_params.schema parameter and omit the output_params.table_id parameter.

Otherwise, specify the output_params.table_id parameter and omit the output_params.schema parameter.

Read our Writing .INI Files page to learn about .INI files.

 

Technical summary

Data format delivered: Google BigQuery native format. Each interaction is stored as one table row.

Storage type: For each delivery, DataSift sends one file containing all the available interactions.

Limitations: Google BigQuery import limits and your Google BigQuery quotas.