ElasticSearch

Updated on Wednesday, 28 May, 2014 - 17:58

ElasticSearch is a schema-free, document-oriented search engine. It is designed to scale easily from one server to hundreds of servers, and it speaks HTTP and REST out of the box. DataSift can deliver interactions straight into your ElasticSearch server, but it is your responsibility to set it up.

Configuring ElasticSearch for Push delivery

To use ElasticSearch with Push, follow the instructions below, skipping the steps you have already completed. We use Debian and Ubuntu Linux distributions for our examples, but the principles apply to any operating system:

  1. Update your operating system. Refer to your system's specific instructions for this step. Debian and Ubuntu users can do it using:

    sudo apt-get update
     
  2. Install Java:

    sudo apt-get install openjdk-6-jre-headless
     
  3. Download the .tar.gz sources of the latest stable version of ElasticSearch (replace xx.xx.xx with the latest stable version number):

    wget https://github.com/downloads/elasticsearch/elasticsearch/elasticsearch-xx.xx.xx.tar.gz

    Please note that the link will change with each new version. For the latest version of ElasticSearch always go to the download page.
     
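  4. Unpack the sources. The commands in this step and the next use version 0.19.11 as an example; substitute the version you downloaded: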

    tar zxvf elasticsearch-0.19.11.tar.gz
     
  5. Change the current working directory to the newly created ElasticSearch source directory:

    cd elasticsearch-0.19.11
     
  6. Start ElasticSearch:

    sudo bin/elasticsearch
     
  7. Try to reach your ElasticSearch server from the outside:

    curl http://elasticsearch.example.com:9200/

    The server should reply with a JSON document that contains a quote from "The Hitchhiker's Guide to the Galaxy". If you cannot reach your ElasticSearch server, check your firewall settings. The firewall must allow incoming TCP connections to port 9200. If problems persist, check that your firewall is letting outbound packets pass. Most stateful firewalls, such as iptables and pf, operate in that mode by default. For security, block public access to port 9200 on your ElasticSearch server and open it only to your own network and to connections from the IP address range managed by DataSift.
     
  8. You are now ready to set up the ElasticSearch connector.
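
If the curl test in step 7 fails, it can help to separate firewall problems from ElasticSearch problems by testing only whether port 9200 accepts TCP connections. The following is a minimal sketch in Python that uses only the standard library. The function name port_is_open is a hypothetical helper, not part of ElasticSearch or DataSift, and it should be run from a machine outside your network.

```python
import socket


def port_is_open(host, port=9200, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and failed host name lookups.
        return False
```

For example, port_is_open("elasticsearch.example.com") returning False while ElasticSearch is running locally suggests that a firewall is blocking port 9200 rather than that the server is down.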
     

Configuring Push for delivery to ElasticSearch

  1. To enable delivery, you will need to define a stream or a Historics query. Both return important details required for a Push subscription: a successful stream definition returns a hash, and a Historics query returns a Historics id. You will need one of them (but not both) to set the value of the hash or historics_id parameter in a call to /push/create. To obtain that information, make a call to /push/get or /historics/get, or use the DataSift dashboard.
     
  2. Once you have the stream hash or the Historics id, pass it to /push/create. You can make that call with curl or with any other programming language or tool.

    For more information, read the step-by-step guide to the API to learn how to use Push with DataSift's APIs.
     
  3. When a call to /push/create is successful, you will get a response that contains a Push subscription id. You will need that id to make successful calls to all other Push API endpoints (/push/delete, /push/stop, and others). You can retrieve the list of your subscription ids with a call to /push/get.
     
  4. You should now check that the data is being delivered to your server. Ask your ElasticSearch server how many matching documents it holds:

    curl http://elasticsearch.example.com:9200/es_index/es_type/_count?q=type:twitter

    Execute the command shown above every couple of minutes and check the changes to the value of the count field. When DataSift is able to connect and deliver interactions to your ElasticSearch server, it adds each interaction as a separate document, so the document count should increase over time, provided that your stream includes Twitter messages.

    Please remember that there might be a small delay between the time you create a Push subscription and the time your ElasticSearch server receives data. A longer delay might mean that the stream has no data in it or that there is a problem with your server's configuration. If you want more information, make a call to /push/log and check the value of the success field. If it is set to failure, check the value of the message field for clues. Also, make a call to /push/get and see whether the response shows that DataSift is retrying delivery to your data delivery destination. When the status field is set to retrying, you should verify that your server is able to receive data.

    Please make sure that you watch your usage and add funds to your account when it is running low. Also, stop any subscriptions that are no longer needed; otherwise, you will be charged for their usage. There is no need to delete them. You can have as many stopped subscriptions as you like without paying for them. Remember that any subscriptions that were paused automatically due to insufficient funds will resume when you add funds to your account.
     
  5. To stop delivery, call /push/stop. To remove your subscription completely, call /push/delete.
     
  6. Familiarize yourself with the output parameters (for example, the host name) you'll need to know when you send data to an ElasticSearch server.
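
The delivery check in step 4 can also be scripted. The sketch below, in Python with only the standard library, builds the same _count URL as the curl command and reads the count field from the JSON reply. The names count_url, parse_count, and fetch_count are hypothetical helpers, and es_index and es_type are the placeholder index and type names from the curl example.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def count_url(host, index, doc_type, query, port=9200):
    """Build the _count URL for an index and type, for example query='type:twitter'."""
    path = "/%s/%s/_count" % (quote(index, safe=""), quote(doc_type, safe=""))
    # urlencode percent-encodes the query string, so ':' becomes '%3A'.
    return "http://%s:%d%s?%s" % (host, port, path, urlencode({"q": query}))


def parse_count(body):
    """Extract the integer 'count' field from a _count JSON response body."""
    return int(json.loads(body)["count"])


def fetch_count(url, timeout=10):
    """Request the URL and return the document count reported by the server."""
    with urlopen(url, timeout=timeout) as response:
        return parse_count(response.read().decode("utf-8"))
```

Calling fetch_count twice, a couple of minutes apart, and comparing the two values gives the same information as repeating the curl command by hand: a growing count means that interactions are arriving.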

Notes

  • Twitter sends delete messages which identify Tweets that have been deleted. Under your licensing terms, you must process these delete messages and delete the corresponding Tweets from your storage. If you're using ElasticSearch, DataSift handles those messages for you and deletes the relevant Tweets automatically from your ElasticSearch database.
     
  • The ElasticSearch Connector does not support output throttling and it is your responsibility to ensure that your ElasticSearch server can cope with the amount of data sent to it by DataSift.

 

Responding quickly

When receiving a batch of data, your server must respond with a success message within 10 seconds. Otherwise, the call will time out, and the delivery will be considered a failure and reattempted. Please make sure your server can process and store the data fast enough.

Output parameters

output_params.host (required)
    The host name of your ElasticSearch installation.

output_params.port (optional, default = 9200)
    You can specify a port or accept the default.

output_params.index (optional)
    The ElasticSearch index that you want to use. If it does not exist, DataSift will create it for you and set its name to the Push subscription id. Use a valid index name.

output_params.type (optional, default = interactions)
    The type that you want to use for the index. If it does not exist, DataSift will add a new type for the index and use the new type. Use a valid type name. For more information on ElasticSearch mapping types, read the documentation.

output_params.format (optional, default = basic_interaction_meta)
    The output format for your data:
      • basic_interaction_meta - The current default format, where each payload contains only a basic interaction JSON document.
      • full_interaction_meta - The payload is a full interaction with augmentations.
    Take a look at our Sample Output for Database Connectors page.
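
To show how these parameters fit together, here is a minimal Python sketch that assembles the form fields for a /push/create call. Several things in it are assumptions made for illustration: the helper name build_push_params, the subscription name, the field names name, hash, and historics_id, and the output_type value "elasticsearch". Check the /push/create documentation for the exact names and values. The sketch only builds and encodes the fields; it does not send the request or handle authentication.

```python
from urllib.parse import urlencode


def build_push_params(name, host, stream_hash=None, historics_id=None,
                      port=9200, index=None, doc_type="interactions",
                      data_format="basic_interaction_meta"):
    """Assemble /push/create fields for delivery to ElasticSearch."""
    # Exactly one of the stream hash or the Historics id must be supplied.
    if (stream_hash is None) == (historics_id is None):
        raise ValueError("supply either stream_hash or historics_id, not both")
    params = {
        "name": name,                    # assumed field name
        "output_type": "elasticsearch",  # assumed connector identifier
        "output_params.host": host,
        "output_params.port": str(port),
        "output_params.type": doc_type,
        "output_params.format": data_format,
    }
    # The index is optional, so it is only sent when the caller sets it.
    if index is not None:
        params["output_params.index"] = index
    if stream_hash is not None:
        params["hash"] = stream_hash
    else:
        params["historics_id"] = historics_id
    return params


# Hypothetical values; "<stream hash>" is a placeholder, not a real hash.
fields = build_push_params("my-es-subscription", "elasticsearch.example.com",
                           stream_hash="<stream hash>", index="es_index")
body = urlencode(fields)  # form-encoded body for the POST request
```

Keeping the defaults in the function signature mirrors the list above: a caller who supplies only a name, a host, and a stream hash or Historics id gets port 9200, the interactions type, and the basic_interaction_meta format.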

 

Data format delivered: 

ElasticSearch native format (JSON). Each interaction is stored as one document.

Storage type: 

One document per interaction.

Limitations: 

By default, ElasticSearch limits the size of an HTTP request to 100MB, which is much larger than the ~1KB interactions we have seen so far. You can change the limit by adjusting the http.max_content_length parameter in your ElasticSearch configuration file (elasticsearch.yml).