ElasticSearch is a schema-free, document-oriented search engine. It is very easy to scale from one server to hundreds. DataSift can deliver interactions straight into your ElasticSearch server.
Configuring ElasticSearch for Push delivery
To use ElasticSearch with Push, follow the instructions below, skipping the steps you have already completed. We use Debian and Ubuntu Linux distributions for our examples, but the principles apply to any operating system:
Update your operating system. Refer to your system's specific instructions for this step. Debian and Ubuntu users can do it using:
sudo apt-get update
Install a Java runtime, which ElasticSearch needs in order to run:
sudo apt-get install openjdk-6-jre-headless
Download the .tar.gz sources of the latest stable version of ElasticSearch (replace xx.xx.xx with the latest stable version number).
Please note that the link will change with each new version. For the latest version of ElasticSearch always go to the download page.
Unpack the sources:
tar zxvf elasticsearch-0.19.11.tar.gz
Change the current working directory to the newly created ElasticSearch source directory:
cd elasticsearch-0.19.11
Try to reach your ElasticSearch server from the outside:
The server should reply with a JSON document that contains a quote from "The Hitchhiker's Guide to the Galaxy". If you cannot reach your ElasticSearch server, check your firewall settings. The firewall must allow incoming TCP connections to port 9200. If problems persist, check that your firewall is letting outbound packets pass. Most stateful firewalls, such as iptables and pf, operate in that mode by default. For security, block public access to that port on your ElasticSearch server and only open it for your own network and for connections from the IP address range managed by DataSift.
- You are now ready to set up the ElasticSearch connector.
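The reachability check described above can be sketched as follows. This is a minimal illustration, not an official DataSift tool: the host name elasticsearch.example.com is the placeholder used in the /push/create example, and the helper names es_url and es_reachable are hypothetical.

```shell
# Hypothetical values: substitute your own host name and port.
ES_HOST="${ES_HOST:-elasticsearch.example.com}"
ES_PORT="${ES_PORT:-9200}"

# Build the base URL of the ElasticSearch HTTP interface.
es_url() {
    printf 'http://%s:%s/' "$1" "$2"
}

# Return 0 if the server answers an HTTP request within 10 seconds.
es_reachable() {
    curl --silent --fail --max-time 10 "$(es_url "$1" "$2")" > /dev/null
}

# Usage (run manually from a machine outside your own network):
#   es_reachable "$ES_HOST" "$ES_PORT" && echo "reachable" || echo "not reachable"
```

If the check fails from outside but succeeds on the server itself, the firewall rules for port 9200 are the most likely cause.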
Configuring Push for delivery to ElasticSearch
To enable delivery, you will need to define a stream or a Historics query. Both return important details required for a Push subscription: a successful stream definition returns a hash, and a Historics query returns a Historics id. You will need either (but not both) to set the value of the hash or historic_id parameter in a call to /push/create. To obtain that information, make a call to /push/get or /historics/get, or use the DataSift dashboard.
Once you have the stream hash or the Historics id, you can give that information to /push/create. In the example below we are making that call using curl, but you are free to use any programming language or tool.
curl -X POST 'https://api.datasift.com/v1.4/push/create' \
    -d 'name=connectorelasticsearch' \
    -d 'hash=42d388f8b1db997faaf7dab487f11290' \
    -d 'output_type=elasticsearch' \
    -d 'output_params.host=elasticsearch.example.com' \
    -d 'output_params.port=9200' \
    -d 'output_params.index=es_index' \
    -d 'output_params.type=es_type' \
    -H 'Authorization: datasift-user:your-datasift-api-key'
For more information, read the step-by-step guide to the API to learn how to use Push with DataSift's APIs.
When a call to /push/create is successful, you will get a response that contains a Push subscription id. You will need that information to make successful calls to all other Push API endpoints (/push/delete, /push/stop, and others). You can retrieve the list of your subscription ids with a call to /push/get.
You should now check that the data is being delivered to your server. Log in and query the document count of the index that you specified in the call to /push/create.
Run that query every couple of minutes and check the changes to the value of the count field. When DataSift is able to connect and deliver interactions to your ElasticSearch server, it will add each interaction as a separate document; therefore, the value of the document count should increase over time, unless you are not filtering for Twitter messages.
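One way to watch the document count is sketched below. It assumes that your ElasticSearch version exposes the _count endpoint under the index and type path, that the server listens on localhost:9200, and that the index and type are the es_index and es_type placeholders from the /push/create example. The helper names parse_count and document_count are hypothetical.

```shell
# Hypothetical values taken from the /push/create example; substitute your own.
ES_URL="${ES_URL:-http://localhost:9200}"
ES_INDEX="${ES_INDEX:-es_index}"
ES_TYPE="${ES_TYPE:-es_type}"

# Print the numeric "count" value from a _count response read on stdin.
parse_count() {
    sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Ask ElasticSearch how many documents the index/type currently holds.
document_count() {
    curl --silent "$ES_URL/$ES_INDEX/$ES_TYPE/_count" | parse_count
}

# Usage: print the count once a minute and watch it grow.
#   while true; do document_count; sleep 60; done
```

The sed expression is only a lightweight stand-in for a JSON parser; it is adequate for the small, flat response that a count query returns.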
Please remember that there might be a small delay between the time you create a Push subscription and the time your ElasticSearch server receives data. If there is a longer delay, this might be because the stream has no data in it or because there is a problem with your server's configuration. If you want more information, make a call to /push/log and check the value of the success field. If it is set to failure, check the value of the message field for clues. Also, make a call to /push/get and see if the response includes information about DataSift retrying to deliver data to your data delivery destination. When the status field is set to retrying, you should verify that your server is receiving data.
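The troubleshooting calls above might look like the following sketch. It rests on several assumptions: that /push/log and /push/get accept the subscription id as an id parameter on a GET request, that the credentials use the same Authorization header as the /push/create example, and that the response is a flat JSON object. The helper names push_call and json_string_field are hypothetical.

```shell
# Hypothetical credentials: substitute your own user name and API key.
DS_AUTH="${DS_AUTH:-datasift-user:your-datasift-api-key}"
API_BASE='https://api.datasift.com/v1.4'

# Call a Push endpoint (for example "log" or "get") for one subscription.
# Assumption: the endpoint accepts the subscription id as an "id" parameter.
push_call() {
    curl --silent --get "$API_BASE/push/$1" \
        --data-urlencode "id=$2" \
        -H "Authorization: $DS_AUTH"
}

# Print the value of a string-valued field from a flat JSON object on stdin.
json_string_field() {
    sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" | head -n 1
}

# Usage (SUBSCRIPTION_ID is the id returned by /push/create):
#   push_call get "$SUBSCRIPTION_ID" | json_string_field status
#   push_call log "$SUBSCRIPTION_ID" | json_string_field message
```

For nested or multi-entry responses, a real JSON parser is a better choice than the sed helper shown here.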
Please make sure that you watch your usage and add funds to your account when it is running low. Also, stop any subscriptions that are no longer needed; otherwise, you will be charged for their usage. There is no need to delete them: you can have as many stopped subscriptions as you like without paying for them. Remember that any subscriptions that were paused automatically due to insufficient funds will resume when you add funds to your account.
- Familiarize yourself with the output parameters (for example, the host name) you'll need to know when you send data to an ElasticSearch server.
- The ElasticSearch Connector does not support output throttling and it is your responsibility to ensure that your ElasticSearch server can cope with the amount of data sent to it by DataSift.
When receiving a batch of data, your server must respond with a success message within 10 seconds. Otherwise, the call will time out, and the delivery will be considered a failure and reattempted. Please make sure your code can process and store the data fast enough.
output_params.host: The host name of your ElasticSearch installation.
output_params.port (default = 9200): You can specify a port or accept the default.
output_params.index: The ElasticSearch index that you want to use. If it does not exist, DataSift will create it for you and set its name to the Push subscription id. Use a valid index name.
output_params.type (default = interactions): The type that you want to use for the index. If it does not exist, DataSift will add a new type for the index and use the new type. Use a valid type name. For more information on ElasticSearch mapping types, read the documentation.
Output format (default = basic_interaction_meta): The output format for your data. Take a look at our Sample Output for Database Connectors page.
Data format delivered:
ElasticSearch native format (JSON). Each interaction is stored as one document.
One document per interaction.
By default, ElasticSearch limits the size of an HTTP request to 100 MB, which is much larger than the ~1 KB interactions we have seen so far. You can change the limit by adjusting the http.max_content_length parameter in the appropriate ElasticSearch configuration file.
Please refer to the ElasticSearch documentation.
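As an illustration of adjusting http.max_content_length, the sketch below updates the setting in a YAML configuration file. The path config/elasticsearch.yml and the value 200mb in the usage note are assumptions, and the helper name set_max_content_length is hypothetical. Restart ElasticSearch afterwards so that the new value takes effect.

```shell
# Set http.max_content_length in the given ElasticSearch YAML config file.
# Replaces an existing uncommented setting, or appends one if none exists.
set_max_content_length() {
    config_file="$1"
    new_value="$2"
    if grep -q '^http\.max_content_length:' "$config_file"; then
        # Rewrite through a temporary file, a portable alternative to sed -i.
        sed "s/^http\.max_content_length:.*/http.max_content_length: $new_value/" \
            "$config_file" > "$config_file.tmp" && mv "$config_file.tmp" "$config_file"
    else
        printf 'http.max_content_length: %s\n' "$new_value" >> "$config_file"
    fi
}

# Usage (assumed path, relative to the ElasticSearch directory):
#   set_max_content_length config/elasticsearch.yml 200mb
```

Running the function twice leaves a single entry in the file, so it is safe to reapply with a different value.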