CouchDB

Updated on Wednesday, 28 May, 2014 - 17:53

CouchDB is a document-oriented NoSQL database that supports replication and versioning. Documents in CouchDB are JSON objects that can store any type of data. They do not have to adhere to the same schema; each document can have a different internal structure if that is what you want. This makes CouchDB a natural store for interactions, because all DataSift interactions are JSON objects. Another useful feature of CouchDB is its API: all operations are performed through a REST API, which follows the same design principles as the DataSift REST API. Finally, CouchDB queries are written as MapReduce functions in JavaScript. All those technologies, formats, and protocols are familiar to anyone who has at least some experience of writing web applications.

Please be aware that CouchDB is an excellent database when your application is read intensive, that is, when it reads more data from the database than it writes to it. CouchDB needs extra care and planning when you write a lot of data to it. One way to overcome that problem is to use the CouchDB batch mode; the other is to use bulk document inserts. Neither is currently implemented in the connector, because DataSift needs to be sure that each interaction has been delivered to its destination. If DataSift does not have that information, it will try to deliver the undelivered interaction again.
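
To make the two write strategies concrete, here is a minimal sketch of what each request looks like when you write to your own CouchDB database yourself. It relies on CouchDB's _bulk_docs endpoint and the batch=ok query parameter. The helper name bulk_docs_body, the file docs.jsonl, and the DB_URL variable are hypothetical, and none of this describes how the DataSift connector delivers data.

```shell
#!/bin/sh
# Wrap JSON documents (one per line on stdin) in a _bulk_docs request body.
# Blank lines are skipped; the documents themselves are not validated.
bulk_docs_body() {
    awk 'BEGIN { printf "{\"docs\":[" }
         NF    { if (n++) printf ","; printf "%s", $0 }
         END   { print "]}" }'
}

# Hypothetical usage against your own database (not run here, needs network):
#
# Bulk insert, many documents in a single request:
#   bulk_docs_body < docs.jsonl |
#     curl -s -X POST -H 'Content-Type: application/json' -d @- "$DB_URL/_bulk_docs"
#
# Batch mode, one document, acknowledged before it is written to disk:
#   curl -s -X POST -H 'Content-Type: application/json' -d '{"a":1}' "$DB_URL?batch=ok"
```

In batch mode CouchDB acknowledges a write before the document is safely on disk. That makes it hard to confirm that an individual interaction was stored, which is the guarantee DataSift needs.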

Configuring CouchDB for Push delivery

To set up and use CouchDB for Push delivery, follow the instructions below, skipping the steps you have already completed. We are going to use Debian and Ubuntu Linux distributions for our examples, but the principles apply to any operating system:

  1. Update your operating system. Refer to your system's specific instructions for this step. Debian and Ubuntu users can do it using:

    sudo apt-get update
     
  2. Install CouchDB on your system. Make sure that you use CouchDB 1.3.x or later. The exact commands used for that purpose will differ from one operating system to another. Debian and Ubuntu users will issue the following command:

    sudo apt-get install couchdb

    Unless you want to build a custom installation of CouchDB, do not build CouchDB from the sources. Use system-specific CouchDB packages instead. For more help, refer to your system's documentation and the CouchDB installation guide.

    If the installation fails, check that your OS update was successful and try again.
     
  3. Add the following lines to /etc/couchdb/local.ini in the [httpd] section:

    port = 5984
    bind_address = 0.0.0.0
     
  4. Save the changes and restart CouchDB:

    sudo service couchdb stop && sudo service couchdb start
     
  5. Test that CouchDB is working and that you can reach it from the outside. You can do that with curl on a computer connected to the public internet:

    curl couchdb.example.com:5984

    The server should respond with a message similar to the one shown below:

    {"couchdb":"Welcome","version":"1.3.0"}

    If curl hangs or if the connection is rejected, check your firewall settings. The firewall must allow incoming TCP connections to port 5984. If problems persist, check that your firewall is letting outbound packets pass. Most stateful firewalls such as iptables and pf operate in that mode by default. For extra security, block public access to your CouchDB server and only open it for your own network and the connections from the IP address range managed by DataSift.
     
  6. Create a CouchDB admin user account using a name of your choice:

    curl -X PUT couchdb.example.com:5984/_config/admins/dsreceiver -d '"dsrpassword"'

    In this example, the username is dsreceiver and the password is dsrpassword. You will use this user's credentials to connect to your CouchDB instance via the DataSift CouchDB Connector. For more guidance, read the security notes.
     
  7. Create a database. This step is essential. If you fail to create a database, DataSift will not be able to initialize this connector.

    curl -X PUT http://dsreceiver:dsrpassword@couchdb.example.com:5984/datasift_couchdb
     
  8. You are now ready to set up the CouchDB connector.
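
The reachability test in step 5 and the version requirement in step 2 can be combined in a small shell sketch. The function names couchdb_version and version_ok are hypothetical helpers, and the pattern assumes a compact welcome response like the sample shown in step 5.

```shell
#!/bin/sh
# Extract the first "version" value from a CouchDB welcome response on stdin.
# Only the first match is used, because some packaged builds add a second
# "version" field that describes the vendor rather than CouchDB itself.
couchdb_version() {
    grep -o '"version":"[^"]*"' | head -n 1 | cut -d'"' -f4
}

# Succeed if the dotted version string in $1 is 1.3 or later.
version_ok() {
    [ -n "$1" ] || return 1
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 3 ]; }
}

# Example against your own server (left commented out, needs network access):
#   version_ok "$(curl -s couchdb.example.com:5984 | couchdb_version)" &&
#     echo "CouchDB is reachable and recent enough"
```

If the commented example prints nothing, either the server could not be reached or it reported a version older than 1.3.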
     

Configuring Push for delivery to CouchDB

  1. To enable delivery, you will need to define a stream or a Historics query. Both return important details required for a Push subscription: a successful stream definition returns a hash, and a Historics query returns an id. You will need one of them (but not both) to set the value of the hash or historic_id parameter in a call to /push/create. To obtain that information, make a call to /push/get or /historics/get, or use the DataSift dashboard.
     
  2. Once you have the stream hash or the Historics id, you can give that information to /push/create. You can make that call using curl, or any other programming language or tool.

    For more information on how to use Push with DataSift's APIs, read the step-by-step guide to the API.
     
  3. When a call to /push/create is successful, you will get a response that contains a Push subscription id. You will need that information to make successful calls to all other Push API endpoints (/push/delete, /push/stop, and others). You can retrieve the list of your subscription ids with a call to /push/get.
     
  4. You should now check that the data is being delivered to your server. One way to do it is to watch changes to the doc_count parameter:

    curl http://couchdb.example.com:5984/datasift_couchdb/
    {"db_name":"datasift_couchdb",
    "doc_count":1223,
    "doc_del_count":0,"update_seq":1223,"purge_seq":0,"compact_running":false,"disk_size":5906530,"instance_start_time":"1352388385481768","disk_format_version":5,"committed_update_seq":1223}


    When DataSift is able to connect and deliver interactions, you will see the doc_count value increase as new documents are added over time. Each interaction is a separate document.

    If you want to see what is being added you have to know the id of each document. CouchDB lets you browse its databases using a web browser. You can connect to the CouchDB Futon interface by going to:

    http://couchdb.example.com:5984/_utils/

    Please note that there may be a delay between the time you create your Push subscription and the time interactions are added to your CouchDB database. This might be because the stream has no data in it or because there is a problem with your server's configuration. If you want more information, make a call to /push/log and check the value of the success field. If it is set to failure, check the value of the message field for clues. Also, make a call to /push/get and see whether the response includes information about DataSift retrying to deliver data to your data delivery destination. When the status field is set to retrying, you should verify that your server is receiving data.

    Please make sure that you watch your usage and add funds to your account when it is running low. Also, stop any subscriptions that are no longer needed; otherwise you will be charged for their usage. There is no need to delete them. You can have as many stopped subscriptions as you like without paying for them. Remember that any subscriptions that were paused automatically due to insufficient funds will resume when you add funds to your account.
     
  5. To stop delivery, call /push/stop. To remove your subscription completely, call /push/delete.
     
  6. Familiarize yourself with the output parameters (for example, the host name) you'll need to know when you send data to a CouchDB server.
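
The doc_count check in step 4 can be scripted instead of read by eye. The sketch below is one possible approach; the function names doc_count and docs_added are hypothetical, and the pattern assumes compact JSON like the sample response shown in step 4.

```shell
#!/bin/sh
# Extract the numeric doc_count from a CouchDB database info document on stdin.
# Assumes compact JSON (no space after the colon), as in the sample response.
doc_count() {
    grep -o '"doc_count":[0-9]*' | head -n 1 | cut -d: -f2
}

# Print how many documents were added to a database over an interval.
# $1 = database URL, $2 = seconds to wait (default 60). Needs network access.
docs_added() {
    before=$(curl -s "$1" | doc_count)
    sleep "${2:-60}"
    after=$(curl -s "$1" | doc_count)
    if [ -z "$before" ] || [ -z "$after" ]; then
        echo "could not read doc_count from $1" >&2
        return 1
    fi
    echo $((after - before))
}

# Example (left commented out, needs network access to your own server):
#   docs_added http://couchdb.example.com:5984/datasift_couchdb/ 30
```

A result of 0 over several intervals on a stream that you know is producing data suggests that delivery is not reaching your server, in which case the /push/log and /push/get checks described in step 4 are the next place to look.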
     

Notes

  1. Twitter sends delete messages which identify Tweets that have been deleted. Under your licensing terms, you must process these delete messages and delete the corresponding Tweets from your storage. If you're using CouchDB, DataSift handles those messages for you and deletes the relevant Tweets automatically from your CouchDB database.
     
  2. The CouchDB Connector does not support output throttling and it is your responsibility to ensure that your CouchDB server can cope with the amount of data sent to it by DataSift.
     
  3. If you want to use SSL with CouchDB, you need to use CouchDB 1.3.x and read this document.

 

Responding quickly

When receiving a batch of data, your server must respond with a success message within 10 seconds. Otherwise, the call will time out, the delivery will be considered a failure, and DataSift will attempt it again. Please make sure your server can process and store the data fast enough.

 

Output parameters

 

Parameter (requirement, default) and description:

output_params.host (required)
  The host name of your CouchDB installation.

output_params.port (optional, default = 5984)
  You can specify a port or accept the default.

output_params.db_name (required)
  The name of an existing database.

output_params.auth.username (required)
  The username for authorization.

output_params.auth.password (required)
  The password for authorization.

output_params.use_ssl (optional, default = yes)
  Whether SSL should be used when connecting to the database. Possible values are:

  • yes
  • no

output_params.verify_ssl (optional, default = yes)
  If you are using SSL to connect, this specifies whether the certificate should be verified. Turning verification off is useful when you have a self-signed certificate for development. Possible values are:

  • yes
  • no

output_params.format (optional, default = basic_interaction_meta)
  The output format for your data:

  • basic_interaction_meta - The current default format, where each payload contains only the basic interaction JSON document.
  • full_interaction_meta - The payload is a full interaction with augmentations.

Take a look at our Sample Output for Database Connectors page.
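
To show how these output parameters fit into a call to /push/create, here is a hedged sketch using curl. Only the output_params names and the hash parameter come from this page. The API base URL, the Authorization header format, and the name and output_type fields are assumptions, so check them against the Push API documentation before relying on them. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumptions, not taken from this page: the API base URL, the
# "Authorization: username:api_key" header, and the "name" and "output_type"
# fields. Check all of them against the Push API documentation.
API_BASE="https://api.datasift.com/v1"

# Print one form field per line for a CouchDB Push subscription.
# $1 = stream hash, $2 = CouchDB host, $3 = database name,
# $4 = CouchDB username, $5 = CouchDB password
couchdb_push_fields() {
    # use_ssl=no matches the plain HTTP setup used in the examples above;
    # verify_ssl is omitted because it only applies when SSL is in use.
    printf '%s\n' \
        "name=couchdb-example" \
        "hash=$1" \
        "output_type=couchdb" \
        "output_params.host=$2" \
        "output_params.port=5984" \
        "output_params.db_name=$3" \
        "output_params.auth.username=$4" \
        "output_params.auth.password=$5" \
        "output_params.use_ssl=no" \
        "output_params.format=basic_interaction_meta"
}

# Send the fields to /push/create. Defined but not called here, because it
# needs network access and real credentials.
# $1 = "username:api_key", followed by the five arguments listed above.
create_couchdb_subscription() {
    auth=$1
    shift
    fields=$(couchdb_push_fields "$@")
    set --
    while IFS= read -r field; do
        set -- "$@" --data-urlencode "$field"
    done <<EOF
$fields
EOF
    curl -sS -X POST "$API_BASE/push/create" -H "Authorization: $auth" "$@"
}
```

Keeping the field list separate from the network call lets you print and review the parameters before anything is sent. For a Historics query you would replace the hash field with the Historics id parameter described earlier.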

 

Data format delivered: 

CouchDB native format (JSON). Each interaction is stored as one document.

Storage type: 

One document per interaction.

Limitations: 

There is no set limit to the number of values or the amount of data that a document can hold.