MongoDB

Updated on Wednesday, 28 May, 2014 - 18:13

MongoDB is a scalable, high-performance, open source NoSQL document database popular in big data applications. Documents in MongoDB are JSON-like objects that can store any type of data. They do not have to adhere to the same schema: each document can have a different internal structure if that is what the user wants. Documents that do follow the same structural pattern can be grouped into collections, but that is not a requirement.

Configuring MongoDB for Push delivery

To set up and use MongoDB for Push delivery, follow the instructions below, skipping any steps you have already completed. The examples use the Debian and Ubuntu Linux distributions, but the principles apply to any operating system:

  1. Install an operating system on your machine.
     
  2. Update the system. Refer to your system's specific instructions for this step. Debian and Ubuntu users can do it using:

    sudo apt-get update
     
  3. Install MongoDB on your system. The exact commands used for that purpose will differ from one operating system to another. Debian and Ubuntu users will issue the following command:

    sudo apt-get install mongodb

    Unless you want a custom installation, do not build MongoDB from source; use your system's MongoDB packages instead. For more help, refer to your system's documentation and the MongoDB installation guide.

    If the installation fails, update the operating system (see the previous step) and try again.
     
  4. Change the bind_ip option in /etc/mongodb.conf to:

    bind_ip = 0.0.0.0
     
  5. Save the changes and restart MongoDB:

    sudo service mongodb stop && sudo service mongodb start
     
  6. Connect to the MongoDB server:

    mongo
     
  7. Create a new database:

    use datasiftmongodb
     
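  8. Add some data to the database. MongoDB does not actually create a database until something is stored in it, so without this step the database would not exist after you quit mongo: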

    a = { "name": "adam" }
    db.abc.save(a);
    db.abc.find();

     
  9. Quit mongo:

    exit
     
  10. Test that MongoDB is working and that you can reach it from the outside. You can do that with telnet on a computer connected to the public internet:

    telnet mongodb.example.com 27017

    If the connection succeeds, the telnet output should end with:

    Escape character is '^]'.

    Press Ctrl+C to close this connection.
     
  11. You are now ready to set up the MongoDB connector.
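
The telnet check in step 10 can also be scripted. The following is a minimal sketch in Python using only the standard library. The function name can_connect is illustrative, and the check only confirms that a TCP connection can be opened on the MongoDB port, not that MongoDB itself is healthy.

```python
import socket


def can_connect(host, port=27017, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and connects; it raises
        # OSError (which also covers timeouts and DNS failures) on error.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, can_connect("mongodb.example.com") uses the placeholder host name from step 10 and the default MongoDB port.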
     

Configuring Push for delivery to MongoDB

  1. To enable delivery, you will need to define a stream or a Historics query. Both return important details required for a Push subscription: a successful stream definition returns a hash, and a Historics query returns an id. You will need either (but not both) to set the value of the hash or historic_id parameter in a call to /push/create. To obtain that information, make a call to /push/get or /historics/get, or use the DataSift dashboard.
     
  2. Once you have the stream hash or the Historics id, you can pass it to /push/create. You can make that call using curl or any other tool or programming language.

    For more information, read the step-by-step guide to the API to learn how to use Push with DataSift's APIs.
     
  3. When a call to /push/create is successful, you will get a response that contains a Push subscription id. You will need that id to make successful calls to all other Push API endpoints (/push/delete, /push/stop, and others). You can retrieve the list of your subscription ids with a call to /push/get.
     
  4. You should now check that the data is being delivered to your server. Log in to the machine running your MongoDB server and use the mongo command to switch to the datasiftmongodb database and list collections created inside that database:

    mongo
    use datasiftmongodb
    show collections


    You should see one or more collections whose names start with DataSift_ followed by the Push subscription id. You can display vital statistics for a collection using the following command:

    db.DataSift_<subscription_id>.stats()

    You will get a JSON dictionary whose count field gives the current document count for the collection. Issue that command a few times over a couple of minutes and check that the count increases.

    If no new documents are added to the collection for a longer period, this might be because the stream has no data in it or because there is a problem with your server's configuration. In the first case, preview your stream using the DataSift web console; in the second case, make a call to /push/log to look for clues.

    Please watch your usage and add funds to your account when it is running low. Also, stop any subscriptions that are no longer needed; otherwise you will be charged for their usage. There is no need to delete them: you can have as many stopped subscriptions as you like without paying for them. Remember that any subscriptions that were paused automatically due to insufficient funds will resume when you add funds to your account.
     
  5. To stop delivery, call /push/stop. To remove your subscription completely, call /push/delete.
     
  6. Familiarize yourself with the output parameters (for example, the host name) you'll need to know when you send data to a MongoDB server.
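
The check described in step 4 (run stats() a few times and see whether count increases) can be automated. The following is a minimal sketch, assuming you supply a get_count callable that returns the current document count for the collection. The names sample_counts and is_growing are illustrative. If you use the pymongo driver, which this page does not cover, get_count could wrap a call such as collection.estimated_document_count().

```python
import time


def sample_counts(get_count, samples=3, interval=60.0, sleep=time.sleep):
    """Call get_count() `samples` times, waiting `interval` seconds between calls."""
    counts = []
    for i in range(samples):
        if i > 0:
            # Wait between samples, but not before the first one.
            sleep(interval)
        counts.append(get_count())
    return counts


def is_growing(counts):
    """Return True if the last sampled count is larger than the first."""
    return len(counts) >= 2 and counts[-1] > counts[0]
```

If is_growing returns False after a few minutes of sampling, continue with the troubleshooting advice in step 4.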
     

Notes

  1. Twitter sends delete messages which identify Tweets that have been deleted. Under your licensing terms, you must process these delete messages and delete the corresponding Tweets from your storage. If you're using MongoDB, DataSift handles those messages for you and deletes the relevant Tweets automatically from your MongoDB database.
     
  2. MongoDB stores data in "collections", which are analogous to tables in a relational database. DataSift creates a new collection for each new subscription. In other words, each time you hit the /push/create endpoint with the MongoDB destination, DataSift creates a new collection and names it after that subscription id.
     
  3. MongoDB automatically indexes the database for you on the interaction id, which uniquely identifies an interaction.
     
  4. Do not apply unique indexes to any MongoDB field because we use a bulk insert process that does not provide an automatic way to ignore duplicates. If you require unique indexes, consider first performing manual deduplication and then moving the data into a separate collection/database; you are free to add indexes there.
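
The manual deduplication mentioned in note 4 could look like the following sketch. It assumes each document carries its interaction id under interaction.id. That field path and the function name are assumptions made for illustration, so adjust them to match your stored documents.

```python
def dedupe_by_interaction_id(docs):
    """Yield the first document seen for each interaction id.

    Documents without an interaction id are passed through unchanged,
    because there is nothing to compare them on.
    """
    seen = set()
    for doc in docs:
        interaction = doc.get("interaction")
        # Assumed layout: {"interaction": {"id": "..."}, ...}
        key = interaction.get("id") if isinstance(interaction, dict) else None
        if key is None:
            yield doc
        elif key not in seen:
            seen.add(key)
            yield doc
```

The surviving documents can then be written to a separate collection or database, where you are free to add unique indexes.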
     

Responding quickly

When receiving a batch of data, your server must respond with a success message within 10 seconds. Otherwise, the call will time out, and the delivery will be considered a failure and reattempted. Please make sure your server can process and store the data fast enough.

 

Output parameters

Parameter: Description:

output_params.host
required
The host name of your MongoDB installation.

output_params.port
optional, default = 27017
You can specify a port or accept the default.

output_params.db_name
required
The name of an existing database.

output_params.collection_name
optional
Optional collection name. When not specified, DataSift will set the name to DataSift_<subscription_id>. For example, 'DataSift_737c7b5f6f19e49f937356275dfd1a79'

output_params.auth.username
required
The username for authorization.

output_params.auth.password
required
The password for authorization.

output_params.use_ssl
optional, default = yes
Whether SSL should be used when connecting to the database. Possible values are:
  • yes
  • no

output_params.verify_ssl
optional, default = yes
If you are using SSL to connect, this specifies whether the certificate should be verified. Turning verification off is useful when the server has a self-signed certificate for development. Possible values are:
  • yes
  • no

output_params.format
optional, default = full_interaction_meta
The output format for your data:
  • basic_interaction_meta - Each payload contains only a basic interaction JSON document.
  • full_interaction_meta - The current default format, where each payload is a full interaction with augmentations.
  • full_interaction_meta_date - Each payload is a full interaction with augmentations and a MongoDB-specific date representation.

Take a look at our Sample Output for Database Connectors page.
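
To show how the output parameters listed above fit together, here is a sketch that builds a form-encoded request body for /push/create. The output_params names come from this page. The name parameter, the output_type value of mongodb, the function name, and all sample values are assumptions made for illustration, so check them against the Push API documentation before use.

```python
from urllib.parse import urlencode


def build_push_create_body(stream_hash, name, host, db_name, username, password,
                           port=27017, use_ssl="yes", verify_ssl="yes",
                           fmt="full_interaction_meta", collection_name=None):
    """Return a URL-encoded body for a /push/create call targeting MongoDB."""
    params = {
        "name": name,                # assumed: a label for the subscription
        "hash": stream_hash,         # stream hash; a Historics id is the alternative
        "output_type": "mongodb",    # assumed connector identifier
        "output_params.host": host,
        "output_params.port": port,
        "output_params.db_name": db_name,
        "output_params.auth.username": username,
        "output_params.auth.password": password,
        "output_params.use_ssl": use_ssl,
        "output_params.verify_ssl": verify_ssl,
        "output_params.format": fmt,
    }
    # Only send a collection name when one is given, so that DataSift can
    # fall back to DataSift_<subscription_id>.
    if collection_name is not None:
        params["output_params.collection_name"] = collection_name
    return urlencode(params)
```

The returned string can be sent as the body of a POST request with curl or any HTTP client. Authentication is left out of the sketch.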

 

Data format delivered: 

MongoDB native format. Each interaction is stored as 1 document.

Storage type: 

One document per interaction.

Limitations: 

Each document is limited to 16 MB.