FTP

Updated on Tuesday, 5 August, 2014 - 14:15

This connector is available only to customers on the Enterprise Edition.

 

FTP is one of the oldest file transfer protocols. Used for many years as a reliable way of distributing files, it remains one of the most important building blocks of the internet. Setting up an FTP server is not difficult, and if you decide to use FTP, DataSift can deliver interactions as files to a designated directory.

Be aware that FTP transmits usernames and passwords as plain text, so they can easily be captured as they cross the public internet. Unless you absolutely have to use FTP, SFTP is a much better choice.

Configuring FTP for Push delivery

To use FTP with Push delivery, follow the instructions below, skipping the steps you have already completed. We use Debian and Ubuntu Linux distributions for our examples, but the principles apply to any operating system:

  1. Update your operating system. Refer to your system's specific instructions for this step. Debian and Ubuntu users can do it using:
     
    sudo apt-get update
     
  2. Install an FTP server package on your system. The exact commands used for that purpose will differ from one operating system to another. Debian and Ubuntu users can type:

    sudo apt-get install vsftpd

    If the installation fails, check that your OS update was successful and try again.
     
  3. Edit /etc/vsftpd.conf to add the following entries to enable passive FTP mode:

    # passive FTP setup
     
    pasv_min_port=10000
    pasv_max_port=10024
     
    # the following address must be the FTP server's public IP address
     
    pasv_address=203.0.113.10
    pasv_enable=YES

    Please note that pasv_address needs a public IP address. If you are not sure what your server's public IP address is, use nslookup with your FTP server's hostname on a different machine:

    nslookup ftp.example.com

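    While you are editing /etc/vsftpd.conf, disable anonymous FTP access. Commenting the option out is not enough, because vsftpd allows anonymous logins by default:

    anonymous_enable=NO

    Also check that local_enable=YES and write_enable=YES are set; vsftpd disables local user logins and uploads unless these options are turned on.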
     
  4. Save your changes.
     
  5. Restart your FTP server:

    sudo /etc/init.d/vsftpd restart
     

  6. Create a new user account:

    sudo adduser dsreceiver
     
  7. Log in to your new account using ssh:

    ssh dsreceiver@ftp.example.com
     
  8. While you are logged in as the new user, create a directory for DataSift deliveries:

    mkdir ~/datasift-ftp
     
  9. Make a note of the absolute path to the newly created directory, as you will need it later. To display the path, type:

    cd ~/datasift-ftp
    pwd
     
  10. Try to log in to your FTP server using the new user's credentials:

    ftp ftp.example.com

    When the server responds, type in your username followed by the password.

    If the login process hangs or the connection is rejected, check your firewall settings. The firewall must allow incoming TCP connections to ports 20 and 21, and to ports 10000-10024 for passive mode support. For extra security, block public access to those ports on your FTP server and open them only to your own network and to the IP address range managed by DataSift. You are free to choose another range of ports for passive FTP connections, but you must remember to update /etc/vsftpd.conf and restart the FTP server. If problems persist, check that your firewall lets outbound packets pass. Most stateful firewalls, such as iptables and pf, do so by default.
     
  11. Turn passive mode on with:

    passive

    The server should respond with:

    Passive mode on.

    If you see errors, check your FTP server's configuration.
     
  12. List the contents of your data delivery directory to check that passive mode is on:

    ls datasift-ftp

    The directory will be empty, but you should see the following response:

    226 Directory send OK.

    Any errors indicate problems with passive mode configuration.
     
  13. You are now ready to set up the FTP connector.
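
The firewall requirement in step 10 has to stay in sync with the passive port range in /etc/vsftpd.conf. The sketch below is one way to do that, assuming iptables is your firewall. The function name pasv_rules is hypothetical, and the function only prints the rules so that you can review them before applying anything. It does not restrict source addresses; add your own network and DataSift's address range as needed.

```shell
#!/bin/sh
# Print (do not apply) iptables rules that match the passive port range
# configured in a vsftpd configuration file.
# Usage: pasv_rules /etc/vsftpd.conf
pasv_rules() {
    conf="$1"
    # Take the last uncommented value of each setting.
    min=$(sed -n 's/^pasv_min_port=\([0-9][0-9]*\).*/\1/p' "$conf" | tail -n 1)
    max=$(sed -n 's/^pasv_max_port=\([0-9][0-9]*\).*/\1/p' "$conf" | tail -n 1)
    if [ -z "$min" ] || [ -z "$max" ]; then
        echo "pasv_min_port or pasv_max_port is not set in $conf" >&2
        return 1
    fi
    # FTP control and data ports, as listed in step 10.
    echo "iptables -A INPUT -p tcp --dport 20:21 -j ACCEPT"
    # Passive mode data ports.
    echo "iptables -A INPUT -p tcp --dport ${min}:${max} -j ACCEPT"
}
```

With the configuration from step 3, `pasv_rules /etc/vsftpd.conf` prints one rule for ports 20:21 and one for ports 10000:10024. Once you are happy with the output, you could apply it with `pasv_rules /etc/vsftpd.conf | sudo sh`.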
     

Configuring Push for delivery to FTP

  1. To enable delivery, you will need to define a stream or a Historics query. Both return important details required for a Push subscription. A successful stream definition returns a hash; a Historics query returns an id. You will need either (but not both) to set the value of the hash or historic_id parameter in a call to /push/create. To obtain that information, make a call to /push/get or /historics/get, or use the DataSift dashboard.
     
  2. Once you have the stream hash or the Historics id, you can supply that information to /push/create. You can make that call using curl or any other tool or programming language.

    For more information, read the step-by-step guide to the API to learn how to use Push with DataSift's APIs.
     
  3. When a call to /push/create is successful, you will receive a response that contains a Push subscription id. You will need that information to make successful calls to all other Push API endpoints (/push/delete, /push/stop, and others). You can retrieve the list of your subscription ids with a call to /push/get.
     
  4. You should now check that the data is being delivered to your server. Log in to your FTP server and list the contents of the data delivery directory:

    ls datasift-ftp

    When DataSift is able to connect and deliver interactions to this directory, it uses filenames that follow the patterns described in the output_params.file_prefix output parameter definition later on this page. The first file it attempts to write is an empty test file called .DS_VERIFY.

    Please remember that the earliest time you can expect the first data delivery is one second after the period of time specified in the output_params.delivery_frequency parameter. A longer delay might mean that the stream has no data in it or that there is a problem with your server's configuration. If you want more information, make a call to /push/log and check the value of the success field. If it is set to failure, check the value of the message field for clues. Also, make a call to /push/get and see whether the response shows that DataSift is retrying delivery to your destination. When the status field is set to retrying, you should verify that your server is able to receive data.

    Please make sure that you watch your usage and add funds to your account when it is running low. Also, stop any subscriptions that are no longer needed; otherwise you will be charged for their usage. There is no need to delete them. You can have as many stopped subscriptions as you like without paying for them. Remember that any subscriptions that were paused automatically due to insufficient funds will resume when you add funds to your account.
     
  5. To stop delivery, call /push/stop. To remove your subscription completely, call /push/delete.
     
  6. Familiarize yourself with the output parameters (for example, the host name) you'll need to know when you send data to an FTP server.
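
As an illustration of step 2, here is a rough sketch of a /push/create call made with curl. Several details are assumptions that this page does not state and that you should check against the API documentation: the base URL, the Auth header format, and the name and output_type fields. The function names and the values chosen for delivery_frequency and max_size are illustrative only.

```shell
#!/bin/sh
# ASSUMPTION: base URL of the DataSift API; override with DATASIFT_API_URL.
API_URL="${DATASIFT_API_URL:-https://api.datasift.com/v1}"

# Print the form fields, one per line, so they can be reviewed before sending.
# Arguments: stream hash, FTP host, absolute directory, FTP username, FTP password.
ftp_push_fields() {
    if [ "$#" -ne 5 ]; then
        echo "usage: ftp_push_fields HASH HOST DIRECTORY USERNAME PASSWORD" >&2
        return 1
    fi
    cat <<EOF
name=ftp-delivery
hash=$1
output_type=ftp
output_params.host=$2
output_params.port=21
output_params.directory=$3
output_params.delivery_frequency=60
output_params.max_size=10485760
output_params.auth.username=$4
output_params.auth.password=$5
EOF
}

# Send the fields to /push/create.
# ASSUMPTION: authentication uses an "Auth: username:api_key" header,
# read here from the DS_USER and DS_API_KEY environment variables.
create_ftp_push() {
    if [ -z "$DS_USER" ] || [ -z "$DS_API_KEY" ]; then
        echo "set DS_USER and DS_API_KEY first" >&2
        return 1
    fi
    fields=$(ftp_push_fields "$@") || return 1
    set -- -sS -X POST "$API_URL/push/create" -H "Auth: $DS_USER:$DS_API_KEY"
    while IFS= read -r field; do
        # --data-urlencode encodes the value part of each name=value pair.
        set -- "$@" --data-urlencode "$field"
    done <<EOF
$fields
EOF
    curl "$@"
}
```

A call might look like `create_ftp_push YOUR_STREAM_HASH ftp.example.com /home/dsreceiver/datasift-ftp dsreceiver YOUR_FTP_PASSWORD`. For a Historics query you would send the Historics id in place of the hash field.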

 

Notes

Twitter sends delete messages which identify Tweets that have been deleted. Under your licensing terms, you must process these delete messages and delete the corresponding Tweets from your storage. Learn more...

Output parameters

Parameter: Description:
output_params.format
optional
default = json_meta
The output format for your data:
  • json_meta - The current default format, where each payload contains a full JSON document. It contains metadata and an "interactions" property that has an array of interactions.
  • json_array - The payload is a full JSON document, but just has an array of interactions.
  • json_new_line - The payload is NOT a full JSON document. Each interaction is flattened and separated by a line break.

If you omit this parameter or set it to json_meta, your output consists of JSON metadata followed by a JSON array of interactions (wrapped in square brackets and separated by commas).

Take a look at our Sample Output for File-Based Connectors page.

If you select json_array, DataSift omits the metadata and sends just the array of interactions.

If you select json_new_line, DataSift omits the metadata and sends each interaction as a single JSON object on its own line.

output_params.host
required

The name of the FTP host that DataSift will connect to.

output_params.port
required

The port that you want DataSift to use on your FTP server.

output_params.directory
required

The absolute path of the directory on the host that DataSift will send each file to.

output_params.delivery_frequency
required

The minimum number of seconds you want DataSift to wait before sending data again.

Learn more...

output_params.max_size
required

The maximum amount of data that DataSift will send in a single batch:

  • 102400 (100KB)
  • 256000 (250KB)
  • 512000 (500KB)
  • 1048576 (1MB)
  • 2097152 (2MB)
  • 5242880 (5MB)
  • 10485760 (10MB)
  • 20971520 (20MB)

output_params.file_prefix
optional
default = DataSift

An optional prefix to the filename. Each time DataSift delivers a file, it constructs a name in this format:

    file_prefix + subscription id + timestamp.json

output_params.auth.username
required

The username for authorization.

output_params.auth.password
required

The password for authorization.

output_params.mark_in_progress
optional

This enables you to see which files are being written and which are complete. Possible values are:

  • true
  • false

If you omit this parameter, it defaults to false. If you set it to true, we append ".part" to the filename while we're writing to it. Once the transfer is complete, we remove ".part".

output_params.compression
optional

Possible values are:

  • none
  • gzip

If you omit this parameter, it defaults to none. If you set it to gzip, we apply GZIP compression to your data and append .gz to your filenames.
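
When mark_in_progress is set to true, a script that processes deliveries should skip files that are still being written. The sketch below shows one way to do that and to read files whether or not gzip compression is on. The function names are hypothetical, and it assumes that ".part" is always the final suffix of an unfinished file.

```shell
#!/bin/sh
# List completed delivery files in a directory, skipping in-progress ones.
# Hidden files such as .DS_VERIFY are not matched by the * pattern.
list_complete() {
    dir="$1"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        case "$f" in
            *.part) ;;                  # still being written, skip
            *) printf '%s\n' "$f" ;;
        esac
    done
}

# Print the contents of a delivery file, decompressing it if it ends in .gz.
read_delivery() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac
}
```

For example, `list_complete /home/dsreceiver/datasift-ftp` prints one path per line, and each path can then be passed to read_delivery.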

 

Data format delivered: 

JSON document containing an array of JSON objects, each representing one DataSift interaction. Here's an example of the output.
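
If you set output_params.format to json_new_line instead, each non-empty line of a delivered file is one interaction, so line-oriented tools can work on a file without a JSON parser. A minimal sketch, using a hypothetical function name:

```shell
#!/bin/sh
# Count the interactions in a json_new_line delivery file:
# one JSON object per non-empty line.
count_interactions() {
    awk 'NF { n++ } END { print n + 0 }' "$1"
}
```

This shortcut does not apply to json_meta or json_array files, where the whole file is a single JSON document.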

Storage type: 

For each delivery, DataSift sends one file containing all the available interactions.

Limitations: 

Take care when you set the max_size and delivery_frequency output parameters. If your stream generates data at a faster rate than you permit the delivery, the buffer will fill up until we reach the point where data may be discarded.
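
To judge whether a pair of settings can keep up with your stream, you can estimate the highest sustained delivery rate they allow. The sketch below assumes that DataSift sends at most one batch of max_size bytes per delivery_frequency interval, which is a simplification of how the two parameters are described above. The function name is hypothetical.

```shell
#!/bin/sh
# Rough upper bound on the delivery rate, in bytes per second (rounded down).
# Arguments: max_size in bytes, delivery_frequency in seconds.
max_bytes_per_second() {
    max_size="$1"
    frequency="$2"
    if [ "$frequency" -le 0 ]; then
        echo "delivery_frequency must be greater than 0 for this estimate" >&2
        return 1
    fi
    echo $(( max_size / frequency ))
}
```

For example, `max_bytes_per_second 10485760 60` prints 174762, or roughly 170KB per second. If your stream produces data faster than that for long periods, choose a larger max_size or a shorter delivery_frequency.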