If you are having trouble using DataSift, here is the recommended plan of action (in order):
Consult the documentation first. If you believe you are having problems with the DataSift platform, please refer first to our Status page to view real-time information on the status of each of our services. @DataSiftAPI is also updated to reflect the status of the platform.
The official DataSift developer support forum is dev.datasift.com/discussions. Both DataSift employees and the DataSift developer community will regularly check, update, and answer questions in the forum.
If you are having issues with the DataSift platform, or have a tip or advice for the DataSift developer community, post it there, but please search the previous discussions before opening a new thread.
Our Support help desk is available at support.datasift.com. This is primarily for subscription customers, but our support team will try to reply to every message received there.
If you have exhausted the above list of resources and feel confident your needs cannot be met there, or you need to speak directly to a support agent, or you have billing or account queries, contact the support team at support.datasift.com.
Full details of DataSift's support terms can be found on our Support Policy page.
A successful application built on DataSift is likely to attract attention. Most of that attention will be good: users singing your praises, and developers complimenting your programming. Some of that attention, though, might be negative.
This page is based very closely on material written by our friends at Twitter, who came up with a set of ideas designed to put you on a path towards better security in your application. It's not the final word; far from it. If there's anything you'd like to see added to it, please let us know. If you've discovered a security issue that directly affects DataSift, please email email@example.com.
The following threats are applicable no matter what your platform.
Authentication in DataSift requires the API key and username.
If you are retaining API keys, consider encrypting them. Twitter uses bcrypt-ruby, but there are lots of other ways to store encrypted information.
Don't assume that your users will provide you with valid, trustworthy data. Sanitize all data, checking for sane string lengths, valid file types, and so on. DataSift attempts to sanitize data POSTed to our APIs, but a little client-side help goes a long way. Whitelist the types of input that are acceptable to your application and discard everything that isn't on the whitelist.
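As a minimal sketch of the whitelist approach described above, the following Python checks a name and a file type against explicit rules and rejects everything else. The field names, the length limit, and the allowed file types are hypothetical choices made for illustration, not DataSift requirements.

```python
import re

# Hypothetical rules for illustration; adjust them to what your application accepts.
ALLOWED_FILE_TYPES = {"csv", "json"}
MAX_NAME_LENGTH = 64
NAME_PATTERN = re.compile(r"[A-Za-z0-9 _-]+")


def validate_upload(name: str, file_type: str) -> bool:
    """Return True only if both fields satisfy the whitelist rules."""
    # Reject empty or overly long strings before looking at their content.
    if not name or len(name) > MAX_NAME_LENGTH:
        return False
    # Accept only names made up entirely of whitelisted characters.
    if NAME_PATTERN.fullmatch(name) is None:
        return False
    # Accept only file types that are explicitly listed.
    return file_type.lower() in ALLOWED_FILE_TYPES
```

Anything that fails the check is discarded rather than repaired, which keeps the set of accepted inputs small and easy to reason about.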
Be sure that you're not exposing sensitive information through debugging screens/logs. Some web frameworks make it easy to access debugging information if your application is not properly configured. For desktop and mobile developers, it's easy to accidentally ship a build with debugging flags or symbols enabled. Build checks for these configurations into your deployment/build process.
Ensure that your tests (you do have tests, right?) check not just that you can do what you should be able to do, but that bad guys can't do what they shouldn't be able to do. Put yourself in an attacker's mindset and whip up some evil tests.
Have you set up firstname.lastname@example.org? Do those emails go right to your phone? Make it easy for people to contact you about potential security issues with your application. If someone does report a security flaw to you, be nice to them; they've just done you a huge favor. Thank them for their time and fix the issue promptly. It's fairly common for security researchers to write about vulnerabilities they've discovered once the hole has been closed, so don't be upset if your application ends up in a blog post or research paper. Security is hard, and nobody is perfect. As long as you're fixing the issues that are reported to you, you're doing right.
Consider hiring security professionals to do an audit and/or penetration test. You can't depend solely on the kindness of strangers; for every vulnerability that someone was nice enough to report to you, there are ten more that malicious hackers have found. A good security firm will dig deep to uncover issues. Look for firms and individual consultants that do more than run a few automated tools.
If your application is (going to be) handling money, you may be required by law to adhere to certain security practices and regulations. Find out what's applicable to you and make sure you're up to code.
One easy-to-remember approach to input validation is FIEO: Filter Input, Escape Output. Filter anything from outside your application, including DataSift API data, cookie data, user-supplied form input, URL parameters, data from databases, etc. Escape all output being sent by your application, including SQL sent to your database server, HTML you send to users' browsers, JSON/XML output sent to other systems, and commands sent to shell programs.
Generally: if HTML isn't needed from some user-facing form, filter it out; for example, there's no reason to allow anything other than integers when storing a phone number. If HTML is needed, use a known-good whitelist filter. HTMLPurifier for PHP is one such solution. Different contexts may require different filtering approaches. See the OWASP XSS Prevention Cheat Sheet for more on filtering.
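To illustrate FIEO in the simplest case, here is a small Python sketch that filters a phone number down to digits on input and escapes untrusted text before placing it in HTML on output. The function names are hypothetical, and `html.escape` only covers the HTML body context; other contexts (attributes, JavaScript, URLs) need their own escaping, as the OWASP cheat sheet explains.

```python
import html


def digits_only(raw: str) -> str:
    """Filter input: keep only ASCII digits, e.g. for a phone number field."""
    return "".join(ch for ch in raw if ch in "0123456789")


def render_comment(author: str, text: str) -> str:
    """Escape output: neutralize HTML metacharacters in untrusted values."""
    return "<p><b>{}</b>: {}</p>".format(html.escape(author), html.escape(text))
```

With this in place, a comment containing a `<script>` tag is rendered as visible text instead of being executed by the browser.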
If your application makes use of a database, you need to be aware of SQL injection. Again, anywhere you accept input is a potential target for an attacker to break out of their input field and into your database. Use database libraries that protect against SQL injection in a systematic way. If you break out of that approach and write custom SQL, write aggressive tests to be sure you aren't exposing yourself to this form of attack.
The two main approaches to defending against SQL injection are escaping before constructing your SQL statement and using parameterized input to create statements. The latter is recommended, as it's less prone to programmer error.
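The parameterized approach can be sketched with Python's built-in `sqlite3` module. The `users` table and the `find_user` helper are hypothetical and exist only to show the pattern: the SQL text contains a `?` placeholder, and the user-supplied value is passed separately so that it is never interpreted as SQL.

```python
import sqlite3


def find_user(conn: sqlite3.Connection, username: str) -> list:
    """Look up a user with a parameterized query instead of string concatenation."""
    cursor = conn.execute(
        "SELECT id, username FROM users WHERE username = ?",
        (username,),  # the driver binds this value; it is not spliced into the SQL
    )
    return cursor.fetchall()


# Hypothetical in-memory database used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.executemany("INSERT INTO users (username) VALUES (?)", [("alice",), ("bob",)])
```

A classic injection string such as `' OR '1'='1` is treated as a literal username here, so it matches no rows instead of returning the whole table.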
Are you sure that requests to your application are coming from your application? CSRF attacks exploit this lack of knowledge by forcing logged-in users of your site to silently open URLs that perform actions. In the case of a DataSift app, this could mean that attackers are using your app to force users to post unwanted tweets or follow spam accounts. You can learn more about this sort of attack on PHP security expert Chris Shiflett's blog.
The most thorough way to deal with CSRF is to include a random token in every form that's stored someplace trusted; if a form doesn't have the right token, throw an error. Modern web frameworks have systematic ways of handling this, and might even be doing it by default if you're lucky. A simple preventative step (but by no means the only step you should take) is to make any actions that create, modify, or destroy data require a POST request.
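If your framework does not already handle this for you, the token scheme can be sketched roughly as follows in Python. The `session` dictionary stands in for whatever trusted server-side session store your application uses, and the function names are hypothetical.

```python
import hmac
import secrets


def issue_csrf_token(session: dict) -> str:
    """Generate a random token, remember it server-side, and return it for the form."""
    token = secrets.token_urlsafe(32)
    session["csrf_token"] = token
    return token


def is_valid_csrf_token(session: dict, submitted: str) -> bool:
    """Check the token submitted with a form against the one stored in the session."""
    expected = session.get("csrf_token")
    if not expected or not submitted:
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), submitted.encode("utf-8"))
```

The token returned by `issue_csrf_token` would be embedded in a hidden form field, and any POST whose token fails `is_valid_csrf_token` would be rejected with an error.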
If you think there's an issue with your web application, how do you find out for sure? Have critical exceptions and errors emailed to you and keep good logs. You may want to put together a dashboard of critical statistics so that you can see at a glance if something is going wrong (or staying right).
We could use suggestions from desktop developers about the security issues they've run into. Developers working in sufficiently high-level languages shouldn't be dealing with buffer overflows and the usual security issues. What have you defended against?
Once you have an API key for a user, where do you keep it? Ideally, in an encrypted store managed by your operating system. On Mac OS X, this would be the Keychain. In the GNOME desktop environment, there's the Keyring. In the KDE desktop environment, there's KWallet. In Windows, use a two-way encryption tool.
The below links are great ways to learn more about security. Security is a deep topic, but don't feel like you have to learn everything about it before taking proactive steps to lock down your application. A little security-mindedness goes a long way.
You can embed a stream in your own website. There are several scenarios where this technique might be valuable:
The embedded stream might look like this:
We use a special API call: http://widget.datasift.com/embed
There's just one parameter, essence, which uniquely identifies a stream. In the UI, click Share CSDL to get the value of essence for any stream.
To include this stream on your page, add a <script> tag like this:
Thursday September 6, 2012
Wednesday August 29, 2012
- Added the twitter.mention_ids target.
Friday August 17, 2012
- Updates and fixes applied to the Salience Augmentation.
Thursday August 16, 2012
- Released the Push Delivery system.
Tuesday August 14, 2012
- Stream hash billing deduplication: Users will now only be charged DPU costs once if consuming the same stream hash multiple times.
Tuesday July 24, 2012
- A number of fixes were applied to Historics.
Monday July 9, 2012
- Released the DataSift Demographics data source.
Tuesday June 26, 2012
- Improved the language augmentation: better language classification accuracy and support for over 100 new languages.
Tuesday June 19, 2012
- User Interface updates - reworked the "Recordings" page into Tasks.
- Added the 'current' argument to the /usage 'period' parameter. Allows you to return details of streams which were active over the last five minutes.
- Performance updates for recordings.
- Stability update when exporting a recording to an S3 bucket.
- Bugfix in WebSocket streaming service.
Wednesday May 30, 2012
- Added Twitter User Status messages to live streams. Please see our Twitter User Status Messages documentation for full details.
- Added the twitter.status target to return Twitter User Status messages.
Monday May 28, 2012
- Enhancements made to improve platform scalability.
Friday May 11, 2012
- Added SSL support to our streaming API. You can now consume streams through https://stream.datasift.com
- Minor UI fixes.
Tuesday May 1, 2012
- Bugfix to correct MIME emails not being displayed correctly.
Tuesday April 24, 2012
- Performance improvements in stream filtering.
- Bugfix related to proper escaping of double quotes (") in CSV exports.
- Bugfix to ensure the klout.network object is returned as the correct data type (float).
Thursday April 19, 2012
- Added improved Salience engine, designed to perform more accurate Sentiment Analysis and Entity Extraction when working with short texts such as Tweets.
- Added a new twitter.user.verified target allowing you to only receive Tweets from "verified" accounts.
Thursday April 5, 2012
- Added a new REST API endpoint - /balance. Allows you to determine your remaining credit or DPU balance.
- Updated Klout augmentation to use new Klout API.
- A number of minor user interface fixes.
- Fixed issue related to exporting very short recordings.
- Klout augmentation bug fix.
Thursday February 23, 2012
- Added Tweet delete notifications to live streams. Please see our Twitter Deletes documentation for more information.
- Updated Twitter data source license. Please read and sign the new agreement on our Data Sources page.
- Salience sentiment score precision has been reduced from float to an integer.
- Bugfix related to stream disconnect issues. See Issue #11 for more details.
- Bugfix related to compatibility with older WebSocket versions.
Monday February 13, 2012
- New approach to the way you access licensed data feeds and augmentations.
- Our Klout data source now requires a license to be signed.
- Releasing our new Klout Profiles data source.
- Multiple simultaneous logins on the same user account are now possible.
Wednesday January 11, 2012
- Completed exports now expire after a week.
- Stability enhancements on the Recorder and Exporter. Fixed known issue: DataSift Exporter Failing.
- Minor fixes to the UI.
- Changed the return type of facebook.likes.ids to a list of strings.
- Fixed an issue with the filtering engine treating punctuation as whitespace. See related FAQ on Filtering for #hashtags in your CSDL.
Twitter sends us a notification whenever one of their users deletes a Tweet. We pass these notifications on to you as part of your stream. If you are storing Tweets you must take account of all of these delete messages in order to comply with Twitter's Terms of Service.
If you are a Twitter user with a private account, DataSift never receives your Tweets. Information from private accounts is not released by Twitter so there are no delete messages from these accounts either.
When a user deletes a Tweet, we automatically remove it from our database, so it is no longer available on our Historics service.
We recommend that you read the Data Privacy FAQ.
Here's an example of a delete message in JSON format.
Your client code must process all delete messages, removing the corresponding Tweet. You must discard the delete message itself; you are not permitted to store any delete messages and you must not create a service that publicizes delete messages.
Our own client libraries provide support for delete messages. In your chosen client library, look for an on_deleted event handler and add code to locate and delete the corresponding Tweet.
DataSift maintains a cache of data from the past five days, allowing us to map any Twitter id found in a Twitter delete request to a DataSift interaction id. If a Twitter user deletes a Tweet within five days of posting it, DataSift looks up the corresponding interaction id, generates a delete message for you, and sends it out in your stream.
For Tweets deleted more than five days after posting, we attempt to match the Twitter user_id to the CSDL in your streams. If we find a match, we will send a delete message containing as much relevant information as we have.
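The processing rule for delete messages can be sketched as follows in Python. This is an illustration under stated assumptions, not the client libraries' implementation: it assumes each message is a JSON object, that a delete message carries a `deleted` flag, and that the interaction id sits at `interaction.id`. Check those field names against the messages you actually receive. The `store` dictionary is a hypothetical stand-in for your own database.

```python
import json


def handle_message(raw: str, store: dict) -> None:
    """Store ordinary interactions; on a delete message, remove the stored Tweet."""
    message = json.loads(raw)
    # Assumed location of the interaction id; verify against real messages.
    interaction_id = message.get("interaction", {}).get("id")
    if interaction_id is None:
        return  # nothing this sketch can match on
    if message.get("deleted"):  # assumed flag marking a delete message
        # Remove the corresponding Tweet and do not keep the delete message itself.
        store.pop(interaction_id, None)
        return
    store[interaction_id] = message
```

Delete messages for Tweets older than five days may carry less information, so a production handler would also need a way to match on whatever identifiers are present, which this sketch leaves out.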
A Twitter User Status Message is a message forwarded on from Twitter to you through your DataSift stream, alerting you of a change to the status of a Twitter user's account.
Twitter users expect any changes made to their settings on Twitter.com to be reflected across Twitter integrations across the web. This is important for both privacy and security reasons, as Twitter may temporarily suspend certain accounts if they believe them to be compromised. To this end, in order to comply with Twitter's Terms of Service (ToS), your application must handle each user status message as specified in the Twitter User Status Messages documentation page.
Client library support is not enough to comply with Twitter's ToS. You must ensure your implementation of the client library takes the appropriate action and acts on the user status message for the Twitter user account in question.
User status messages are sent in your streams in JSON format. Below is an example of a user_suspend message:
The different user status messages you might receive are listed on our official Twitter User Status Messages page.
You will receive these messages for any Twitter user you follow using the twitter.user.id target in your CSDL.
This must be implemented by August 1st, 2012.
Twitter sends us a notification whenever the account status of one of its users changes. We pass these notifications on to you as part of your stream when you are filtering for Tweets from these users. If you are storing Tweets, you must take account of these changes in order to comply with Twitter's Terms of Service.
Your client code must process all user status update messages and act in accordance with the guidelines laid out below:
|Message|Meaning|Required action|Notes|
|---|---|---|---|
|user_protect|User has protected their account|The implementation should no longer display, use, or transmit Tweets from this user's account until the account is unprotected|Accounts can be toggled between protected and unprotected at will|
|user_unprotect|User has unprotected their account|The implementation can now display, use, and transmit Tweets from this user's account|Accounts can be toggled between protected and unprotected at will|
|user_suspend|User's account has been suspended|The implementation should no longer display, use, or transmit Tweets from this user's account until the account is unsuspended|Accounts can be unsuspended once suspended|
|user_unsuspend|User's account has been unsuspended|The implementation can now display, use, and transmit Tweets from this user's account|Accounts can be unsuspended once suspended|
|user_delete|User's account has been deleted|The implementation should no longer display, use, or transmit Tweets from this user's account, and should remove all traces of the account|Accounts can be undeleted via Twitter's admin|
|user_undelete|User's account has been restored|The implementation can now begin to display, use, and transmit any new Tweets from this user's account|Accounts can be undeleted via Twitter's admin|
Here's an example of a 'user_suspend' message in JSON format.
In the same way as you are obliged to delete individual Tweets that you receive Delete Messages for, you must ensure your client can recognize and handle User Status messages correctly, by deleting or suspending a user profile when necessary.
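One way to act on the six status values listed above is a small dispatcher like the Python sketch below. It deliberately takes the status string and user id as arguments, so it makes no assumption about the JSON layout of the message. The `visible_users` set and `stored_tweets` dictionary are hypothetical stand-ins for your own data store.

```python
HIDE_STATUSES = {"user_protect", "user_suspend"}
SHOW_STATUSES = {"user_unprotect", "user_unsuspend", "user_undelete"}


def apply_user_status(status: str, user_id: str,
                      visible_users: set, stored_tweets: dict) -> None:
    """Update local state so that it reflects a Twitter user status change."""
    if status in HIDE_STATUSES:
        # Stop displaying, using, or transmitting Tweets until the state is reversed.
        visible_users.discard(user_id)
    elif status in SHOW_STATUSES:
        # Tweets from this account may be displayed, used, and transmitted again.
        visible_users.add(user_id)
    elif status == "user_delete":
        # Remove all traces of the account, including any stored Tweets.
        visible_users.discard(user_id)
        stored_tweets.pop(user_id, None)
    else:
        raise ValueError("unknown user status: {}".format(status))
```

Raising on an unknown status is a design choice for this sketch: it makes an unexpected message type visible in your logs instead of being silently ignored.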
You can create a stream either in DataSift's GUI or via an API call. Version control for your CSDL source code is automatically available for all the streams that you create in the GUI. Streams that you create via the API are not versioned. Here are the details you need to know about version control and versioning.
When you create a stream via an API call, DataSift gives the stream a hash.
To create a stream via the API, you use the /compile endpoint in the REST API. There is no way to edit a stream using the API so versioning is not required. If you need to alter your CSDL, simply compile a new stream.
The recommended way to consume a stream is with the Streaming API. You can consume any stream that you own. You can also consume any public stream, regardless of who owns it, as long as you know its hash.
The /compile endpoint returns the hash when you compile your stream. It's important that you make a note of the hash because there is no way to retrieve it later.
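Because the hash cannot be retrieved later, it can help to record it together with the CSDL it was compiled from as soon as /compile returns. The Python sketch below keeps such a record in a local JSON file. The file layout and function names are hypothetical, and the call to /compile itself is left out.

```python
import json
from pathlib import Path
from typing import Optional


def remember_stream(registry_path: Path, stream_hash: str, csdl: str) -> None:
    """Record the hash returned by /compile alongside the CSDL that produced it."""
    registry = {}
    if registry_path.exists():
        registry = json.loads(registry_path.read_text(encoding="utf-8"))
    registry[stream_hash] = csdl
    registry_path.write_text(json.dumps(registry, indent=2), encoding="utf-8")


def lookup_csdl(registry_path: Path, stream_hash: str) -> Optional[str]:
    """Return the CSDL recorded for a hash, or None if it was never recorded."""
    if not registry_path.exists():
        return None
    registry = json.loads(registry_path.read_text(encoding="utf-8"))
    return registry.get(stream_hash)
```

Keeping the registry file under version control would also give API-created streams a simple history, which they do not get automatically.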
A stream that you create via the GUI can have versions. Every time that you edit the stream in the GUI, DataSift creates a new version. DataSift gives each version its own hash, unless the edit was trivial; for example: adding whitespace. The GUI allows you to see the version history and revert to an earlier version.
Once you have created a stream via the GUI, you can preview it in the GUI or consume it via the API. If you preview the stream in the GUI, you will see the output from the latest version of the stream. However, if you decide to run a stream via the API, you can select any version you want, addressing it by means of its hash.
Suppose a stream created in the GUI has seven versions: 1, 2, 3, 4, 5, 6, and 7, where 1 is the earliest version and 7 is the latest.
To see the version history, select the stream and click Edit Definition. DataSift displays the history next to the CSDL code window. For our example, it looks like this:
Each of these versions has its own hash. If you already know the hash for the version that you want to use, you can run the stream via the Streaming API immediately.
To find the hash of an older version, revert to that old version temporarily, then revert back to the latest version:
At the core of the DataSift platform is the Curated Stream Definition Language (CSDL), which offers our users a rich set of features for filtering across high volumes of real-time and historic data. Each filter definition that you create is stored within the DataSift platform, and the purpose of this document is to explain the process that DataSift uses to define, store, and access CSDL filter definitions, with particular reference to security and intellectual property considerations.
When you create a CSDL filter via our API or our User Interface, we store the definition of that filter (the CSDL code) within the DataSift platform. We generate a unique identifier known as the "stream hash" or "hash identifier". An example may look like this:
If you created the filter via our API, we send this hash directly back to you. If you created it via the User Interface, the hash is available there, just one mouse click after you save your code. We generate the hash using a one-way algorithm and we use the hash both internally and externally to uniquely identify CSDL filter definitions. We do not expose the mapping between the hash and the filter definition outside of the platform.
Given that the hashing method for generating the identifier is a one-way process, it is impossible to derive the filter definition from the hash. As a result, if you forget your CSDL definition, it is impossible for you to retrieve it from DataSift. The hash can still be used to filter data, but its associated definition cannot be discovered from the hash alone.
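The general idea of a one-way identifier can be illustrated with a standard cryptographic hash in Python. This is purely an illustration: the algorithm DataSift uses is not stated here, SHA-256 is a stand-in, and `illustrative_hash` is a hypothetical name.

```python
import hashlib


def illustrative_hash(csdl: str) -> str:
    """Derive a fixed-length identifier from CSDL text using a one-way function."""
    return hashlib.sha256(csdl.encode("utf-8")).hexdigest()
```

The same CSDL always yields the same identifier, and different CSDL yields a different one, but the digest contains no recoverable copy of the text. That is why the definition has to be kept on your side if you want to see it again.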
We control access to each filter definition in DataSift. We authenticate all requests to consume a stream and we log requests using a combination of a unique username and API key. DataSift supports the ability to change both the username and API key via our user interface.