Best Practices for Managed Sources

It's easy to get started with Managed Sources, but there are some important points to keep in mind.

Advice for each source can be found on the source's overview page. On this page we summarize the best practices that apply to all Managed Source types.

Advice on access tokens

Managed Sources give you access to a number of data providers that require additional authentication credentials, usually via OAuth 2.0 tokens. In order to deliver the data you want to filter for, DataSift has to have access to those credentials.

When you are using the DataSift dashboard, credential acquisition is a matter of authenticating with the data source of your choice using a special pop-up dialog. DataSift receives and stores the credentials it needs without you having to do anything beyond logging in to Facebook, Google+, or other data providers available via Managed Sources.

Users of the API will need to make some additional effort, but they gain more flexibility.

Exclusivity

Whatever access credentials you decide to create and use, make sure that you only give them to DataSift. This lets us manage tokens in a way that ensures uninterrupted flow of data. If you use a token with DataSift and also use it outside DataSift, you may experience unexpected behavior.

The platform ensures that you cannot add the same token to more than one Managed Source. Again, this is to ensure predictable behavior.

Expiry Times

Some types of credentials expire after a pre-defined period of time. You can provide us with details of when tokens expire when using the /source/create and /source/update endpoints.

If you provide the expiry time with your tokens, we will send an email notification five days before a token is due to expire, and notify you again when the token does expire.

Once a token expires, it is no longer used by the source. If all the tokens for a source expire, you will no longer receive data from that source.
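As a rough illustration of the timing described above, the sketch below works out when the five-day warning would fall for a given expiry time. The function name and the use of a Unix timestamp for the expiry are assumptions made for the example; they are not taken from the DataSift API.

```python
from datetime import datetime, timedelta, timezone

# Five-day warning window, as described in the text above.
WARNING_WINDOW = timedelta(days=5)


def warning_time(expires_at: int) -> datetime:
    """Return the UTC time five days before a token's expiry.

    `expires_at` is assumed to be a Unix timestamp in seconds.
    """
    expiry = datetime.fromtimestamp(expires_at, tz=timezone.utc)
    return expiry - WARNING_WINDOW


# Hypothetical token expiring at 2015-01-31 00:00:00 UTC.
expiry_ts = int(datetime(2015, 1, 31, tzinfo=timezone.utc).timestamp())
print(warning_time(expiry_ts).isoformat())  # 2015-01-26T00:00:00+00:00
```

A check like this can be useful on your side for scheduling token renewal ahead of the notification.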

Which Token Is Best For You?

Your choice of access tokens will depend on your needs and your level of experience. You will most likely want to get started with Managed Sources through the DataSift dashboard, but if you want to get the most out of Managed Sources you should use the API.

One of the advantages of using the API is the freedom to use different types of access credentials or more than one set of access credentials. Another is related to the limits imposed by data providers. When you authenticate with a Managed Source using the dashboard, the access token received by DataSift may be subject to rate, age, or location limits and those limits will apply to every user. This may not be an optimal solution for you.

Watching large numbers of resources

If you want to watch a large number of resources, be aware that most sources impose rate limits. That means there's a limit to the number of requests we can make on your behalf. For instance, if we can make only one request per second, it will take more than 15 minutes to check 1,000 pages for updates.
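The arithmetic behind that figure can be sketched as follows. The function name and the assumption of one request per resource are illustrative only; real sources usually need more than one request per resource.

```python
def minutes_to_check(resources: int,
                     requests_per_second: float,
                     requests_per_resource: int = 1) -> float:
    """Estimate how long one full pass over all resources takes, in minutes."""
    total_requests = resources * requests_per_resource
    return total_requests / requests_per_second / 60


# 1,000 pages at one request per second, one request per page.
print(round(minutes_to_check(1000, 1.0), 1))  # 16.7
```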

Using multiple access tokens

You can configure any Managed Source to use multiple access tokens. When multiple tokens are configured, the Managed Source rotates through them, using the next token for each request it makes to the source API. Spreading requests across tokens in this way helps the source avoid hitting rate limits and allows you to ingest more data from more resources.
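To make the idea of rotation concrete, here is a minimal sketch of round-robin token selection. It illustrates the general technique only; the exact rotation strategy used by the platform is not described here, and the function name and token values are placeholders.

```python
from itertools import cycle


def rotate_tokens(tokens):
    """Return an iterator that hands out tokens in round-robin order."""
    if not tokens:
        raise ValueError("at least one token is required")
    return cycle(tokens)


tokens = ["token-a", "token-b", "token-c"]  # placeholder values
rotation = rotate_tokens(tokens)

# Each outgoing request takes the next token in the rotation.
first_five = [next(rotation) for _ in range(5)]
print(first_five)
# ['token-a', 'token-b', 'token-c', 'token-a', 'token-b']
```

With three tokens, each token carries roughly a third of the requests, so each stays further from its own rate limit.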

How many access tokens do I need?

We’re often asked, "How many Tokens should I use to monitor all the Resources in my Managed Sources?" The answer is that it depends.

DataSift makes API requests (authorized by your Tokens) to regularly pull content from APIs such as Facebook or Instagram for the resources you choose to monitor. How many API calls we need to make depends on the number of resources you are monitoring, the number of posts created by each of the resources, and the amount of engagement (comments) on each of the posts.

Imagine you have a Facebook Pages Managed Source using one token, monitoring 5 resources, each with 5 posts which each have 2 pages of comments. For each resource, this works out to be roughly one request to return the list of posts, then roughly two requests per post to return the two pages of comments. That is at least 11 API requests for each resource (around 55 across the 5 resources), and more if you have enabled other parameters such as Page likes, Like Counts, etc. If you’re monitoring resources with significantly higher volumes of posts and engagement (such as the Facebook Page for a popular news outlet), you should expect a dramatic increase in the number of interactions returned, and in the number of API calls required to return those interactions.
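The estimate in this example can be sketched as a lower bound, assuming one request for each resource's list of posts plus one request per page of comments on each post. The function name is hypothetical, and optional parameters such as Page likes would add requests that are not counted here.

```python
def estimate_requests(resources: int,
                      posts_per_resource: int,
                      comment_pages_per_post: int) -> int:
    """Lower-bound estimate of API requests for one pass over all resources."""
    # One request for the post list, plus one per page of comments per post.
    per_resource = 1 + posts_per_resource * comment_pages_per_post
    return resources * per_resource


print(estimate_requests(1, 5, 2))  # 11 requests for a single resource
print(estimate_requests(5, 5, 2))  # 55 requests across five resources
```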

Typically we recommend monitoring no more than 100 Resources using a single Token but, as described above, the number of Resources you can monitor with a single Token varies depending on the content available from the Resources.
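As a starting point, the 100-Resources-per-Token guideline translates into a simple calculation. This is only a sketch of the rule of thumb; the function name is an assumption, and busy resources may call for a lower ratio than the default used here.

```python
import math

# Rule-of-thumb ceiling from the text above; lower it for busy resources.
RESOURCES_PER_TOKEN = 100


def tokens_needed(resources: int,
                  resources_per_token: int = RESOURCES_PER_TOKEN) -> int:
    """Return the minimum number of tokens suggested by the guideline."""
    if resources <= 0:
        return 0
    return math.ceil(resources / resources_per_token)


print(tokens_needed(250))      # 3
print(tokens_needed(250, 25))  # 10, for high-volume resources
```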

Source validation with a large number of resources

When you create or update a source, by default the source will be validated, including the resources you have chosen.

Validation is performed by querying the API for the social network. If your source contains many resources, the API may return a validation error because the request times out while validating the large collection of resources.

If you are sure your resources are valid, you can set the validate parameter to false when calling /source/create or /source/update. This disables validation so that your source can be created or updated successfully.
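The sketch below shows where the validate parameter might sit among the other request parameters. Only validate is named on this page; every other field name, the structure of the resources and auth values, and the helper function itself are placeholders, so check the endpoint documentation for the real request format. Nothing is sent over the network.

```python
import json


def build_source_params(name, source_type, resources, tokens, validate=True):
    """Assemble illustrative request parameters for creating a source.

    Only `validate` comes from the text above. All other field names and
    shapes are assumptions made for this example.
    """
    return {
        "name": name,                                        # assumed field
        "source_type": source_type,                          # assumed field
        "resources": json.dumps(resources),                  # assumed shape
        "auth": json.dumps([{"token": t} for t in tokens]),  # assumed shape
        "validate": "true" if validate else "false",
    }


params = build_source_params(
    name="Many pages",
    source_type="example_source_type",  # placeholder value
    resources=[{"id": "example-page-1"}, {"id": "example-page-2"}],
    tokens=["placeholder-token"],
    validate=False,  # skip validation for a large, known-good resource list
)
print(params["validate"])  # false
```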

Managed source health

How can I tell if I am using enough tokens?

It’s possible that the resources monitored in a Managed Source contain more data than you can consume using a single Token; this could result in your data being delayed or potentially lost. You can typically reduce this risk by providing a higher Token-to-resource ratio for the sources that need it most.

You can see whether your source has enough tokens configured by looking at the latency between when the interaction was created on the social network, and when the interaction was ingested into the DataSift platform.

Every interaction provided by DataSift contains the following two fields:

  • interaction.created_at - the time at which the interaction itself was created on the third-party site, i.e. the time an individual left a Comment on an Instagram Post
  • interaction.received_at - the time at which DataSift ingested this interaction

You can analyze the latency of the interactions being ingested by your source by comparing these fields for each interaction you see.

There are, however, a couple of caveats:

  • When you start a new Source, or add new Resources to an existing source, we pull in some content posted on these new Resources over the last few days; this may skew your results, as this content will carry older timestamps than most of the content received from a Source that has been running for a few days
  • like_count and comment_count interactions are generated by DataSift, so they should not be measured in the same way as other interaction subtypes. For these, consider measuring the time between *_count updates received for specific posts
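The comparison described above can be sketched as follows. The flattened dictionary layout, the "subtype" key, and the accepted timestamp formats (an RFC 2822 date string or a Unix timestamp) are assumptions made for the example, so adapt the parsing to the data you actually receive. The median is used because it is less sensitive to the older, backfilled content mentioned in the first caveat.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from statistics import median

# Subtypes generated by DataSift itself, per the second caveat above.
GENERATED_SUBTYPES = {"like_count", "comment_count"}


def to_datetime(value):
    """Parse a Unix timestamp or an RFC 2822 date string (assumed formats)."""
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    return parsedate_to_datetime(value)


def latency_seconds(interaction):
    """Seconds between creation on the source and ingestion by DataSift."""
    created = to_datetime(interaction["created_at"])
    received = to_datetime(interaction["received_at"])
    return (received - created).total_seconds()


def median_latency(interactions):
    """Median latency, skipping the DataSift-generated *_count subtypes."""
    samples = [latency_seconds(i) for i in interactions
               if i.get("subtype") not in GENERATED_SUBTYPES]
    return median(samples) if samples else None


sample = [  # hypothetical, flattened interactions
    {"subtype": "comment",
     "created_at": "Thu, 01 Jan 2015 12:00:00 +0000",
     "received_at": "Thu, 01 Jan 2015 12:02:00 +0000"},
    {"subtype": "post",
     "created_at": "Thu, 01 Jan 2015 12:10:00 +0000",
     "received_at": "Thu, 01 Jan 2015 12:11:00 +0000"},
    {"subtype": "like_count",
     "created_at": "Thu, 01 Jan 2015 12:00:00 +0000",
     "received_at": "Thu, 01 Jan 2015 13:00:00 +0000"},
]
print(median_latency(sample))  # 90.0
```

A median that keeps growing over time suggests the source needs more tokens for the resources it is monitoring.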

How can you tell if one of your Managed Sources is healthy?

You can use the /source/log endpoint to view messages relating to a source. If at any time the platform has an issue pulling data from a resource or using a specific token, it will log the resource and token pair to the source's log. You can then look at the details of these objects using the /source/get endpoint.

For example, you might try to monitor a Facebook resource that is geo-restricted (you can only access its content using user tokens generated by individuals living in a certain region). If the platform attempts to read content from that resource using a token you provided which was generated by an individual who does not live in that region, the request will be rejected by Facebook. This issue will be logged in the source's log.

For Facebook in particular, you may want to use their Graph API Explorer or Access Token Debugger to investigate any issues with your resources or tokens.

Avoiding duplicate interactions

Each managed source is independent. For example, if you set up two separate Instagram managed sources they will operate as if the other did not exist.

Each Managed Source takes care of deduplicating the interactions that it receives. However, the platform does not deduplicate interactions across different Managed Sources. If two running Managed Sources both happen to match the same interaction, you will receive that interaction twice.

If this becomes a problem, you can tighten your filters to include the source.id of the source you want to filter data from. For example:

source.id == "cc2edb6093b044b3a3c5c201e33df498" and
interaction.content contains_any "color, colour"