Filtering Twitter SPAM

To get the best results, it's important to filter for content that is relevant to your use case and to remove what you consider spam. Because the answer to 'What is spam?' is so use-case specific, it's impossible to write a generic spam filter for all applications. Instead, we'll look at ideas you can apply to your own streams.

You can combine any of these ideas to build your ideal spam filter.

The targets used here apply to original tweets (not retweets). You can easily extend the examples to cover retweeting users by using the equivalent targets within the twitter.retweet.user namespace.

Keeping Verified Users

One useful target for Twitter content is the twitter.user.verified target. If this target exists and its value is 1 (true), the user has been officially verified by Twitter as being who they say they are. This is extremely useful for brands and well-known personalities.

(twitter.user.verified exists and twitter.user.verified == 1)
// OR with other spam filter conditions

Recently Created Users

A common tactic used by spam bots is to create new accounts on-the-fly and tweet from these accounts until they are shut down. You can use the twitter.user.profile_age target to see how old the profile is (in days), and filter out content from recently created users.

// Remove Twitter users whose profile was created less than a day ago
twitter.user.profile_age > 1

Users That Create Little Content

As with profile age, a user who has hardly ever tweeted may have been created by a bot to deliver a small number of tweets before being discarded. We can use the twitter.user.statuses_count target in our filter.

// Remove users who have only ever posted 50 tweets or fewer
twitter.user.statuses_count > 50

Users With Few Followers

If a Twitter profile is created just to post spam messages, the profile is likely to follow lots of users but be followed by very few users itself. In that case the user's follower ratio (the number of users who follow the user, divided by the number of users they follow) will be low. This ratio is given by the twitter.user.follower_ratio target.

// Remove content from users with few followers 
// (relative to the number of users they follow)
twitter.user.follower_ratio >= 0.01
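The ratio described above can be illustrated with a short Python sketch. This is only an approximation of the idea, not how the platform computes the target: the function names are hypothetical, and the handling of a user who follows nobody is an assumption made here to avoid dividing by zero.

```python
def follower_ratio(followers: int, following: int) -> float:
    """Followers divided by the number of accounts followed."""
    if following == 0:
        # Assumption: an account that follows nobody is treated as having
        # an infinitely high ratio, so this check never flags it.
        return float("inf")
    return followers / following


def passes_ratio_check(followers: int, following: int,
                       threshold: float = 0.01) -> bool:
    """Mirror of the `>= 0.01` condition used in the filter above."""
    return follower_ratio(followers, following) >= threshold
```

For example, an account with 5 followers that follows 2,000 users has a ratio of 0.0025 and would be removed, while an account with 150 followers that follows 300 users has a ratio of 0.5 and would be kept.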

Users With Short Descriptions

If a Twitter profile has no description, again it could be from a bot or at least from a user who doesn't care about their public profile and has little concern for the quality of content they post. To filter for this condition we can use the twitter.user.description target and a regular expression.

// Remove content from users with no description, or whose description is shorter than 20 characters
twitter.user.description exists AND twitter.user.description regex_exact ".{20,}"
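Because regex_exact has to match the whole string, a pattern meaning "at least 20 characters" needs an open-ended quantifier. The sketch below uses Python's re.fullmatch as a stand-in for a whole-string match; the platform's regular expression engine may differ in details, and has_long_description is a hypothetical helper name.

```python
import re

# Open-ended quantifier: 20 or more characters.
AT_LEAST_20 = re.compile(r".{20,}")
# Fixed quantifier: exactly 20 characters, no more and no fewer.
EXACTLY_20 = re.compile(r".{20}")


def has_long_description(description):
    """True if a description exists and is at least 20 characters long."""
    if description is None:  # corresponds to the `exists` check
        return False
    # Note: in Python, `.` does not match newline characters by default.
    return AT_LEAST_20.fullmatch(description) is not None
```

With a whole-string match, the fixed pattern ".{20}" rejects a 35-character description, which is why the open-ended form is used for a minimum length.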

Bot-Friendly Content Sources

The interaction.source target tells you the application or service used to post content. Some services have a reputation for being 'friendly' to bots. If you know which services are being used to send spam you can filter these out:

// Remove content from bot-friendly services
not interaction.source contains_any "twittbot.net,EasyBotter,hellotxt.com,dlvr.it"

Large Numbers Of HashTags

Poor quality content tends to include many hashtags. Spam creators often add lots of hashtags in the hope of reaching as many users following those tags as possible; #win, for example, is a common choice. You can use the interaction.raw_content target to count the number of hashtags in a tweet.

// Remove content containing five or more hashtags
not interaction.raw_content regex_partial "(#\\w+[^#]+){5,}"
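To see what the hashtag pattern catches, here is a minimal Python sketch that applies the same expression with re.search, used here as a stand-in for a partial match. The helper name has_many_hashtags is hypothetical, and Python's regex engine may not behave identically to the platform's.

```python
import re

# Same pattern as in the filter; a raw string needs only a single backslash.
MANY_HASHTAGS = re.compile(r"(#\w+[^#]+){5,}")


def has_many_hashtags(text: str) -> bool:
    """True if the text contains a run of five or more hashtags."""
    return MANY_HASHTAGS.search(text) is not None
```

A tweet such as "Win now #win #free #prize #giveaway #contest today" is flagged, while "Launch day #python #release" is not.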

Short Content Length

Often users will write very short posts, such as '@username ok' in response to a question. This content has little value in analysis. You can use the interaction.content target to filter it out by length.

// Remove content with fewer than 20 characters
interaction.content exists AND interaction.content regex_exact ".{20,}"

Content Requesting Retweets & Follows

Spam created by competitions often asks users to retweet posts or to follow profiles to win prizes. You can filter this out simply by using the interaction.content target.

// Remove content requesting follows or retweets
not interaction.content contains_any "rt and follow,rt & follow,rt+follow,follow and rt,follow & rt,follow+rt"
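As noted at the start, these ideas can be combined into a single filter. The Python sketch below shows one possible combination: keep verified users unconditionally, and require everyone else to pass several of the checks. The Tweet class and its field names are hypothetical stand-ins for the targets discussed above, and the phrase check is a plain case-insensitive substring match, which is an assumption and may not match contains_any exactly.

```python
from dataclasses import dataclass

SPAM_PHRASES = (
    "rt and follow", "rt & follow", "rt+follow",
    "follow and rt", "follow & rt", "follow+rt",
)


@dataclass
class Tweet:
    verified: bool
    profile_age_days: int
    statuses_count: int
    follower_ratio: float
    content: str


def keep(tweet: Tweet) -> bool:
    """Verified users always pass; others must pass every check."""
    if tweet.verified:
        return True
    text = tweet.content.lower()
    return (
        tweet.profile_age_days > 1
        and tweet.statuses_count > 50
        and tweet.follower_ratio >= 0.01
        and len(tweet.content) >= 20
        and not any(phrase in text for phrase in SPAM_PHRASES)
    )
```

Which checks you combine, and whether you join them with AND or OR, depends entirely on what counts as spam for your use case.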