Addressing Twitter Spam Through Statistical Analysis

A brief update - top 3 things that can be done to help users weed out spam:

Make the block functionality more accessible - did you find it underneath the “Following” legend?
Provide basic stats about a user in the notification email - location, bio and some ratio information
Use backend monitoring/analysis to `killall -9` spammer accounts (block ratio, usage trends indicative of automation, etc)

As with any social network, spammers take advantage of the crowds that gather and interact with each other. Twitter is no different: numerous people have complained recently about mass follows from spam accounts. These accounts typically have a high following:follower ratio (they follow far more accounts than follow them back) and a low number of updates. There is even a site devoted to Twitter spam, twitterspam.com. There’s quite a bit of other information we can examine, but let’s tackle this in order of the two main types of spam I’ve come across.

The first is embodied in the @castlebaths account. Statistics that indicate this as a possible spam account:

<li><strong>20% of links</strong> in the first 20 updates are the same as the bio link</li>
<li>There are <strong>zero replies</strong> in the account (note: not unlike a new Twitter user)</li>
<li>There's an average of <strong>1.15 updates/follower</strong></li>
<li>The user's "Friends" account for <strong>95% of the aggregate friends and followers</strong></li>

Now this account may very well be legitimate, but I doubt many people want to follow somebody on Twitter who is simply hawking a product and not contributing much beyond that. An aggregate score built from these values would probably rank pretty high on the spam scorecard.
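To make the idea of an aggregate score concrete, here is a minimal sketch in Python. Everything in it is an assumption on my part: the `AccountStats` container, the function names and especially the weights are invented for illustration and have not been tuned against real data. The raw counts in the example are also made up, chosen only so that they reproduce the 95% friend share, 20% bio-link share and zero replies listed above.

```python
from dataclasses import dataclass


@dataclass
class AccountStats:
    """Hypothetical container for the per-account numbers discussed above."""
    friends: int             # accounts this user follows
    followers: int           # accounts following this user
    replies: int             # number of updates that are replies
    links_sampled: int       # links found in the first 20 updates
    links_matching_bio: int  # of those, how many equal the bio link


def friend_share(stats):
    """Friends as a share of (friends + followers); 0.95 means 95%."""
    total = stats.friends + stats.followers
    return stats.friends / total if total else 0.0


def bio_link_share(stats):
    """Share of sampled links that are identical to the bio link."""
    if not stats.links_sampled:
        return 0.0
    return stats.links_matching_bio / stats.links_sampled


def spam_score(stats):
    """Illustrative score between 0 and 1; the weights are arbitrary guesses."""
    score = 0.5 * friend_share(stats)
    score += 0.3 * bio_link_share(stats)
    if stats.replies == 0:
        score += 0.2  # no replies at all (also true of brand-new users)
    return round(score, 3)


# Made-up counts that reproduce the percentages from the list above.
example = AccountStats(friends=950, followers=50, replies=0,
                       links_sampled=20, links_matching_bio=4)
print(spam_score(example))
```

A weighted sum is only one way to combine these signals. Any real scoring would need thresholds chosen from actual account data, and the zero-replies signal in particular would flag plenty of legitimate new users.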

Let’s take a look at another account, @kendra2. This account is a little bit more difficult to identify as spam through the numbers:

<li><strong>5% of the urls</strong> in the first 20 updates are the same as the bio link (that's one url for those not counting)</li>
<li>This account has actually replied to people</li>
<li>There are <strong>only 14 updates</strong>, but</li>
<li>The user's "Friends" account for <strong>95% of the aggregate friends and followers</strong></li>

This is an interesting account since it seems to be an actual person trying to interact, but the bio link is the telltale sign here: videochatonline is a webcam site and @kendra2 is obviously trying to bring traffic to that site. The numbers do not clearly mark this as spam, but the last two statistics seem to indicate this account has been created solely for the purpose of driving traffic outside of Twitter. Other signs are the “pretty girl” avatar, the bio link to a commercial site and potentially similar profiles.

As a Twitter user, what other statistics can I use to identify spam that Twitter (or somebody else…) might be able to provide?

<li># of my friends that _also_ follow the account</li>
<li># of accounts without autofollow that are following the account</li>
<li># of inactive accounts being followed by the new user</li>
<li>Are consecutive accounts being followed?</li>
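Two of the statistics in this list are easy to sketch, assuming somebody hands us the relevant follow lists. The function names are hypothetical, and the second function assumes that account IDs are numeric and handed out in sequence, so that a run of consecutive IDs suggests an account walking through the user table rather than picking people to follow.

```python
def friends_also_following(my_friends, account_followers):
    """Count how many of my friends also follow the given account."""
    return len(set(my_friends) & set(account_followers))


def longest_consecutive_run(followed_ids):
    """Length of the longest run of consecutive numeric IDs in a follow list."""
    ids = sorted(set(followed_ids))
    best = run = 1 if ids else 0
    for prev, cur in zip(ids, ids[1:]):
        # Extend the run when the next ID is exactly one higher, else restart.
        run = run + 1 if cur == prev + 1 else 1
        best = max(best, run)
    return best
```

A low count from the first function and a long run from the second would both be hints, not proof. A popular account can have no overlap with my friends, and a short run of consecutive IDs can happen by chance.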

There are also a number of back-end statistics that Twitter can use, such as IP addresses shared across large numbers of accounts, clickstream rates and patterns, and other similarities across multiple accounts. Reporting spam isn’t always useful, but observing the (generally predictable) behavior of spammers and the interaction of users with those accounts is a step forward.
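The shared-IP check could look something like the sketch below. It assumes the back end can produce (account, IP address) pairs from its logs. The function names, the input shape and the `min_accounts` threshold are all my own placeholders, not anything Twitter is known to use.

```python
from collections import defaultdict


def accounts_per_ip(events):
    """Map each IP address to the set of accounts seen using it."""
    by_ip = defaultdict(set)
    for account, ip in events:
        by_ip[ip].add(account)
    return by_ip


def shared_ips(events, min_accounts=3):
    """Return only the IPs used by at least min_accounts distinct accounts."""
    return {ip: accounts
            for ip, accounts in accounts_per_ip(events).items()
            if len(accounts) >= min_accounts}


# Invented log entries using documentation-range IP addresses.
events = [
    ("alice", "203.0.113.5"),
    ("spam01", "203.0.113.9"),
    ("spam02", "203.0.113.9"),
    ("spam03", "203.0.113.9"),
    ("spam01", "203.0.113.9"),  # repeat logins do not inflate the count
]
flagged = shared_ips(events)
```

Shared IPs alone will produce false positives, since offices, schools and mobile carriers put many legitimate users behind one address, so this would have to be combined with the other signals.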

Is spam an easy problem? Obviously not, or we wouldn’t have blog, email, trackback, comment and postal spam. Will there be false positives? Sure. However, the numbers above can help both in automatically identifying spam accounts and in giving users enough information to make smart decisions, which should ease their frustration as well. Furnishing an easy means by which to report/block spam is also a necessary evil. Twitter has hummed along relatively under the spam radar until now, but it seems it has to accept that spammers will try to take advantage of its users. Giving users the power to identify and avoid spam through the use of statistics will hopefully make Twitter a fruitless venue for spammers.

Damon Cortesi's blog
