WordPress has a spam-filtering plugin called Akismet that seems able to classify any block of text as spam or not. The caveat is that you have to go through their interface; their database and algorithm are not open source or otherwise readily available.

There are also commercial providers that offer a web-accessible API for classifying emails, comments, or any other text submitted by users of your web application.

Is there any sort of open source or freely accessible database that can classify a block of text as spam/non-spam?

Edit: Here's a clearer explanation of what I want

Basically I was hoping that there was an extensive database out there with the probabilities of certain phrases being spam. Since (I'm assuming) spammers spam all email addresses equally, by pre-populating my Bayesian spam filter with this database, I could create an application that starts off by capturing most spam without any user training.
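
To make the idea concrete, here is a minimal sketch of what "pre-populating" a Bayesian filter could look like. The seed table and its numbers are invented for illustration (no such published database is assumed to exist), and the names `SEED_TOKEN_SPAM_PROB`, `tokenize`, and `spam_probability` are hypothetical. It assumes equal priors for spam and non-spam and treats tokens as independent.

```python
import math
import re

# Hypothetical seed table: estimated P(spam | token).
# These values are invented for illustration, not taken from any real database.
SEED_TOKEN_SPAM_PROB = {
    "viagra": 0.99,
    "casino": 0.95,
    "free": 0.80,
    "click": 0.75,
    "meeting": 0.10,
    "thanks": 0.15,
}

# Tokens missing from the table contribute no evidence either way.
UNKNOWN_TOKEN_PROB = 0.5


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def spam_probability(text, table=SEED_TOKEN_SPAM_PROB):
    """Combine per-token probabilities, assuming independence and equal priors."""
    log_odds = 0.0
    for token in set(tokenize(text)):  # count each distinct token once
        p = table.get(token, UNKNOWN_TOKEN_PROB)
        log_odds += math.log(p) - math.log(1.0 - p)
    # Convert the summed log-odds back into a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    print(spam_probability("Click here for free viagra"))
    print(spam_probability("Thanks for the meeting"))
```

With a table like this the filter would score text from the first request, and user training would then only adjust the seeded values rather than build them from nothing.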

+1  A: 

Update based on comment:

I don't think a simple database would do the trick. Most spam is algorithmically generated (e.g. comment spam usually incorporates content from the post). Akismet does a combination of things, probably including link analysis and use of known spam signatures, but they don't publish it.

I've read about some interesting AI projects that classify good content rather than bad. You might also look at Spam Karma, which analyzes blog comments based on a variety of spammy triggers (a response posted immediately after the page loads, etc.).


Original answer (DNS blacklists):

Jon Galloway
I'm looking more for a database that can classify a block of text as spam or not. Akismet (a WordPress plugin), for instance, can classify any comment as spam or not.
Praveen Angyan
As stated by Jon, a database isn't very useful for classification. Akismet mimics the procedural generation used to create spam rather than checking it against a database.
JoshJordan
Thanks for those links. While there are lots of algorithms out there for classifying spam, a good database of spam signatures is VERY valuable. I was hoping that someone like WordPress or Google had published their spam signatures as a free database. Unlikely, I know. But a man can dream, right?
Praveen Angyan
+1  A: 

Probably not exactly what you're looking for, but the MoinMoin Wiki maintainers keep a central list of Wiki spam regular expressions here: http://master.moinmo.in/BadContent
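
A rough sketch of how a list like that could be applied to submitted text. The patterns below are made-up placeholders, not entries from the real list, and the one-regex-per-line layout with `#` comment lines is an assumption about the format. `compile_patterns` and `first_match` are hypothetical helper names.

```python
import re

# Placeholder patterns standing in for entries from a shared blacklist.
# They are invented for illustration and do not come from the real list.
EXAMPLE_PATTERNS = [
    r"cheap-pills\.example",
    r"casino[0-9]*\.example",
]


def compile_patterns(lines):
    """Compile one regex per line, skipping blanks, '#' comments, and bad patterns."""
    compiled = []
    for line in lines:
        pattern = line.strip()
        if not pattern or pattern.startswith("#"):
            continue
        try:
            compiled.append(re.compile(pattern, re.IGNORECASE))
        except re.error:
            # A single malformed entry should not break the whole filter.
            continue
    return compiled


def first_match(text, compiled):
    """Return the first pattern that matches the text, or None if nothing matches."""
    for regex in compiled:
        if regex.search(text):
            return regex.pattern
    return None


if __name__ == "__main__":
    rules = compile_patterns(EXAMPLE_PATTERNS)
    print(first_match("Visit http://casino77.example now", rules))
    print(first_match("An ordinary comment", rules))
```

Returning the matching pattern rather than a bare True/False makes it easier to see which rule fired when a legitimate post gets blocked.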

RichieHindle