I'm looking for good spam and ham corpora (the plural of corpus). I want to use them with a Bayesian spam filter developed in-house.

The only one I've found so far is the TREC07 corpus, but it won't be very useful in my case: I want it for a forum-based website, and the TREC07 corpus was built from email messages and is therefore full of HTML tags. (Also, the ham and spam were not separated, unless I'm missing something.)

In case you're wondering why I don't create my own corpus: the website has not launched yet, hence the need for a corpus to start with. Once I have enough posts, I'll build my own corpus from them, since this works best for Bayesian filters (the filter then handles the specific kind of spam the site actually receives more effectively).
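For readers unfamiliar with the approach, a filter of this kind can be sketched roughly as follows. This is only an illustration of a naive Bayes classifier that is seeded from a labeled corpus and then retrained as real posts arrive; it is not the in-house implementation, and the class name, method names, and tokenizer are all hypothetical.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Hypothetical minimal naive Bayes spam filter (illustration only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase alphanumeric tokens are good enough here.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        # label must be "spam" or "ham"; can be called at any time,
        # so newly reported posts can be fed back into the counts.
        self.word_counts[label].update(self.tokenize(text))
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        tokens = self.tokenize(text)
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed class prior, so an empty class never yields log(0).
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                # Add-one smoothing; the extra +1 reserves mass for unseen words.
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = score
        # Turn the two log scores into P(spam | text), clamped to avoid overflow.
        diff = max(-700.0, min(700.0, log_scores["ham"] - log_scores["spam"]))
        return 1.0 / (1.0 + math.exp(diff))


# Seed with a starter corpus, then keep calling train() on real posts.
starter_filter = NaiveBayesFilter()
starter_filter.train("buy cheap pills now", "spam")
starter_filter.train("cheap pills online", "spam")
starter_filter.train("how do I parse html in python", "ham")
starter_filter.train("thanks for the python answer", "ham")
```

Because `train` only updates counts, switching from a borrowed corpus to the site's own posts is a matter of building a fresh instance and training it on the new data.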

Thanks for any suggestions.

+3  A: 

However, keep in mind that spam is, by nature, designed not to look like spam. It continuously evolves, and a filter trained on these corpora is likely to be useful mainly against the format of the spam that appears there. Spam in the wild may be very different.

Regarding HTML content in corpora, that's easy enough. Use an HTML parser and look only at the text blocks. A SAX parser can do this effortlessly by only responding to text events.
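As a rough sketch of that idea, assuming Python: the standard library's `html.parser.HTMLParser` is event-driven in the same spirit as a SAX parser, so a subclass can respond to text events and ignore the markup. The `TextExtractor` class and `extract_text` helper are hypothetical names chosen for illustration, and skipping `<script>` and `<style>` contents is an added assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text event: keep it unless we are inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces and collapse runs of whitespace.
        # Note: this also separates words split by inline tags.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flushes any buffered trailing text
    return parser.text()
```

Each message in an email corpus could be passed through `extract_text` before tokenizing, so the filter trains on the visible words rather than on tag names.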

Bob Aman
Thanks Bob, but actually I mentioned the problem with TREC in my question.
Waleed Eissa
This is for a forum, so presumably you're supplying users with an HTML form for entering messages. That gives you the opportunity to use anti-spam measures beyond just looking at the content. For instance: http://stackoverflow.com/questions/1558392/what-can-i-do-to-handle-bad-behavior-from-users-on-a-website/1560301#1560301
Bob Aman
Thanks a lot Bob, I think the SpamAssassin corpora should do the job. I plan to retrain the filter regularly, so it shouldn't be a problem if the spam evolves, because new spam will feed back into the filter. But as I indicated in the question, I have no posts right now to build a corpus from, hence the need for a corpus to start with.
Waleed Eissa
There's also the option of using an external service like Akismet on your forum: http://akismet.com/
Bob Aman
Oh, actually I already have post-reporting functionality, but thanks for noting this.
Waleed Eissa
I know about Akismet, but I didn't want the site to depend on a service from another site. I'm probably a little paranoid, but I find it better this way. Besides, a Bayesian spam filter running on your own server should be more effective if you train it on corpora built from the user posts you actually receive. Akismet uses its own corpora, so you won't get this.
Waleed Eissa
With these kinds of things, I'm inclined to disagree. Bayesian filters are hard to get right, and surprisingly easy to fool if you don't actually use words or if you use a lot of words that would normally appear on your site anyways. When it comes to securing websites against spam, I prefer to use content filters only as a last line of defense.
Bob Aman
Also, keep in mind that comment spam and email spam tend to serve different purposes and have different content. Training a comment spam filter against an email spam corpus may give you less-than-ideal results.
Bob Aman
Well, I do agree that Bayesian filters are not bullet-proof, but they work very well. Before I decided to use Bayesian filtering, I did extensive research on the different spam filtering methods. Bayesian filtering is not perfect, but it's still the best available option. What you're referring to is called Bayesian poisoning. I had the same concerns as you, but after some research it seemed that Bayesian poisoning is not really something to worry about. It would be hard to explain in this little space, but trust me, I did a lot of research before I ..
Waleed Eissa
.. decided to use Bayesian filtering. If it weren't so good, it wouldn't be used by Google (Gmail), Yahoo (Yahoo Mail) and other large corporations. Finally, I'm only using it as a first line of defense. As I indicated, I also have post-reporting functionality, because I realize Bayesian filtering is not completely effective. Still, it's sufficiently effective IMO.
Waleed Eissa
"Training a comment spam filter against an email spam corpus may give you less-than-ideal results." I know, but it seems hard to find corpora for comment spam. My only source of relief is that it's all temporary: I'm going to use my own corpora once I have enough posts. I think Akismet's corpora would be the best option for comment spam, but I don't think they're publicly available.
Waleed Eissa