I'm looking for good spam and ham corpora (the plural of corpus) to use with a Bayesian spam filter developed in-house.
The only one I have found so far is the trec07 corpus, but it won't be very useful in my case: I want it for a forum-based website, and the trec07 corpus was created from email messages and is therefore full of HTML tags. (Also, the ham and spam did not appear to be separated; I hope I'm not missing something here.)
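To be clear, the tags themselves are not the main obstacle. As a rough sketch, assuming only the Python standard library (`strip_tags` and `_TextExtractor` are names I made up for illustration, not part of any corpus tooling), they could be removed along these lines:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        # Note: text inside <script>/<style> is kept as well in this sketch.
        self.chunks.append(data)


def strip_tags(html_text):
    """Return the text of html_text with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    print(strip_tags("<p>Buy <b>cheap</b> pills</p>"))
```

Even with the markup gone, what remains is still text written as email rather than as forum posts, which is why I'd prefer a corpus closer to my use case.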
In case you're wondering why I don't create my own corpus: the website has yet to launch, hence the need for a corpus to start with. Once I have enough posts, I'll build my own corpus from them, since Bayesian filters work best when trained on the specific kind of spam the site actually receives.
Thanks for any suggestions.