I work on a social media monitoring system. We don't crawl the web ourselves; we get feeds from aggregators like Spinn3r. In most cases, the "blogs" that are nothing but pages of links to porn sites are filtered out upstream, but we'd like something in-house that we can retrain on a shorter time frame than waiting for upstream providers to make changes.
I looked at SpamAssassin, and it would be ideal for our purposes if we were dealing with email. Is there any library out there that can take just a body of text and give it a quality score based on things like word frequencies, number of links, hidden background text, and so on?
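To make the kind of score I mean concrete, here is a rough sketch of a hand-rolled heuristic scorer. Everything in it is an assumption for illustration: the class name `TextQualityScorer`, the tiny spam word list, the regexes, and the weights are all made up and untuned, and a trainable version would learn the word weights from labeled posts instead of hard-coding them.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TextQualityScorer {
    // Hypothetical vocabulary; a real list would be learned from labeled data.
    private static final Set<String> SPAM_WORDS = Set.of("porn", "xxx", "viagra", "casino");

    // Crude patterns: anchor tags, inline CSS that hides text, any tag, and word tokens.
    private static final Pattern LINK =
            Pattern.compile("<a\\s[^>]*href", Pattern.CASE_INSENSITIVE);
    private static final Pattern HIDDEN =
            Pattern.compile("display\\s*:\\s*none|visibility\\s*:\\s*hidden", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{Nd}']+");

    private TextQualityScorer() {}

    private static int count(Pattern p, String s) {
        Matcher m = p.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    /** Returns a score in [0, 1]; higher means more likely to be real content. */
    static double score(String html) {
        int links = count(LINK, html);
        boolean hasHiddenText = HIDDEN.matcher(html).find();

        // Strip tags so only the visible-ish text is tokenized.
        String text = TAG.matcher(html).replaceAll(" ");
        int words = 0;
        int spamWords = 0;
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words++;
            if (SPAM_WORDS.contains(m.group().toLowerCase(Locale.ROOT))) {
                spamWords++;
            }
        }
        if (words == 0) {
            return 0.0; // nothing to judge; treated here as lowest quality
        }

        // Arbitrary, untuned weights: link density, spam-word frequency, hidden text.
        double linkPenalty = Math.min(1.0, 10.0 * links / words);
        double spamPenalty = Math.min(1.0, 5.0 * spamWords / words);
        double hiddenPenalty = hasHiddenText ? 1.0 : 0.0;
        double penalty = 0.4 * linkPenalty + 0.4 * spamPenalty + 0.2 * hiddenPenalty;

        return Math.max(0.0, 1.0 - penalty);
    }
}
```

Calling `TextQualityScorer.score(body)` on each incoming post and dropping anything below some threshold is the workflow I have in mind; what I'm really after is a library that does this properly and can be retrained.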
Ideally, I'm looking for something in Java, but if there's nothing there, I'd be okay with a client-server setup or with embedding a JRuby or Jython library.
I think I'm going to end up having to build it myself, but it's always worth asking first.