I have around 20 active blogs that get quite a bit of spam. Since I hate CAPTCHAs, the alternative is very smart spam filtering. I want to build a simple, REST-style spam-checking service that all my blogs would use. That way I can consolidate IP blocks and offload spam detection to third parties such as Akismet, Mollom, and Defensio, and at some point write my own spam detection so I can really get my head into some very interesting spam detection algorithms.

My language of choice is PHP. I consider myself quite proficient in it, and I can dig in deep and come out with a solution. Still, I feel this project would be a good exercise for learning another language. The big two that come to mind are Python and Ruby on Rails, as everyone talks about them like it's the second coming of our savior. Since this is mostly just an API with no admin or public-facing interface, basic Python running a simple HTTP server seems like the way to go. Am I missing anything? What would you, the great community, recommend? I would love to hear your language, book, and best-practice recommendations.

This has to scale, and I want to write it with that in mind. Right now I'd probably be able to use the third parties' free plans, but soon enough I'd have to expand the whole thing so it can actually think on its own. For now I think I'll just store everything in a MySQL database until I can do some real analysis on it. Thanks!

+9  A: 

My first question - why don't you just use one of those three services you listed? It seems they do exactly what you want. Sorry for being cynical, but I doubt that you working alone could in a reasonable amount of time beat the software engineers designing the algorithms used at those websites, especially considering their source of income depends on how well they do it.

Then again, you might just be smarter than they are =P. I'm not one to judge. In any case, I'd recommend Python, for the reasons you stated: you won't need a fancy public interface, so Python's lack of excellence in this area won't matter. Python is also good for text processing, and it has great built-in bindings for databases (sqlite, for example; you can, of course, install MySQL if you feel it is necessary).
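
As a rough illustration of the built-in database bindings mentioned above, here is a minimal sketch using the standard-library `sqlite3` module to log submissions and count confirmed spam per IP. The table layout and the function names (`open_store`, `record_submission`, `spam_count_by_ip`) are assumptions made up for this example, not anything from the question.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the submissions store. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS submissions ("
        " id INTEGER PRIMARY KEY,"
        " blog TEXT NOT NULL,"
        " ip TEXT NOT NULL,"
        " body TEXT NOT NULL,"
        " is_spam INTEGER)"  # NULL means no verdict yet
    )
    return conn


def record_submission(conn, blog, ip, body, is_spam=None):
    """Store one comment; is_spam may be True, False, or None (unknown)."""
    verdict = None if is_spam is None else int(is_spam)
    cur = conn.execute(
        "INSERT INTO submissions (blog, ip, body, is_spam) VALUES (?, ?, ?, ?)",
        (blog, ip, body, verdict),
    )
    conn.commit()
    return cur.lastrowid


def spam_count_by_ip(conn):
    """Return (ip, spam_count) pairs, worst offenders first."""
    rows = conn.execute(
        "SELECT ip, COUNT(*) FROM submissions"
        " WHERE is_spam = 1 GROUP BY ip"
        " ORDER BY COUNT(*) DESC, ip"
    )
    return rows.fetchall()
```

Parameterized queries (`?` placeholders) keep comment text from being interpreted as SQL; the same pattern carries over to a MySQL driver if the data outgrows sqlite.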

Downsides: it might get a bit slow, depending on how sophisticated your algorithms get.

Claudiu
Short answer: because I want to avoid setting up and depending on a third-party service. The end goal is to have thousands of installs, so when it's cost-effective to develop it further, I won't be a man vs. mountain :)
smazurov
fair enough! also it appears those wouldn't be free if you use it that much.
Claudiu
I second Claudiu's concerns, even though the idea of a unified API for all services is sort of appealing. ;)
Till
+2  A: 

Python has some advantages.

  1. There are several HTTP server frameworks in Python. Look at the WSGI reference implementation, and learn how to use the WSGI standard to handle web requests. It's very clean and extensible. It takes a little bit of study to see that WSGI is all about adding details to the request until you reach a stage in the processing where it's time to formulate a reply.

  2. MIME email parsing is pretty straightforward.

  3. After that, you'll be using site blacklisting and content filtering for your spam detection.

    • A site blacklist can be a big, fancy RDBMS, or it can be a simple pickled Python set of domain names and IP addresses. I recommend a simple pickled set object that lives in memory. It's fast. You can have your RESTful service reload this set from a source file on receipt of some GET request that forces a refresh.

    • Text filtering is just hard. I'd start with SpamBayes.
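
The WSGI and pickled-blacklist points above can be sketched together roughly as follows. This is only an illustration under assumptions: the `/check` and `/reload` paths, the `ip` and `domain` query parameters, and the helper names are invented for the example, and only standard-library modules (`wsgiref`, `pickle`, `json`) are used.

```python
import json
import pickle
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server


def save_blacklist(entries, path):
    """Write the blacklist (domains and IPs) to disk as a pickled set."""
    with open(path, "wb") as fh:
        pickle.dump(set(entries), fh)


def load_blacklist(path):
    """Read the pickled set back. Only unpickle files you created yourself."""
    with open(path, "rb") as fh:
        return set(pickle.load(fh))


def make_app(blacklist_path):
    """Build a WSGI app that keeps the blacklist in memory."""
    state = {"blacklist": load_blacklist(blacklist_path)}

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "")
        query = parse_qs(environ.get("QUERY_STRING", ""))
        status = "200 OK"
        if path == "/reload":
            # A GET on this (hypothetical) path forces a refresh from disk.
            state["blacklist"] = load_blacklist(blacklist_path)
            payload = {"entries": len(state["blacklist"])}
        elif path == "/check":
            ip = query.get("ip", [""])[0]
            domain = query.get("domain", [""])[0].lower()
            blocked = ip in state["blacklist"] or domain in state["blacklist"]
            payload = {"blocked": blocked}
        else:
            status = "404 Not Found"
            payload = {"error": "not found"}
        body = json.dumps(payload).encode("utf-8")
        start_response(status, [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    return app


def serve(blacklist_path, port=8000):
    """Run the app on the reference WSGI server (fine for experiments)."""
    with make_server("", port, make_app(blacklist_path)) as httpd:
        httpd.serve_forever()
```

With the server running, a request such as `GET /check?ip=10.0.0.1` would return `{"blocked": true}` if that address is in the set. Set membership tests are constant-time on average, which is why the in-memory set stays fast as the list grows.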

S.Lott
Although the SpamBayes scripts are centred around email filtering, the tokenisation code is easily adapted to other text-classification, and the classifier can generally be left unchanged. There's an example in the source distribution that demonstrates using the SpamBayes engine as a filtering proxy, which is a similar task to this.
Tony Meyer
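
To make the tokenise-then-classify idea concrete, here is a small self-contained sketch of a token-presence naive Bayes scorer. It is not the SpamBayes API or its algorithm, just a simplified stand-in; the `tokenize` function and `NaiveBayes` class are hypothetical names for this example.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


class NaiveBayes:
    """Simplified naive Bayes over token presence (absent tokens ignored)."""

    LABELS = ("spam", "ham")

    def __init__(self):
        # Number of training documents per label that contain each token.
        self.token_docs = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.token_docs[label].update(set(tokenize(text)))

    def spam_probability(self, text):
        total_docs = sum(self.doc_counts.values())
        tokens = set(tokenize(text))
        scores = {}
        for label in self.LABELS:
            n_docs = self.doc_counts[label]
            # Smoothed log prior plus smoothed log likelihood per token.
            score = math.log((n_docs + 1) / (total_docs + 2))
            for tok in tokens:
                score += math.log((self.token_docs[label][tok] + 1) / (n_docs + 2))
            scores[label] = score
        # Turn the two log scores into P(spam); clamp to avoid overflow.
        diff = max(-700.0, min(700.0, scores["ham"] - scores["spam"]))
        return 1.0 / (1.0 + math.exp(diff))
```

Adapting this to blog comments mostly means changing `tokenize`, for example emitting extra tokens for URLs or the author field, while the scoring code stays the same. That mirrors the point about leaving the classifier unchanged.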
+1  A: 

I humbly recommend Lua, not only because it's a great, fast language, already integrated with web servers, but also because you can then exploit OSBF-Lua, an existing spam filter that has won spam-filtering competitions for several years in a row. Fidelis Assis and I have put in a lot of work trying to generalize the model beyond email, and we'd be delighted to work with you on integrating it with your app, which is what Lua was designed for.

As for scaling, in training mode we process hundreds of emails per second on a 2006 machine, so that should work out pretty well even for a busy web site.

We'd need to work with you on classifying stuff without mail headers, but I've been pushing in that direction already. For more info please write [email protected]. (Yes, I want people to send me spam. It's for research!)

Norman Ramsey
+1  A: 

I'd have to recommend Akismet for its ease of use and high accuracy. With only a WordPress.com API key and an API call, you can determine whether a given blob of text from a user is spammy. I've been using the Akismet plugin for WordPress, which uses the same API, and have had stellar results with it for the last year or so.
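
For a sense of what that API call looks like, here is a hedged sketch in Python using only the standard library. The endpoint shape and parameter names reflect my understanding of Akismet's `comment-check` call and should be verified against the current Akismet documentation; the function names and the injectable `opener` argument are assumptions for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_comment_check(api_key, blog_url, user_ip, user_agent, content):
    """Build (but do not send) a comment-check POST request."""
    # Assumed endpoint form: the API key is used as a subdomain.
    url = "https://%s.rest.akismet.com/1.1/comment-check" % api_key
    data = urlencode({
        "blog": blog_url,
        "user_ip": user_ip,
        "user_agent": user_agent,
        "comment_content": content,
    }).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=data, headers=headers)


def is_spam(request, opener=urlopen):
    """Send the request; the service is assumed to answer 'true' or 'false'."""
    with opener(request) as response:
        return response.read().decode("utf-8").strip() == "true"
```

Keeping request construction separate from sending makes it easy to swap in another provider behind the same function, or to test without network access by passing a fake `opener`.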

Zend Framework has a great Akismet PHP class you can use independent of the rest of the framework, which should make integration pretty straightforward. Documentation is quite thorough, as well.

Collin Allen