I have recently been inspired to write spam filters in JavaScript, Greasemonkey-style, for several websites I use that are prone to spam (especially in comments). While considering how to go about this, I realized I have several options, each with its own pros and cons. My goal for this question is to expand on the list I have created and, hopefully, determine the best approach to client-side spam filtering with JavaScript.
As for what makes a spam filter the "best", I would say these are the criteria:
- Most accurate
- Least vulnerable to attacks
- Fastest
- Most transparent
Also, please note that I am trying to filter content that already exists on websites that aren't mine, using Greasemonkey Userscripts. In other words, I can't prevent spam; I can only filter it.
Here is my attempt, so far, to compile a list of the various methods along with their shortcomings and benefits:
Rule-based filters:
What it does: "Grades" a message by assigning a point value to different criteria (i.e. all uppercase, all non-alphanumeric, etc.) Depending on the score, the message is discarded or kept.
Benefits:
- Easy to implement
- Mostly transparent
Shortcomings:
- Transparent: it's usually easy to reverse-engineer the code to discover the rules, and thereby craft messages that won't be picked up
- Hard to balance point values (false positives)
- Can be slow; multiple rules have to be executed on each message, often using regular expressions
- In a client-side environment, server interaction or user interaction is required to update the rules
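For illustration, here is a rough sketch of the kind of rule-based scoring I have in mind (the rules, point values, and threshold are all made up for the example):

```javascript
// Hypothetical rule-based scorer: each matching rule adds points, and a total
// at or above the threshold marks the message as spam.
var rules = [
  // No lowercase letters (shouting)
  { test: function (msg) { return msg === msg.toUpperCase(); }, points: 3 },
  // Contains a link
  { test: function (msg) { return /https?:\/\//i.test(msg); }, points: 2 },
  // Contains a known spammy word
  { test: function (msg) { return /\b(viagra|casino)\b/i.test(msg); }, points: 5 }
];

function isSpam(message) {
  var score = 0;
  for (var i = 0; i < rules.length; i++) {
    if (rules[i].test(message)) {
      score += rules[i].points;
    }
  }
  return score >= 5; // arbitrary threshold for this example
}
```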
Bayesian filtering:
What it does: Analyzes word frequency (or trigram frequency) and compares it against the data it has been trained with. A rough sketch of the classification step follows the list of shortcomings below.
Benefits:
- No need to craft rules
- Fast (relatively)
- Tougher to reverse engineer
Shortcomings:
- Requires training to be effective
- Trained data must still be accessible to JavaScript, usually in the form of human-readable JSON, XML, or a flat file
- Data set can get pretty large
- Poorly designed filters are easy to confuse by padding a message with common words to lower its spamacity rating
- Words that haven't been seen before can't be accurately classified, which can sometimes result in the entire message being classified incorrectly
- In a client-side environment, server interaction or user interaction is required to update the training data
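Here is a toy sketch of the classification step I have in mind, assuming the trained token counts have already been loaded; the `counts` structure, the add-one smoothing, and the equal-prior assumption are all simplifications for this example:

```javascript
// Toy naive Bayes classifier. `counts` would be loaded from pre-trained data,
// e.g. JSON of the form:
//   { spam: { "cheap": 42, ... }, ham: { "thanks": 97, ... },
//     spamTotal: 1234, hamTotal: 5678 }
function spamProbability(message, counts) {
  var words = message.toLowerCase().split(/\W+/);
  // Work in log space to avoid floating-point underflow on long messages.
  var logSpam = 0, logHam = 0;
  for (var i = 0; i < words.length; i++) {
    var w = words[i];
    if (!w) continue;
    // Add-one smoothing so unseen words don't zero out the whole product.
    var pSpam = ((counts.spam[w] || 0) + 1) / (counts.spamTotal + 1);
    var pHam  = ((counts.ham[w]  || 0) + 1) / (counts.hamTotal  + 1);
    logSpam += Math.log(pSpam);
    logHam  += Math.log(pHam);
  }
  // Equal priors assumed; convert back to a probability between 0 and 1.
  return 1 / (1 + Math.exp(logHam - logSpam));
}
```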
Bayesian filtering- server-side:
What it does: Applies Bayesian filtering server-side by submitting each message to a remote server for analysis; a sketch of the userscript side follows the list of shortcomings below.
Benefits:
- All the benefits of regular Bayesian filtering
- Training data is not revealed to users/reverse engineers
Shortcomings:
- Heavy traffic
- Still vulnerable to uncommon words
- Still vulnerable to adding common words to decrease spamacity
- The service itself may be abused
- To train the classifier, it may be desirable to allow users to submit spam samples for training; attackers may abuse this submission mechanism to poison the training data
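On the userscript side, the server-side variant would look roughly like this; `https://example.com/classify` is a hypothetical endpoint I would have to host myself, and the JSON response format is just an assumption for the example:

```javascript
// Sketch of asking a remote server to classify a message from a Greasemonkey script.
// GM_xmlhttpRequest must be granted in the userscript metadata.
function classifyRemotely(message, callback) {
  GM_xmlhttpRequest({
    method: "POST",
    url: "https://example.com/classify", // hypothetical service I would host
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    data: "message=" + encodeURIComponent(message),
    onload: function (response) {
      // Assumes the server answers with a JSON body like {"spam": true}
      var result = JSON.parse(response.responseText);
      callback(result.spam);
    },
    onerror: function () {
      callback(false); // fail open: keep the message if the server is unreachable
    }
  });
}
```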
Blacklisting:
What it does: Applies a set of criteria to a message or some attribute of it. If one or more (or a specific number of) criteria match, the message is rejected. It's a lot like rule-based filtering, so see its description for details.
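For completeness, a minimal sketch of what I mean by blacklisting (the list entries are placeholders):

```javascript
// Reject messages from known bad authors or containing known bad link domains.
var blacklistedUsers = ["spammer123"];          // example entries only
var blacklistedDomains = ["spam-site.example"];

function isBlacklisted(author, message) {
  if (blacklistedUsers.indexOf(author) !== -1) {
    return true;
  }
  for (var i = 0; i < blacklistedDomains.length; i++) {
    if (message.indexOf(blacklistedDomains[i]) !== -1) {
      return true;
    }
  }
  return false;
}
```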
CAPTCHAs and the like:
Not feasible for this type of application. I am trying to apply these methods to sites that already exist, using Greasemonkey; I can't start requiring CAPTCHAs in places where they didn't exist before someone installed my script.
Can anyone help me fill in the blanks? Thank you!