I have recently been inspired to write spam filters in JavaScript, Greasemonkey-style, for several websites I use that are prone to spam (especially in comments). While considering how to go about this, I realized I have several options, each with its own pros and cons. My goal for this question is to expand on the list I have created and, hopefully, determine the best approach to client-side spam filtering with JavaScript.
As for what makes a spam filter the "best", I would say these are the criteria:
- Most accurate
- Least vulnerable to attacks
- Fastest
- Most transparent
Also, please note that I am trying to filter content that already exists on websites that aren't mine, using Greasemonkey Userscripts. In other words, I can't prevent spam; I can only filter it.
Here is my attempt, so far, to compile a list of the various methods along with their shortcomings and benefits:
Rule-based filters:
What it does: "Grades" a message by assigning a point value to different criteria (i.e. all uppercase, all non-alphanumeric, etc.) Depending on the score, the message is discarded or kept.
Benefits:
- Easy to implement
- Mostly transparent
Shortcomings:
- Transparent: it's usually easy to reverse-engineer the code to discover the rules, and thereby craft messages that won't be picked up
- Hard to balance point values (false positives)
- Can be slow; multiple rules have to be executed on each message, often using regular expressions
- In a client-side environment, server interaction or user interaction is required to update the rules
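For illustration, here is a rough sketch of the kind of rule-based scoring I have in mind (the rules, point values, and threshold are all made up for the example):

```javascript
// Hypothetical rule-based scorer: each matching rule adds points, and a total
// at or above the threshold marks the message as spam.
var rules = [
  // No lowercase letters (shouting)
  { test: function (msg) { return msg === msg.toUpperCase(); }, points: 3 },
  // Contains a link
  { test: function (msg) { return /https?:\/\//i.test(msg); }, points: 2 },
  // Contains a known spammy word
  { test: function (msg) { return /\b(viagra|casino)\b/i.test(msg); }, points: 5 }
];

function isSpam(message) {
  var score = 0;
  for (var i = 0; i < rules.length; i++) {
    if (rules[i].test(message)) {
      score += rules[i].points;
    }
  }
  return score >= 5; // arbitrary threshold for this example
}
```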
Bayesian filtering:
What it does: Analyzes word frequency (or trigram frequency) and compares it against the data it has been trained with. A rough sketch of the classification step follows the list of shortcomings below.
Benefits:
- No need to craft rules
- Fast (relatively)
- Tougher to reverse engineer
Shortcomings:
- Requires training to be effective
- Trained data must still be accessible to JavaScript, usually in the form of human-readable JSON, XML, or a flat file
- Data set can get pretty large
- Poorly designed filters are easy to confuse by padding a message with common words to lower its spamacity rating
- Words that haven't been seen before can't be accurately classified, which can sometimes result in the entire message being classified incorrectly
- In a client-side environment, server interaction or user interaction is required to update the training data
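Here is a toy sketch of the classification step I have in mind, assuming the trained token counts have already been loaded; the `counts` structure, the add-one smoothing, and the equal-prior assumption are all simplifications for this example:

```javascript
// Toy naive Bayes classifier. `counts` would be loaded from pre-trained data,
// e.g. JSON of the form:
//   { spam: { "cheap": 42, ... }, ham: { "thanks": 97, ... },
//     spamTotal: 1234, hamTotal: 5678 }
function spamProbability(message, counts) {
  var words = message.toLowerCase().split(/\W+/);
  // Work in log space to avoid floating-point underflow on long messages.
  var logSpam = 0, logHam = 0;
  for (var i = 0; i < words.length; i++) {
    var w = words[i];
    if (!w) continue;
    // Add-one smoothing so unseen words don't zero out the whole product.
    var pSpam = ((counts.spam[w] || 0) + 1) / (counts.spamTotal + 1);
    var pHam  = ((counts.ham[w]  || 0) + 1) / (counts.hamTotal  + 1);
    logSpam += Math.log(pSpam);
    logHam  += Math.log(pHam);
  }
  // Equal priors assumed; convert back to a probability between 0 and 1.
  return 1 / (1 + Math.exp(logHam - logSpam));
}
```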
Bayesian filtering- server-side:
What it does: Applies Bayesian filtering server-side by submitting each message to a remote server for analysis; a sketch of the userscript side follows the list of shortcomings below.
Benefits:
- All the benefits of regular Bayesian filtering
- Training data is not revealed to users/reverse engineers
Shortcomings:
- Heavy traffic
- Still vulnerable to uncommon words
- Still vulnerable to adding common words to decrease spamacity
- The service itself may be abused
- To train the classifier, it may be desirable to allow users to submit spam samples for training; attackers may abuse this submission mechanism to poison the training data
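On the userscript side, the server-side variant would look roughly like this; `https://example.com/classify` is a hypothetical endpoint I would have to host myself, and the JSON response format is just an assumption for the example:

```javascript
// Sketch of asking a remote server to classify a message from a Greasemonkey script.
// GM_xmlhttpRequest must be granted in the userscript metadata.
function classifyRemotely(message, callback) {
  GM_xmlhttpRequest({
    method: "POST",
    url: "https://example.com/classify", // hypothetical service I would host
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    data: "message=" + encodeURIComponent(message),
    onload: function (response) {
      // Assumes the server answers with a JSON body like {"spam": true}
      var result = JSON.parse(response.responseText);
      callback(result.spam);
    },
    onerror: function () {
      callback(false); // fail open: keep the message if the server is unreachable
    }
  });
}
```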
Blacklisting:
What it does: Applies a set of criteria to a message or some attribute of it. If one or more (or a specific number of) criteria match, the message is rejected. It's a lot like rule-based filtering, so see its description for details.
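For completeness, a minimal sketch of what I mean by blacklisting (the list entries are placeholders):

```javascript
// Reject messages from known bad authors or containing known bad link domains.
var blacklistedUsers = ["spammer123"];          // example entries only
var blacklistedDomains = ["spam-site.example"];

function isBlacklisted(author, message) {
  if (blacklistedUsers.indexOf(author) !== -1) {
    return true;
  }
  for (var i = 0; i < blacklistedDomains.length; i++) {
    if (message.indexOf(blacklistedDomains[i]) !== -1) {
      return true;
    }
  }
  return false;
}
```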
CAPTCHAs and the like:
Not feasible for this type of application. I am trying to apply these methods to sites that already exist, using Greasemonkey; I can't start requiring CAPTCHAs in places where they didn't exist before someone installed my script.
Can anyone help me fill in the blanks? Thank you!