I have recently been inspired to write spam filters in JavaScript, Greasemonkey-style, for several websites I use that are prone to spam (especially in comments). When considering my options about how to go about this, I realize I have several options, each with pros/cons. My goal for this question is to expand on this list I have created, and hopefully determine the best way of client-side spam filtering with JavaScript.

As for what makes a spam filter the "best", I would say these are the criteria:

  • Most accurate
  • Least vulnerable to attacks
  • Fastest
  • Most transparent

Also, please note that I am trying to filter content that already exists on websites that aren't mine, using Greasemonkey Userscripts. In other words, I can't prevent spam; I can only filter it.

Here is my attempt, so far, to compile a list of the various methods along with their shortcomings and benefits:


Rule-based filters:

What it does: "Grades" a message by assigning a point value to different criteria (i.e. all uppercase, all non-alphanumeric, etc.) Depending on the score, the message is discarded or kept.

Benefits:

  • Easy to implement
  • Mostly transparent

Shortcomings:

  • Transparent: it's usually easy to reverse-engineer the code, discover the rules, and thereby craft messages that won't be picked up
  • Hard to balance point values (false positives)
  • Can be slow; multiple rules have to be executed on each message, often using regular expressions
  • In a client-side environment, server interaction or user interaction is required to update the rules
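To make this concrete, here is a minimal sketch of a rule-based scorer. The rules, weights, and threshold here are hypothetical and would need tuning per site:

    // Minimal rule-based scorer; rules, weights, and threshold are hypothetical.
    var rules = [
        // All uppercase (and contains at least one letter)
        { weight: 3, test: function (msg) { return /[A-Z]/.test(msg) && msg === msg.toUpperCase(); } },
        // Contains a link
        { weight: 2, test: function (msg) { return /https?:\/\//i.test(msg); } },
        // Sample "bad word" rule
        { weight: 2, test: function (msg) { return /\b(viagra|casino)\b/i.test(msg); } }
    ];

    function isSpam(message, threshold) {
        var score = 0;
        for (var i = 0; i < rules.length; i++) {
            if (rules[i].test(message)) {
                score += rules[i].weight;
            }
        }
        return score >= threshold;   // e.g. isSpam(text, 4)
    }

Because the rules array sits in the userscript itself, anyone can read it, which is the transparency shortcoming noted above.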

Bayesian filtering:

What it does: Analyzes word frequency (or trigram frequency) and compares it against the data it has been trained with (a scoring sketch follows the lists below).

Benefits:

  • No need to craft rules
  • Fast (relatively)
  • Tougher to reverse engineer

Shortcomings:

  • Requires training to be effective
  • The trained data must still be accessible to JavaScript, usually in the form of human-readable JSON, XML, or a flat file
  • Data set can get pretty large
  • Poorly designed filters are easy to confuse with a good helping of common words to lower the spamacity rating
  • Words that haven't been seen before can't be accurately classified, sometimes resulting in incorrect classification of the entire message
  • In a client-side environment, server interaction or user interaction is required to update the rules
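For illustration, here is a rough sketch of the scoring half of a naive Bayes classifier. The counts table is hypothetical; in practice it would be produced by an offline training step and loaded from JSON, and the training code is not shown:

    // Scoring half of a naive Bayes filter; `counts` would come from training
    // and be loaded from JSON (the values here are made up).
    var counts = {
        spam: { total: 1000, words: { "viagra": 50, "free": 120 } },
        ham:  { total: 1000, words: { "thanks": 80, "question": 60 } }
    };

    function spamProbability(message) {
        var words = message.toLowerCase().match(/[a-z']+/g) || [];
        var logSpam = 0, logHam = 0;           // log space avoids underflow
        for (var i = 0; i < words.length; i++) {
            var w = words[i];
            // Add-one smoothing so unseen words don't zero out the product
            // (a real filter would add the vocabulary size to the denominator).
            var pSpam = ((counts.spam.words[w] || 0) + 1) / (counts.spam.total + 2);
            var pHam  = ((counts.ham.words[w]  || 0) + 1) / (counts.ham.total  + 2);
            logSpam += Math.log(pSpam);
            logHam  += Math.log(pHam);
        }
        // Equal priors assumed; convert back to P(spam | message).
        return 1 / (1 + Math.exp(logHam - logSpam));
    }

The smoothing step is what keeps never-before-seen words from breaking the classification outright, though they still contribute no useful signal.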

Bayesian filtering- server-side:

What it does: Applies Bayesian filtering server-side by submitting each message to a remote server for analysis (the client half is sketched after the lists below).

Benefits:

  • All the benefits of regular Bayesian filtering
  • Training data is not revealed to users/reverse engineers

Shortcomings:

  • Heavy traffic
  • Still vulnerable to uncommon words
  • Still vulnerable to adding common words to decrease spamacity
  • The service itself may be abused
  • It may be desirable to allow users to submit spam samples to train the classifier; attackers may abuse this submission channel as well
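On the client, this approach reduces to posting each message to a classification endpoint via GM_xmlhttpRequest. In this sketch the URL and the JSON reply format ({"spam": true/false}) are assumptions:

    // Client half of server-side filtering: POST the message text to a
    // hypothetical endpoint and hide the node if the reply says spam.
    function classifyRemotely(messageText, node) {
        GM_xmlhttpRequest({
            method: "POST",
            url: "http://example.com/classify",    // hypothetical endpoint
            headers: { "Content-Type": "application/x-www-form-urlencoded" },
            data: "text=" + encodeURIComponent(messageText),
            onload: function (response) {
                // Assumed reply format: {"spam": true} or {"spam": false}
                var result = JSON.parse(response.responseText);
                if (result.spam) {
                    node.style.display = "none";
                }
            }
        });
    }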

Blacklisting:

What it does: Applies a set of criteria to a message or some attribute of it. If one or more (or a specific number of) criteria match, the message is rejected. A lot like rule-based filtering, so see its description for details.
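A minimal sketch of one common variant, a link-domain blacklist check (the listed domains are placeholders):

    // Minimal blacklist check against link destinations; the domains listed
    // here are placeholders.
    var blacklistedDomains = ["badsite.ru", "badsite.cn"];

    function isBlacklisted(messageNode) {
        var links = messageNode.getElementsByTagName("a");
        for (var i = 0; i < links.length; i++) {
            for (var j = 0; j < blacklistedDomains.length; j++) {
                if (links[i].hostname.indexOf(blacklistedDomains[j]) !== -1) {
                    return true;
                }
            }
        }
        return false;
    }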

CAPTCHAs, and the like:

Not feasible for this type of application. I am trying to apply these methods to sites that already exist, using Greasemonkey; I can't start requiring CAPTCHAs in places where they didn't exist before someone installed my script.


Can anyone help me fill in the blanks? Thank you!

+1  A: 

There is no "best" way, especially for all users or all situations.

Keep it simple:

  1. Have the GM script initially hide all comments that contain links and maybe universally bad words (F*ck, Presbyterian, etc.). ;)
  2. Then the script contacts your server and lets the server judge each comment by X criteria (more on that, below).
  3. Show or hide comments based on the server's response. In the event of a timeout, show or hide based on a user preference setting ("What to do when the filter server is down? Show/hide comments with links."). See the sketch after this list.
  4. That's it for the GM script; the rest is handled by the server.
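A rough sketch of steps 1 through 3; the ".comment" selector, the server URL, and the plain-text "spam" reply are placeholders you'd adapt to the target site and your own server:

    // ".comment", the server URL, and the "spam" reply format are placeholders.
    var comments = document.querySelectorAll(".comment");

    Array.prototype.forEach.call(comments, function (node) {
        // Step 1: provisionally hide anything containing a link.
        if (node.getElementsByTagName("a").length > 0) {
            node.style.display = "none";
        }

        // Step 2: ask the filter server for a verdict.
        GM_xmlhttpRequest({
            method: "POST",
            url: "http://your-filter-server.example/judge",   // placeholder
            headers: { "Content-Type": "application/x-www-form-urlencoded" },
            data: "text=" + encodeURIComponent(node.textContent),
            onload: function (response) {
                // Step 3: show or hide based on the server's answer.
                node.style.display = (response.responseText === "spam") ? "none" : "";
            },
            onerror: function () {
                // Timeout/failure: fall back to the user's preference.
                node.style.display = GM_getValue("showOnTimeout", true) ? "" : "none";
            }
        });
    });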

As for the actual server/filtering criteria...
Most importantly: do not assume that you can guess what a user will want filtered! This will vary wildly from person to person, or even from mood to mood.

Set up the server to use a combination of bad words, bad link destinations (.ru and .cn domains, for example), and public spam-filtering services.
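A minimal sketch of such a judging function, in plain JavaScript so it could run under Node; the word and domain lists are hypothetical, and the public spam-service lookup is omitted:

    // Server-side judging sketch; word and domain lists are hypothetical,
    // and the public spam-service lookup is not shown.
    var badWords   = ["viagra", "casino"];
    var badDomains = [".ru", ".cn"];

    function judge(commentText) {
        var lower = commentText.toLowerCase();

        for (var i = 0; i < badWords.length; i++) {
            if (lower.indexOf(badWords[i]) !== -1) { return "spam"; }
        }

        // Extract link hosts and check them against the bad-domain suffixes.
        var links = lower.match(/https?:\/\/[^\/\s"'<>]+/g) || [];
        for (var j = 0; j < links.length; j++) {
            var host = links[j].replace(/^https?:\/\//, "");
            for (var k = 0; k < badDomains.length; k++) {
                if (host.slice(-badDomains[k].length) === badDomains[k]) {
                    return "spam";
                }
            }
        }
        return "ham";
    }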

The most important thing is to offer users some way to choose, and ideally adjust, what is applied for them.

Brock Adams
"There are no bad words" -- George Carlin
Stephen P