Here are my regex newbie questions:

  • How can I check if a string has 3 spam words? (for example: viagra, pills and shop)
  • How can I also detect variations of those spam words, like "v-iagra" or "v.iagra"? (one additional character)
+1  A: 

How can I check if a string has 3 spam words? (for example: viagra, pills and shop)

A regex to spot any one of those three words might look like this (Perl):

if ($string =~ /viagra|pills|shop/i) {    # /i also catches "Viagra" or "SHOP"
    # spam
}

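If you want to spot all three, a single alternation isn't enough, because it succeeds as soon as one word matches. Record which distinct words were seen instead:

my %seen;
while ($string =~ /(viagra|pills|shop)/gi) {
    $seen{lc $1} = 1;    # count each word once, however often it repeats
}
if (keys(%seen) >= 3) {
    # spam: all three words are present
}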
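
That said, a single pattern can also require all three words by chaining lookaheads. The following is only a sketch of that alternative; the helper name has_all_spam_words is made up for illustration, and the word list is the example from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if all three example words occur somewhere.
sub has_all_spam_words {
    my ($string) = @_;
    # Each (?=.*word) lookahead scans ahead from the start without consuming
    # any text, so the words may appear in any order.
    # /s lets . cross newlines, /i ignores case.
    return $string =~ /\A(?=.*viagra)(?=.*pills)(?=.*shop)/is ? 1 : 0;
}

print has_all_spam_words("Shop here for pills and Viagra"), "\n";  # prints 1
print has_all_spam_words("pills pills pills"), "\n";               # prints 0
```

Whether this reads better than the loop is a matter of taste; the loop is easier to extend when the word list grows.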

How can I also detect variations of those spam words, like "v-iagra" or "v.iagra"? (one additional character)

It's not so easy to do that with just a regex. You could try something like

 $string =~ s/\W//g;

to remove all non-word characters such as . and -, and then check the string using the test above. Note that this strips spaces too, so neighbouring words run together.
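
If stripping spaces is a problem, one possible alternative is to leave the string alone and make the pattern itself tolerant. The sketch below builds, for each word, a pattern that allows at most one non-letter character between adjacent letters. The names fuzzy_pattern and count_fuzzy_hits are invented for this example, and the approach is an assumption about what "one additional character" should cover, not a tested spam filter.

```perl
use strict;
use warnings;

# Build a case-insensitive pattern that tolerates at most one separator
# between each pair of adjacent letters, e.g. "v-iagra", "v.iagra", "vi_agra".
sub fuzzy_pattern {
    my ($word) = @_;
    # [\W_]? is one optional non-word character or underscore.
    my $body = join '[\W_]?', map { quotemeta($_) } split //, $word;
    return qr/$body/i;
}

my @patterns = map { fuzzy_pattern($_) } qw(viagra pills shop);

# Number of distinct example words found in $string, obfuscated or not.
sub count_fuzzy_hits {
    my ($string) = @_;
    return scalar grep { $string =~ $_ } @patterns;
}

print count_fuzzy_hits("buy v-iagra and p.ills at our s_hop"), "\n";  # prints 3
```

Because a space also counts as a separator here, text such as "sh op" would match "shop", so expect some false positives.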

Kinopiko
Don't forget \w includes underscores, so \W won't strip them. Vi_agra would still get through.
AmbroseChapel
+3  A: 

Regex doesn't seem like quite the right hammer for this particular nail. For your list, you can simply throw all of your blacklisted words into a sorted list of some kind, and scan each token against that list. Direct string operations are generally faster than invoking the regular expression engine du jour.
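
A minimal sketch of that token scan follows. It swaps the sorted list for a Perl hash, which gives the same membership test with less code; the helper name blacklisted_words and the tokenizing rules (split on whitespace, trim surrounding punctuation) are assumptions made for the example.

```perl
use strict;
use warnings;

# Example blacklist from the question, stored as hash keys for fast lookup.
my %blacklist = map { $_ => 1 } qw(viagra pills shop);

# Hypothetical helper: returns the distinct blacklisted words found in $text.
sub blacklisted_words {
    my ($text) = @_;
    my %found;
    for my $token (split /\s+/, lc $text) {
        $token =~ s/^[^a-z0-9]+|[^a-z0-9]+$//g;   # trim surrounding punctuation
        $found{$token} = 1 if $blacklist{$token};
    }
    return sort keys %found;
}

my @hits = blacklisted_words("Visit our SHOP: cheap pills, real viagra!");
print scalar(@hits), " blacklisted words: @hits\n";  # 3 blacklisted words: pills shop viagra
```

Unlike the substring regex, this only flags whole tokens, so "shopping" does not count as "shop".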

For your variations ("v-iagra" and the like), I'd remove all non-word characters (as @Kinopiko suggested) and then run the tokens past your blacklist again. If you're wary of things like "viiagra", I'd check out Aspell. It's a great library, and it looks like CPAN has a Perl binding.

Chris
What about substitutions like `\/iagra`?
Brad Gilbert
Locks only keep honest men honest. If someone is really dedicated, they're going to find a way to write what they want to write. That said, for something like `\/` for a `V`, the easiest (read: most straightforward) method would be to compile a list of those transformations manually, store them in a map (with entries like `["\/" -> "V"]`), and then run a rote string replace for each element of that map on every incoming token.
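
A rough sketch of such a map in Perl is shown here. Only the `\/` entry comes from this thread; the other entries and the helper name normalize are illustrative assumptions, and a real table would have to be maintained by hand.

```perl
use strict;
use warnings;

# Hypothetical, hand-maintained table of look-alike spellings.
my %lookalike = (
    '\/' => 'v',    # backslash + slash drawn as a "V"
    '@'  => 'a',    # assumed extra entry
    '0'  => 'o',    # assumed extra entry
);

# Apply every replacement to a token, then lowercase the result.
sub normalize {
    my ($token) = @_;
    # Handle longer sequences first so multi-character entries win.
    for my $from (sort { length($b) <=> length($a) } keys %lookalike) {
        $token =~ s/\Q$from\E/$lookalike{$from}/g;   # \Q...\E matches literally
    }
    return lc $token;
}

print normalize('\/i@gra'), "\n";   # prints "viagra"
```

The normalized token can then be checked against the blacklist as usual.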
Chris
A: 

Check this out:

http://wiki.spamihilator.com/doku.php?id=en%3Atutorials%3Aregex

If this helps, please don't forget to mark this as the answer.

Colour Blend