Here are my regex newbie questions:
- How can I check if a string has 3 spam words? (for example: viagra, pills and shop)
- How can I also detect variations of those spam words, like "v-iagra" or "v.iagra"? (one additional character)
How can I check if a string has 3 spam words? (for example: viagra, pills and shop)
A regex to spot any one of those three words might look like this (Perl):
    if ($string =~ /(viagra|pills|shop)/) {
        # spam
    }
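If you want to spot all three, a single alternation isn't really enough; you can count the matches instead (note that repeats of the same word are counted too):
    my $bad_words = 0;
    while ($string =~ /(viagra|pills|shop)/g) {
        $bad_words++;    # one more hit, whichever word it was
    }
    if ($bad_words >= 3) {
        # spam
    }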
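If "all three" is meant strictly, one option is to record which distinct words matched rather than how many hits there were. The following is only a sketch: the helper name `has_all_spam_words` is made up for illustration, and the `\b` word boundaries and case-insensitive matching are assumptions that go beyond the loop above.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if every word in @words occurs in $string.
# Assumes @words holds distinct, lower-case words.
sub has_all_spam_words {
    my ($string, @words) = @_;
    my $alternation = join '|', map { quotemeta($_) } @words;
    my %seen;
    while ($string =~ /\b($alternation)\b/gi) {
        $seen{ lc $1 } = 1;    # record each distinct word once
    }
    return keys(%seen) == @words;
}
```

With this version, "pills pills pills" is not flagged, and "shopping" does not count as "shop" because of the word boundaries.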
How can I also detect variations of those spam words, like "v-iagra" or "v.iagra"? (one additional character)
It's not so easy to do that with just a regex. You could try something like
    $string =~ s/\W//g;
to remove all non-word characters like . and -, and then check the string using the test above. This would strip spaces too, though, so neighbouring words run together. Note also that \W does not match underscores, so "v_iagra" would survive.
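An alternative that leaves the string untouched is to build the tolerance into the pattern itself, allowing an optional separator between the letters of each word. This is a rough sketch, not a tested spam filter: `fuzzy_pattern` and `count_fuzzy_hits` are hypothetical names, and the pattern is looser than "one additional character", since it accepts a separator in every gap ("v-i-a-g-r-a" matches too).

```perl
use strict;
use warnings;

# Hypothetical helper: turn "shop" into s[\W_]?h[\W_]?o[\W_]?p,
# i.e. allow at most one non-letter/digit character between letters.
sub fuzzy_pattern {
    my ($word) = @_;
    return join '[\W_]?', map { quotemeta($_) } split //, $word;
}

my @words   = qw(viagra pills shop);
my $pattern = join '|', map { fuzzy_pattern($_) } @words;
my $fuzzy   = qr/($pattern)/i;

# Count every hit, including obfuscated spellings.
sub count_fuzzy_hits {
    my ($string) = @_;
    my $hits = 0;
    $hits++ while $string =~ /$fuzzy/g;
    return $hits;
}
```

Because spaces are kept, this avoids the words-running-together problem, but it will not catch an inserted letter such as "viiagra".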
Regex doesn't seem like quite the right hammer for this particular nail. For your list, you can simply throw all of your blacklisted words into a sorted list (or a hash) and check each token against it. Direct string operations are usually cheaper than invoking the regular expression engine du jour.
For your variations ("v-iagra", et al.) I'd remove all non-word characters (as @Kinopiko suggested) and then run the tokens past your blacklist again. If you're wary of things like "viiagra", et cetera, I'd check out Aspell. It's a great library, and it looks like CPAN has a Perl binding.
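The token-against-blacklist idea could look roughly like this in Perl. It is a minimal sketch under some assumptions: tokens are split on whitespace, a hash stands in for the "sorted list of some kind", and `blacklisted_words` is a made-up helper name.

```perl
use strict;
use warnings;

# Blacklist stored as a hash so each lookup is a plain string comparison.
my %blacklist = map { $_ => 1 } qw(viagra pills shop);

# Hypothetical helper: return the distinct blacklisted words found in $text.
sub blacklisted_words {
    my ($text) = @_;
    my %found;
    for my $token (split ' ', lc $text) {
        (my $clean = $token) =~ s/[\W_]//g;    # "v-iagra" becomes "viagra"
        $found{$clean} = 1 if $blacklist{$clean};
    }
    return sort keys %found;
}

my @hits = blacklisted_words("Buy V-iagra and p.ills at our shop!");
print "spam\n" if @hits >= 3;
```

Since whole tokens are compared, "shopping" is not mistaken for "shop", and stripping punctuation per token avoids gluing separate words together.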
Check this out:
http://wiki.spamihilator.com/doku.php?id=en%3Atutorials%3Aregex
If this helps, please don't forget to mark this as the answer.