Hi all! :)

I'm writing an anti-spam/bad-words filter and I need, if possible,

to match (detect) only words formed from mixed characters, like fr1&nd$, and not plain words like friends.

Is this possible with regex?

Best regards!

A: 

You could build some regular expressions like the following:

\p{L}+[\d\p{S}]+\S*

This will match any sequence of one or more letters (\p{L}+, see Unicode character properties), followed by one or more digits or symbols ([\d\p{S}]+), followed by any remaining non-whitespace characters (\S*).

$str = 'fr1&nd$ and not friends';
// $match[0] receives the first mixed word found, here fr1&nd$
preg_match('/\p{L}+[\d\p{S}]+\S*/', $str, $match);
var_dump($match);
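To collect every mixed word in a string rather than only the first, the same pattern can be passed to preg_match_all. This is a minimal sketch; the helper name find_mixed_words is made up for illustration.

```php
<?php
// Hypothetical helper: returns every token matched by the pattern above.
function find_mixed_words(string $text): array
{
    // Letters, then at least one digit or symbol, then the rest of the token.
    preg_match_all('/\p{L}+[\d\p{S}]+\S*/', $text, $matches);
    return $matches[0];
}

var_dump(find_mixed_words('fr1&nd$ and not friends, h3llo'));
```

Keep in mind that the pattern needs a leading letter and that characters such as & and * are punctuation (\p{P}), not symbols (\p{S}), so a word like fr&nd on its own would not be caught without adding \p{P} to the character class.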
Gumbo
A: 

It is possible. The regex rules will not be very pretty, but you can match basically any pattern that you can describe using regex. The tricky part is describing it.

I would guess that you would have a bunch of regex rules to detect bad words like so:

To detect fr1&nd$, friends, fr*nd you can use a regex like:

/fr[1iI*][&eE]?nd[s$Sz]?/

Writing a rule like this for each word will catch every variant spelled with the characters listed in the brackets. Pick up a regex guide for more info.

(I'm assuming that for a bad-words filter you would want to catch friend as well as frie**; you may want to mask the bad word as well as all of its possible variants.)
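A filter built this way ends up as a list of hand-written patterns, one per word. The sketch below shows one possible shape for that list; the $rules table, its single spam rule, and the function name matched_rules are all invented for illustration.

```php
<?php
// Hypothetical rule list: one hand-written character-class pattern per word.
$rules = array(
    'spam' => '/[sS$5][pP][aA@4][mM]/',
);

// Returns the names of all rules whose pattern occurs somewhere in $text.
function matched_rules(array $rules, string $text): array
{
    $hits = array();
    foreach ($rules as $word => $pattern) {
        if (preg_match($pattern, $text) === 1) {
            $hits[] = $word;
        }
    }
    return $hits;
}

var_dump(matched_rules($rules, 'buy $p@m now'));
```

Each new word needs its own pattern, which is why the rule list gets long and ugly quickly.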

Kekoa
I got bored and did this once in Perl. The regexes do look pretty hideous, especially when you try to account for misspellings.
Chris Lutz
+6  A: 

Of course it's possible with regex! You're not asking to match nested parentheses! :P

But yes, this is the kind of thing regular expressions were built for. An example:

/\S*[^\w\s]+\S*/

This will match all of the following:

@ss
as$
a$s
@$s
a$$
@s$
@$$

It will not match this:

ass

Which I believe is what you want. How it works:

\S* matches 0 or more non-space characters. [^\w\s]+ matches only symbols (anything that is neither a word character nor whitespace) and needs 1 or more of them, so at least one symbol character is required. The final \S* again matches 0 or more non-space characters (symbols and letters).
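A quick way to check this behaviour in PHP is to run the pattern over a few samples. This is only a sketch; the helper name contains_symbol_word and the extra a55 sample are not from the answer.

```php
<?php
// Hypothetical helper wrapping the pattern above.
function contains_symbol_word(string $text): bool
{
    // True when some run of non-space characters contains at least one symbol.
    return preg_match('/\S*[^\w\s]+\S*/', $text) === 1;
}

foreach (array('@ss', 'a$s', 'a$$', 'ass', 'a55') as $sample) {
    echo $sample, ' => ', contains_symbol_word($sample) ? 'match' : 'no match', "\n";
}
```

The last two samples print no match: ass has no symbol at all, and the digits in a55 count as word characters (\w), so digit substitutions alone are not caught by this pattern.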

If I may be allowed to suggest a better strategy, in Perl you can store a regex in a variable. I don't know if you can do this in PHP, but if you can, you can construct a list of variables like so:

$a = qr/[aA@]/;  # regex that matches all a-like symbols
$b = qr/[bB]/;
$c = qr/[cC(]/;
# etc...

Or, as a PHP array of pattern fragments stored as strings:

$regex = array( 'a' => '[aA@]', 'b' => '[bB]', 'c' => '[cC(]', /* ... */ );

So that way, you can match "friend" in all its variations with:

/$f$r$i$e$n$d/

Or:

'/' . $regex['f'] . $regex['r'] . $regex['i'] . $regex['e'] . $regex['n'] . $regex['d'] . '/'

Granted, the second one looks unnecessarily verbose, but that's PHP for you. I think the second one is probably the best solution, since it stores them all in a hash, rather than all as separate variables, but I admit that the regex it produces is a bit ugly.
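Since PHP patterns are ordinary strings, the lookup-table idea can be wrapped in a function that takes a plain word and returns the pattern for it. The following is a rough sketch, not a tested filter: the $lookalikes table and the name word_to_pattern are invented here, and the substitutions listed are only examples. Using an alternation per letter instead of a character class lets one entry hold a multi-character look-alike such as \/\/ for w.

```php
<?php
// Hypothetical lookup table: each letter maps to the spellings accepted for it.
$lookalikes = array(
    'a' => array('a', '@', '4'),
    'e' => array('e', '&', '3'),
    'i' => array('i', '1', '!', '*'),
    's' => array('s', '$', '5', 'z'),
    'w' => array('w', '\\/\\/'), // multi-character look-alike: \/\/
);

// Builds a case-insensitive pattern that matches $word and its look-alike spellings.
function word_to_pattern(string $word, array $lookalikes): string
{
    $parts = array();
    foreach (str_split(strtolower($word)) as $char) {
        // Letters without a table entry only match themselves.
        $variants = $lookalikes[$char] ?? array($char);
        $quoted = array();
        foreach ($variants as $variant) {
            // Escape regex metacharacters and the / delimiter.
            $quoted[] = preg_quote($variant, '/');
        }
        $parts[] = '(?:' . implode('|', $quoted) . ')';
    }
    return '/' . implode('', $parts) . '/i';
}

$pattern = word_to_pattern('friends', $lookalikes);
var_dump(preg_match($pattern, 'my fr1&nd$ are here')); // int(1)
```

The generated pattern is long and ugly, but it never has to be read by hand, and adding a new look-alike only means editing the table.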

Chris Lutz
Awesome Regex + Explanation +1! Btw, Regex in PHP is stored in strings, so having variable permutations like you suggest is certainly possible.
St. John Johnson
Actually, it might be interesting to write that into a function. Pass in a normal word, and it would reply with the correct regex to detect that word. Only issue I could see is something like W = \/\/ or anything multi-character.
St. John Johnson
W = !(?:[wW]|\\/\\/)! (in my native Perl). It would be more difficult for things like W with multi-character matches, but certainly possible. A function could easily be written that goes through a string, character-by-character, and looks up a regex to match that character, and then assembles them all into one giant (horrible-looking) regex, which you can use to match that word. However, I don't use PHP often enough to do it. I might do it in Perl if the whim strikes me. Or whatever that expression is supposed to be.
Chris Lutz
A: 

Didn't test this thoroughly, but this should do it:

(\w+)*(?<=[^A-Za-z ])
dr Hannibal Lecter
This matches "a " (word followed by spaces).
Chris Lutz
My bad :) I've changed it, the extra space should do it.
dr Hannibal Lecter
I would go for tabs too, but this should work.
Chris Lutz