ansaurus

Question

PHP - detecting non-English letters and filtering input

Answer 1

A:

Hmm, personally I don't find a spam filter like yours very effective. IMO it is much better to watch for links, strong language, and sexual or warez-related words, since spam often contains them. You could restrict commenting to registered users, and hold comments from untrusted sources (that is, unregistered users) for moderation so you can delete them before they show up.

erenon 2009-11-21 19:37:52

Perhaps I wasn't clear. This is only a part of my spam-detection mechanism. I appreciate your comment, but it does not help one bit (;

sombe 2009-11-21 19:42:56

@Gal: Whatever your intent, I don't think filtering messages by vowel count is a sound approach. Have you noticed the word "msg"?

erenon 2009-11-21 19:47:19

Answer 2

+1 A:

$pattern = '/[aeiouéáíúó]/';

Use the u modifier to get Unicode-aware regex and that should work, assuming you're working with UTF-8 strings throughout your app, which you should be really.
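As a minimal sketch of what the u modifier changes, the hypothetical helper below (count_vowels is not from the original post) counts vowels in a UTF-8 string. It assumes the source file itself is saved as UTF-8. Without the u flag, PCRE builds the character class from individual bytes, so a letter like "ñ" that merely shares a lead byte with "é" would be counted by mistake.

```php
<?php
// Hypothetical helper: counts vowels in a UTF-8 string.
function count_vowels(string $text): int
{
    // The trailing "u" makes PCRE treat pattern and subject as UTF-8,
    // so "é" is one character rather than two bytes. "i" ignores case.
    $count = preg_match_all('/[aeiouéáíúó]/iu', $text);

    // preg_match_all() returns false on error; treat that as zero vowels.
    return $count === false ? 0 : $count;
}
```

For example, count_vowels('café') counts both the "a" and the "é", while count_vowels('msg') finds nothing.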

For non-Latin alphabets like Russian and Hebrew, is there a way to detect which language the content belongs to and apply an appropriate spam-filtering mechanism?

Basic Russian is found in Unicode range U+0400–U+04FF; vowels are аэыуояеёюи. Hebrew is in range U+0590–U+05FF and doesn't use vowels in the same way. I don't think detecting vowels is terribly useful... you might have more luck with a simple dictionary covering many languages, as long as you stick to languages that have clear word boundaries. Not much use for Chinese.
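One rough way to act on those ranges is to let PCRE's Unicode script properties do the classification instead of hard-coding code points. The sketch below is only an illustration: dominant_script is a hypothetical helper, the list of scripts is an assumption, and it identifies the writing system, not the language (Cyrillic text could be Russian, Ukrainian, Bulgarian, and so on).

```php
<?php
// Hypothetical helper: reports which script dominates a UTF-8 string.
function dominant_script(string $text): string
{
    // Assumption: only these three scripts matter for this sketch.
    $scripts = ['Latin', 'Cyrillic', 'Hebrew'];
    $best = 'Unknown';
    $bestCount = 0;

    foreach ($scripts as $name) {
        // \p{Name} matches any character belonging to that Unicode script.
        // A false return (error) is cast to 0.
        $count = (int) preg_match_all('/\p{' . $name . '}/u', $text);
        if ($count > $bestCount) {
            $best = $name;
            $bestCount = $count;
        }
    }

    return $best;
}
```

Text with no letters from the listed scripts, such as a string of digits, comes back as 'Unknown', which a filter could treat as a separate case.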

I don't think that this sort of thing is a good anti-spam mechanism at all. It's as likely to produce false positives as it is to catch spam, which after all is very often made up of proper words. Varying spoiler fields (CSS-hidden inputs that must be left blank but won't be by bots) and one-use or limited-time submission tokens are much more likely to be effective.
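A minimal sketch of the hidden-field and limited-time-token idea might look like the following. Everything named here is an assumption made for illustration: the FORM_SECRET constant, the "website" and "token" field names, and the issue_token and looks_human helpers. The field name is fixed for brevity, whereas the suggestion above is to vary it. The token is time-limited only; making it one-use would additionally require remembering spent tokens on the server.

```php
<?php
// Assumption: a server-side secret; load it from configuration in real code.
const FORM_SECRET = 'change-me';

// Hypothetical helper: issues a token of the form "<timestamp>:<hmac>".
function issue_token(int $now): string
{
    return $now . ':' . hash_hmac('sha256', (string) $now, FORM_SECRET);
}

// Hypothetical helper: accepts a submission only if the hidden field is
// empty and the token is authentic and no older than $maxAge seconds.
function looks_human(array $post, int $now, int $maxAge = 3600): bool
{
    // "website" is the CSS-hidden field that real users never see or fill in.
    if (($post['website'] ?? '') !== '') {
        return false;
    }

    $token = $post['token'] ?? '';
    if (!is_string($token)) {
        return false;
    }

    $parts = explode(':', $token, 2);
    if (count($parts) !== 2 || preg_match('/^\d+$/', $parts[0]) !== 1) {
        return false;
    }

    [$issued, $mac] = $parts;
    $expected = hash_hmac('sha256', $issued, FORM_SECRET);
    $age = $now - (int) $issued;

    // hash_equals() compares the MACs in constant time.
    return hash_equals($expected, $mac) && $age >= 0 && $age <= $maxAge;
}
```

The form would embed issue_token(time()) in a hidden input when rendered, and the handler would call looks_human($_POST, time()) before running any content-based checks.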

bobince 2009-11-21 20:40:30

Thank you very much! I'll use your advice.

sombe 2009-11-21 21:15:03

Answer 3

+1 A:

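You could use the intl extension's Normalizer to find strings that are not in normalized form, such as ones that build accented letters from combining marks:

<?php
    // By default this checks for Normalization Form C (NFC), so precomposed
    // accented characters such as "é" (U+00E9) count as already normalized.
    if (! normalizer_is_normalized($input)) {
        // handle non-normalized input
    }
?>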

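If needed, you could also use this class to normalize strings before searching for vowels. Decomposing to Form D (NFD) splits each accented letter into a base letter plus combining marks, so a plain ASCII pattern can find the base vowel:

<?php
    // Form D turns "é" into "e" followed by U+0301 (combining acute accent).
    $norm = normalizer_normalize($input, Normalizer::FORM_D);
    if (! preg_match('/[aeiou]/', $norm)) {
        // handle no-vowels in input
    }
?>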

You'll also want to read about the default normalization form and make sure that it satisfies your requirements.

jheddings 2009-11-21 21:07:28
