Hello!

I would like to parse user inputs with PHP. I need a function which tells me if there are invalid characters in the text or not. My draft looks as follows:

<?php
function contains_invalid_characters($text) {
    for ($i = 0; $i < 3; $i++) {
        $text = html_entity_decode($text); // decode HTML entities
    } // the loop handles entities that were HTML-encoded more than once
    $found = preg_match(...);
    return $found;
}
?>

The function should return TRUE if the input text contains invalid characters and FALSE if not. Valid characters should be:

a-z, A-Z, 0-9, äöüß, blank space, "!§$%&/()=[]\?.:,;-_

Can you tell me how to code this? Is preg_match() suitable for this purpose? It's also important that I can easily expand the function later so that it includes other characters.

I hope you can help me. Thanks in advance!

+3  A: 

You could use a regular expression to do that:

function contains_invalid_characters($text) {
    return (bool) preg_match('/[a-zA-Z0-9äöüß "!§$%&\/()=[\]\?.:,;\-_]/u', $text);
}

But note that the file containing this code must be saved in the same encoding as the text you want to test. I recommend using UTF-8 for both.

Gumbo
Thanks! Unfortunately, it returns an "Unknown modifier" error for lots of characters. At first, the error only appears for "(", but when I remove the "(", it also appears for other characters. Can I escape them so that it works?
The `/` and `]` needed to be escaped.
Gumbo
Thank you! Now I get the message "Compilation failed: invalid UTF-8 string at offset 11". This should be due to "äöüß", shouldn't it? How can I encode these characters?
What encoding do you use?
Gumbo
I use UTF-8. I can't replace the pattern by "äöüß", can I?
If that file is encoded as UTF-8, there should be no errors. This error only occurs when the file is not encoded as UTF-8.
Gumbo
OK, then my file can't be encoded as UTF-8. To avoid this error message, I can just replace the umlauts in my text beforehand so that I can drop them from my pattern. Thank you very much!
But isn't there any way to encode them? Since I get the error message, my encoding can't be UTF-8. What should I do?
You can convert between encodings. See for example `utf8_encode` or `mb_convert_encoding`.
Gumbo
I don't want to convert the text since it already has the correct encoding. So I must encode the regular expression, right? But /[a-zA-Z0-9äöüß "!§$%\-_]/ doesn't work, does it?
I meant converting the string declaration that defines the regular expression. But it would be better to convert the whole file to UTF-8.
Gumbo
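One way to sidestep the file-encoding problem discussed in these comments is to write the non-ASCII characters as PCRE code point escapes, so that the source file contains only ASCII. This is a sketch under that assumption, not something the answer itself proposed. The character class is kept exactly as posted in the answer; only ä, ö, ü, ß and § are replaced by `\x{...}` escapes, which the `/u` modifier interprets as Unicode code points. The text being tested still has to be valid UTF-8.

```php
<?php
// Same character class as in the answer, but ä ö ü ß § are spelled as
// PCRE code point escapes, so this file contains only ASCII bytes.
$pattern = '/[a-zA-Z0-9\x{E4}\x{F6}\x{FC}\x{DF} "!\x{A7}$%&\/()=[\]\?.:,;\-_]/u';

// "\xC3\xA4" is the UTF-8 byte sequence for ä.
$matches_umlaut = preg_match($pattern, "\xC3\xA4");
```

Because the pattern no longer depends on how the file is saved, the "invalid UTF-8 string" compilation error should not be triggered by the pattern itself.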
I think I've got the solution: First, do utf8_decode(). Then use preg_match(), since there is no mb_preg_match() function for multi-byte support. Finally, use utf8_encode(). This should work!?
Isn't that character class supposed to be negated? I.e., '/[^a-zA-Z etc.
Alan Moore
Yes, of course! :) Thank you, Alan M.
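
Putting the comments together, a corrected version might look like the following sketch. It negates the character class as Alan Moore suggests, keeps the allowed characters in one place so the list can be extended later, and treats a `preg_match()` failure (for example, input that is not valid UTF-8) as invalid. The constant name `ALLOWED_EXTRA` is hypothetical, the `"\u{...}"` string escapes require PHP 7 or later, and the entity-decoding loop from the question is left out for brevity. Following the answer's pattern, the backslash is not treated as a valid character; append one to the constant if it should be.

```php
<?php
// Hypothetical constant: characters allowed in addition to a-z, A-Z, 0-9.
// The "\u{...}" escapes (PHP 7+) produce ä ö ü ß § as UTF-8 no matter how
// this source file itself is encoded.
const ALLOWED_EXTRA = "\u{E4}\u{F6}\u{FC}\u{DF}\u{A7}" . ' "!$%&/()=[]?.:,;-_';

function contains_invalid_characters($text) {
    // preg_quote() escapes regex metacharacters such as "]" and "-",
    // plus the "/" delimiter, so the list is safe inside a character class.
    $pattern = '/[^a-zA-Z0-9' . preg_quote(ALLOWED_EXTRA, '/') . ']/u';

    $found = preg_match($pattern, $text);

    // preg_match() returns false on error, e.g. when $text is not valid
    // UTF-8; this sketch treats that case as invalid input too.
    return $found === false || $found === 1;
}
```

To allow further characters later, only the `ALLOWED_EXTRA` string needs to change.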