I'm looking for a general strategy/advice on how to handle invalid UTF-8 input from users.

Even though my webapp uses UTF-8, somehow some users enter invalid characters. This causes errors in PHP's json_encode(), and keeping invalid data around seems like a bad idea in general.

W3C I18N FAQ: Multilingual Forms says "If non-UTF-8 data is received, an error message should be sent back.".

  • How exactly should this be practically done, throughout a site with dozens of different places where data can be input?
  • How do you present the error in a helpful way to the user?
  • How do you temporarily store and display bad form data so the user doesn't lose all their text? Strip bad characters? Use a replacement character, and how?
  • For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave it as-is in the database but do something (what?) before json_encode()?

EDIT: I'm very familiar with the mbstring extension and am not asking "how does UTF-8 work in PHP". I'd like advice from people with real-world experience on how they've handled this.

EDIT2: As part of the solution, I'd really like to see a fast method to convert invalid characters to U+FFFD.

A: 

There is a multibyte extension for PHP, check it out: http://www.php.net/manual/en/book.mbstring.php

You should try mb_check_encoding() function.

Good luck!

Otar
I'm very familiar with the mb extension, as I linked to it in my own question. Comments on this page indicate that this mb_check_encoding() does not really check for bad byte sequences, plus I'm really asking about a general strategy, not how to do one specific part.
philfreo
+2  A: 

Receiving invalid characters from your web app might have to do with the character sets assumed for HTML forms. You can specify which character set to use for forms with the accept-charset attribute:

<form action="..." accept-charset="UTF-8">

You might also want to look at similar questions on Stack Overflow for pointers on how to handle invalid characters, e.g. those in the column to the right. I think that signaling an error to the user is better than trying to clean up the invalid characters, though, because cleaning can cause unexpected loss of significant data or unexpected changes to your user's input.

Archimedix
It specifies the character sets accepted by the server. I'm not sure whether it is enough to only specify UTF-8 encoding for the page - the browser could display UTF-8 while sending form data in ISO-8859-1 or something else.
Archimedix
What does `accept-charset` really do -- is it impossible for a user to submit invalid characters, or only a suggestion? How should I handle bad data if I still receive it server-side?
philfreo
According to http://stackoverflow.com/questions/3719974/is-there-any-benefit-to-adding-accept-charsetutf-8-to-html-forms-if-the-page this would be unnecessary
philfreo
I do not use this attribute myself either and have no problems with UTF-8 characters I tested so far. Referring to Pekka's comment to that question, however, the W3C specification really says that *The default value for this attribute is the reserved string "UNKNOWN". User agents **may** interpret this value as the character encoding that was used to transmit the document*, so I'm not really sure how browsers handle that. http://stackoverflow.com/questions/3719974/#comment-3926382
Archimedix
When you encounter bad data, my opinion is that you should notify the user about that and give her the opportunity to revise her input. This way, you avoid confusion and the user could work around this issue. However, it would be interesting to identify the circumstances leading to you receiving invalid data in the first place - is this caused by specific browsers, what headers are sent by client and server, what encoding is set in the browser after the page with the form is loaded etc.
Archimedix
A: 

How about stripping all chars outside your given subset? At least in some parts of my application I would not allow chars outside the [a-zA-Z] and [0-9] sets, for example in usernames. You can build a filter function that silently strips all chars outside this range, or that returns an error if it detects them and pushes the decision to the user.

Elzo Valugi
"just ignoring malformed sequences or unavailable characters does not conform to ISO 10646, will make debugging more difficult, and can lead to user confusion." http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
philfreo
@philfreo: is what you've linked your homework or just a reference? If it's just a reference, that's because the prof is giving students a homework assignment and challenging them -- not because there is philosophical relevance to detecting bad encoding. You know the expression "the show must go on"? That applies to programming too, and that is why my answer gives you the ability to either strip bad characters or return an error if they are detected.
Geekster
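A minimal sketch of the two options described in this answer (reject, or silently strip), assuming an ASCII letters-and-digits rule for usernames. The function names is_valid_username() and strip_to_alnum() are made up for illustration and do not come from the thread.

```php
<?php
// Hypothetical helper: true only for a non-empty string of ASCII letters/digits.
function is_valid_username($name)
{
    // The D modifier stops "$" from matching before a trailing newline.
    return is_string($name) && preg_match('/^[A-Za-z0-9]+$/D', $name) === 1;
}

// Hypothetical helper: silently drops every byte outside the allow-list.
// The pattern runs in byte mode, so each byte of a multi-byte character is removed.
function strip_to_alnum($name)
{
    return preg_replace('/[^A-Za-z0-9]/', '', $name);
}
```

Rejecting with is_valid_username() leaves the decision with the user, while strip_to_alnum() changes the input without telling them, which is the trade-off the comments above argue about.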
A: 

Try doing what Rails does to force all browsers always to post UTF-8 data:

<form accept-charset="UTF-8" action="#{action}" method="post"><div
    style="margin:0;padding:0;display:inline">
    <input name="utf8" type="hidden" value="&#x2713;" />
  </div>
  <!-- form fields -->
</form>

See railssnowman.info or the initial patch for an explanation.

  1. To have the browser send form-submission data in the UTF-8 encoding, just render the page with a Content-Type header of "text/html; charset=utf-8" (or use a meta http-equiv tag).
  2. To have the browser send form-submission data in the UTF-8 encoding, even if the user fiddles with the page encoding (browsers let users do that), use accept-charset="UTF-8" in the form.
  3. To have the browser send form-submission data in the UTF-8 encoding, even if the user fiddles with the page encoding (browsers let users do that), and even if the browser is IE and the user switched the page encoding to Korean and entered Korean characters in the form fields, add a hidden input to the form with a value such as &#x2713; which can only be from the Unicode charset (and, in this example, not the Korean charset).
Justice
Does `accept-charset` really force browsers to not send any non-UTF8 data? What happens if they try to? How should I handle it on the server if this client-side validation is bypassed?
philfreo
Can you explain the hidden field as well - is that necessary?
philfreo
According to http://stackoverflow.com/questions/3719974/is-there-any-benefit-to-adding-accept-charsetutf-8-to-html-forms-if-the-page this would all be unnecessary
philfreo
I'm not sure you read that other page correctly.... I edited my answer to include the explanation of what Rails does.
Justice
This won't help protect against XSS attacks because it's client side. I believe the idea here is to purify the data coming into the system, but you can't rely on HTML flags for that.
Geekster
If a malicious client throws garbage at the server, it's OK for the server to respond with 400 Bad Request. For well-behaved clients - browsers - use the three tricks above to avoid the server spitting back a 400 Bad Request because of encoding mismatches.
Justice
Never rely on clients to have a browser. Think of the bots! And also think of people who use bots legitimately, such as if they do a trackback from their blog, or a pingback. You're not always going to have a browser viewing/submitting to your site. Think also of people with mobile apps that might not have the same constraints as PC browsers. Cleaning of data has to happen server side. You have to assume they are throwing garbage at you.
Geekster
Don't clean bad data from bots, just error. Cleaning bad data means transforming data in a way that does not preserve the original data just so that your app can pretend it makes sense when it doesn't. You may permit multiple encodings and server-side look at the Content-Type header to determine the charset/encoding used, and do conversions server-side from the known charset/encoding. Bots should not be doing posts, and the scripts that should be doing posts should send data in the correct charset/encoding or any of the charset/encodings that your app supports.
Justice
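A rough sketch of the "just error" approach from the comments above: check every submitted key and value on the server and reject the request if anything is malformed. It relies on preg_match() with the /u modifier returning false for a malformed UTF-8 subject. The helper names is_valid_utf8() and find_invalid_utf8_fields() are hypothetical.

```php
<?php
// True when $value is well-formed UTF-8.
// preg_match() with the /u modifier returns false for malformed subjects.
function is_valid_utf8($value)
{
    return preg_match('//u', $value) === 1;
}

// Hypothetical helper: collects the names of all fields whose key or value
// is not valid UTF-8. Nested fields are reported as "parent[child]".
function find_invalid_utf8_fields(array $input, $prefix = '')
{
    $bad = array();

    foreach ($input as $key => $value) {
        $name = ($prefix === '') ? (string) $key : $prefix . '[' . $key . ']';

        if (!is_valid_utf8((string) $key)) {
            $bad[] = $name;
        } elseif (is_array($value)) {
            // Recurse into nested arrays such as tags[]=a&tags[]=b
            $bad = array_merge($bad, find_invalid_utf8_fields($value, $name));
        } elseif (is_string($value) && !is_valid_utf8($value)) {
            $bad[] = $name;
        }
    }

    return $bad;
}
```

One way to use it would be to call find_invalid_utf8_fields($_POST) once in a front controller and, if the result is non-empty, send a 400 status and name the offending fields in the error message. A reported name can itself contain bad bytes when the key is invalid, so it should not be echoed back unescaped.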
+8  A: 

The accept-charset="UTF-8" attribute is only a guideline for browsers; nothing forces a client to submit data that way, and crappy form submission bots are a good example.

What I usually do is ignore bad chars, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions. If you use iconv() you also have the option to transliterate bad chars.

Here is an example using iconv():

$str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
$str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);

If you want to display an error message to your users I'd probably do this in a global way instead of a per value received basis, something like this would probably do just fine:

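function utf8_clean($value)
{
    // Recurse into nested arrays such as ?tags[]=a&tags[]=b
    if (is_array($value))
    {
        return array_map('utf8_clean', $value);
    }

    return iconv('UTF-8', 'UTF-8//IGNORE', $value);
}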

$clean_GET = array_map('utf8_clean', $_GET);

if (serialize($_GET) != serialize($clean_GET))
{
    $_GET = $clean_GET;
    $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}

// $_GET is clean!

You may also want to normalize new lines and strip (non-)visible control chars, like this:

function Clean($string, $control = true)
{
    $string = iconv('UTF-8', 'UTF-8//IGNORE', $string);

    if ($control === true)
    {
        return preg_replace('~\p{C}+~u', '', preg_replace('~\r[\n]?~', "\n", $string));
    }

    return preg_replace('~[^\P{C}\t\n]+~u', '', preg_replace('~\r[\n]?~', "\n", $string));
}

Code to convert from UTF-8 to Unicode codepoints:

function Codepoint($char)
{
    $result = null;
    $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

    if (is_array($codepoint) && array_key_exists(1, $codepoint))
    {
        $result = sprintf('U+%04X', $codepoint[1]);
    }

    return $result;
}

echo Codepoint('à'); // U+00E0
echo Codepoint('ひ'); // U+3072

Probably faster than any other alternative, haven't tested it extensively though.


Example:

$string = 'hello world�';

// U+FFFEhello worldU+FFFD
echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);

function Bad_Codepoint($string)
{
    $result = array();

    foreach ((array) $string as $char)
    {
        $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

        if (is_array($codepoint) && array_key_exists(1, $codepoint))
        {
            $result[] = sprintf('U+%04X', $codepoint[1]);
        }
    }

    return implode('', $result);
}

Is this what you were looking for?

Alix Axel
Would this method allow you to replace invalid characters with U+FFFD rather than just stripping? It seems like that'd be more helpful so the user sees exactly which chars had a problem.
philfreo
@philfreo: Not that I know of, not with iconv. But you might get away with regular expressions, something like: `preg_replace('/([^\p{L}\p{M}\p{Z}\p{N}\p{P}\p{S}\p{C}])/u', 'convert_to_unicode_notation("\\1"))', string);` - this is just from the top of my sleepy head, better regexes surely exist out there. Bear in mind that this will be considerably slower than the iconv approach though!
Alix Axel
@philfreo: Ok, this one is a must read: http://webcollab.sourceforge.net/unicode.html.
Alix Axel
Good link. I'd really like to see a *fast* method for translating invalid characters to U+FFFD.
philfreo
@philfreo: I highly doubt anything substantially faster will be available anytime soon. You could run `iconv()` and if the data has changed use the regex I posted above but wouldn't you then need to check if the transliteration of chars is being submitted and then alert the user (again)?
Alix Axel
How about something like http://us2.php.net/manual/en/function.utf8-encode.php#97533 but that instead of just testing for UTF8, replaces invalid with U+FFFD
philfreo
@philfreo: That has to be slower than the regex I've posted before.
Alix Axel
Ok, for the sake of completeness in your answer, can you include: some code, however slow, that converts invalid to `U+FFFD`, as well as a couple details on why iconv is more reliable than `utf8_encode`?
philfreo
@philfreo: Just posted some code to output Unicode code points, I suppose you know where to fit that in the whole picture. Regarding your `utf8_encode` question, the manual page says it all: "encodes **an ISO-8859-1 string** to UTF-8", it throws garbage all the time. `iconv` on the other hand is a mature C library not PHP specific, hence more reliable.
Alix Axel
@philfreo: "I'd really like to see a fast method to convert invalid characters to U+FFFD". I spent nearly an hour on this; you have to be more explicit about what you are trying to do because I'm not following...
Alix Axel
Check out http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt - "replace any malformed UTF-8 sequence by a replacement character (U+FFFD), which looks a bit like an inverted question mark, or a similar symbol" - http://www.fileformat.info/info/unicode/char/fffd/index.htm
philfreo
so when invalid data is found, rather than just stripping it (`//IGNORE`), the user sees which character was invalid.
philfreo
I just ran your last code snippet and got the literal text "U+FFFD" rather than having it actually replace the invalid byte sequence with the replacement character that is represented by U+FFFD
philfreo
@philfreo: That is what `iconv('UTF-8', 'UTF-8//TRANSLIT', $str)` is for.
Alix Axel
Actually, in testing some invalid utf8 data, translit doesn't actually do that for me. Are you sure? (It also causes an error, so I used //IGNORE//TRANSLIT). Translit just seems to be for things like converting €, stripping accents, etc. It doesn't convert invalid to U+FFFD.
philfreo
@philfreo: Could you share the invalid data? Also I'm pretty sure `//IGNORE//TRANSLIT` will just count as `//IGNORE`.
Alix Axel
Sure. Just try outputting this ( http://stackoverflow.com/questions/1301402/example-invalid-utf8-string/3886015#3886015 ). With //IGNORE the invalid characters are stripped. TRANSLIT does nothing in this case (but has an error without also using IGNORE). It seems ideal to replace invalid bytes with U+FFFD rather than stripping so the user can see where the problem is when they look at what was entered. If that happened, then the browser would show the U+FFFD as an upside down question mark and it would also be safe to json_encode().
philfreo
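A sketch of what the comments above ask for: replacing each invalid byte with the actual U+FFFD character instead of stripping it. This is not code from the thread. It reuses the byte ranges of the W3C pattern quoted in another answer (widened to accept all ASCII bytes) and runs in byte mode, and the names utf8_replace_invalid() and utf8_replace_invalid_cb() are made up. It has not been benchmarked, so no claim is made that it is fast.

```php
<?php
// Replaces every byte that is not part of a well-formed UTF-8 sequence with
// U+FFFD (EF BF BD in UTF-8). No /u modifier: the pattern works on raw bytes.
function utf8_replace_invalid($str)
{
    $pattern = '/(
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2-byte, no overlongs
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # 3-byte, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # 3-byte, no surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # 4-byte, planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # 4-byte, plane 16
        ) | (.)/xs';

    return preg_replace_callback($pattern, 'utf8_replace_invalid_cb', $str);
}

// Callback: group 1 holds a valid sequence, group 2 a single stray byte.
function utf8_replace_invalid_cb(array $m)
{
    return (isset($m[2]) && $m[2] !== '') ? "\xEF\xBF\xBD" : $m[1];
}
```

Because the result is well-formed UTF-8, it can be passed to json_encode() and shown back to the user with the problem positions visible. Each stray byte yields its own replacement character, so a truncated multi-byte sequence may show up as more than one U+FFFD.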
A: 

I recommend merely not allowing garbage to get in. Don't rely on custom functions, which can bog your system down. Simply walk the submitted data against an alphabet you design. Create an acceptable alphabet string and walk the submitted data, byte by byte, as if it were an array. Push acceptable characters to a new string, and omit unacceptable characters. The data you store in your database then is data triggered by the user, but not actually user-supplied data.

EDIT #4: Replacing bad characters with an entity: �

EDIT #3: Updated : Sept 22 2010 @ 1:32pm Reason: Now string returned is UTF-8, plus I used the test file you provided as proof.

<?php
// build alphabet
// optionally you can remove characters from this array

$alpha[]= chr(0); // null
$alpha[]= chr(9); // tab
$alpha[]= chr(10); // new line
$alpha[]= chr(11); // vertical tab
$alpha[]= chr(13); // carriage return

for ($i = 32; $i <= 126; $i++) {
$alpha[]= chr($i);
}

/* remove comment to check ascii ordinals */

// /*
// foreach ($alpha as $key=>$val){
//  print ord($val);
//  print '<br/>';
// }
// print '<hr/>';
//*/
// 
// //test case #1
// 
// $str = 'afsjdfhasjhdgljhasdlfy42we875y342q8957y2wkjrgSAHKDJgfcv kzXnxbnSXbcv   '.chr(160).chr(127).chr(126);
// 
// $string = teststr($alpha,$str);
// print $string;
// print '<hr/>';
// 
// //test case #2
// 
// $str = ''.'©?™???';
// $string = teststr($alpha,$str);
// print $string;
// print '<hr/>';
// 
// $str = '©';
// $string = teststr($alpha,$str);
// print $string;
// print '<hr/>';

$file = 'http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt';
$testfile = implode(chr(10),file($file));

$string = teststr($alpha,$testfile);
print $string;
print '<hr/>';


function teststr(&$alpha, &$str){
    $strlen = strlen($str);
    $newstr = chr(0); //null
    $x = 0;
    if($strlen >= 2){

        for ($i = 0; $i < $strlen; $i++) {
            $x++;
            if(in_array($str[$i],$alpha)){
                // passed
                $newstr .= $str[$i];
            }else{
                // failed
                print 'Found out of scope character. (ASCII: '.ord($str[$i]).')';
                print '<br/>';
                $newstr .= '&#65533;';
            }
        }
    }elseif($strlen <= 0){
        // failed to qualify for test
        print 'Non-existent.';

    }elseif($strlen === 1){
        $x++;
        if(in_array($str,$alpha)){
            // passed

            $newstr = $str;
        }else{
            // failed
            print 'Total character failed to qualify.';
            $newstr = '&#65533;';
        }
    }else{
        print 'Non-existent (scope).';
        }

if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8"){
// skip
}else{
    $newstr = utf8_encode($newstr);
}


// test encoding:
if(mb_detect_encoding($newstr, "UTF-8")=="UTF-8"){
    print 'UTF-8 :D<br/>';
    }else{
        print 'ENCODED: '.mb_detect_encoding($newstr, "UTF-8").'<br/>';
        }




return $newstr.' (scope: '.$x.', '.$strlen.')';
}
Geekster
How do you propose doing that, when the "alphabet" is any valid UTF-8 character.
philfreo
See EDIT #1 in my answer I'm adding it now.
Geekster
Okay EDIT #1 is updated and should purify anything you want to put into JSON. Of course you can adjust the characters in your alphabet if JSON still chokes. If you could post a sample data file that is choking on JSON that'd help me fine-tune this.
Geekster
That doesn't look like it supports UTF-8 to me...
philfreo
It is now UTF-8 returned, proof.
Geekster
I have updated it to use the file you provided. Your server will need to have fopen wrappers enabled because I'm reading the URL into file(). Of course if you want you can download the file and read it in from your directory but I'm LAZY. :D
Geekster
Could you make it simply replace invalid characters with U+FFFD, as that document suggests?
philfreo
@philfreo: Updated, if you don't want any output just comment out the print rows.
Geekster
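For comparison, the same allow-list can be written as a single byte-mode regular expression instead of an in_array() lookup per byte. This is only a sketch of the idea, assuming the alphabet from the answer above (NUL, tab, LF, VT, CR and printable ASCII 32-126). The name ascii_allowlist() is made up.

```php
<?php
// Same allow-list as the alphabet above: NUL, tab, LF, VT, CR and printable
// ASCII (32-126). Every other byte becomes the entity &#65533;.
function ascii_allowlist($str)
{
    return preg_replace('/[^\x00\x09-\x0B\x0D\x20-\x7E]/', '&#65533;', $str);
}
```

Like the loop above, this works byte by byte, so a valid two-byte character such as © comes out as two entities rather than being preserved.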
+1  A: 

I put together a fairly simple class to check if input is in UTF-8 and to run it through utf8_encode() as need be:

class utf8
{

    /**
     * @param array $data
     * @return array
     */
    public static function encode(array $data)
    {
        foreach ($data as $key=>$val) {
            if (is_array($val)) {
                $data[$key] = self::encode($val);
            } else {
                if (false === self::check($val)) {
                    $data[$key] = utf8_encode($val);
                }
            }
        }

        return $data;
    }

    /**
     * Regular expression to test a string is UTF8 encoded
     * 
     * RFC3629
     * 
     * @param string $string The string to be tested
     * @return bool
     * 
     * @link http://www.w3.org/International/questions/qa-forms-utf-8.en.php
     */
    public static function check($string)
    {
        // preg_match() returns 0 (not false) when there is no match, so cast to bool
        return (bool) preg_match('%^(?:
            [\x09\x0A\x0D\x20-\x7E]              # ASCII
            | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
            |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
            |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
            | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
            |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
            )*$%xs',
            $string);
    }
}

// For example
$data = utf8::encode($_POST);
Nev Stokes
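utf8_encode() always treats its input as ISO-8859-1, so the class above only repairs data that really is Latin-1. A sketch of the same "leave valid UTF-8 alone, convert the rest" idea with the source encoding made explicit, for example when cleaning up existing database rows. The function name to_utf8_assuming() and the Windows-1252 default are assumptions, not something the thread establishes about the data.

```php
<?php
// Hypothetical helper for legacy data: keeps valid UTF-8 untouched and
// converts everything else from an ASSUMED source encoding.
function to_utf8_assuming($str, $assumed = 'Windows-1252')
{
    // preg_match() with /u returns false for malformed UTF-8.
    if (preg_match('//u', $str) === 1) {
        return $str; // already well-formed UTF-8
    }

    // Requires the mbstring extension.
    return mb_convert_encoding($str, 'UTF-8', $assumed);
}
```

The guess can still be wrong for a given row, so it is probably safer to log or review converted values than to overwrite the originals blindly.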
A: 

For completeness to this question (not necessarily the best answer)...

function as_utf8($s) {
    return mb_convert_encoding($s, "UTF-8", mb_detect_encoding($s));
}
philfreo