I need to ensure that all my strings are UTF-8. Would it be better to check that input coming from a user is ASCII-like or that it is UTF-8-like?

//KohanaPHP
function is_ascii($str) {
    return ! preg_match('/[^\x00-\x7F]/S', $str);
}

//Wordpress
function seems_utf8($Str) {
    for ($i = 0; $i < strlen($Str); $i++) {
        if (ord($Str[$i]) < 0x80) continue;              # 0bbbbbbb
        elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n = 1;  # 110bbbbb
        elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n = 2;  # 1110bbbb
        elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n = 3;  # 11110bbb
        elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n = 4;  # 111110bb
        elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n = 5;  # 1111110b
        else return false;                               # Does not match any model
        for ($j = 0; $j < $n; $j++) {                    # n continuation bytes matching 10bbbbbb must follow
            if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
                return false;
        }
    }
    return true;
}

I did some benchmarking on 100 strings (half valid UTF-8/ASCII and half not) and found that seems_utf8() takes 0.011 seconds while is_ascii() only takes 0.001. But my gut tells me that you get what you pay for and that the UTF-8 check would be the better choice.
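
(For reference, the timing loop was roughly along these lines; this is a simplified sketch rather than the exact harness, and $samples just stands in for my 100 test strings.)

//Simplified timing sketch: $samples stands in for the real 100-string test set.
$samples = array('hello', 'accentué', "\xC3\x28", "\xFF\xFE\xFA");

$start = microtime(true);
foreach ($samples as $s) { is_ascii($s); }
$ascii_time = microtime(true) - $start;

$start = microtime(true);
foreach ($samples as $s) { seems_utf8($s); }
$utf8_time = microtime(true) - $start;

printf("is_ascii: %.4f sec, seems_utf8: %.4f sec\n", $ascii_time, $utf8_time);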

I'm then planning on doing a conversion, something like this:

<?php

/* Example data */
$string[] = 'hello';
$string[] = 'asdfghjkl;qwertyuiop[]\zxcvbnm,./]12345657890-=+_)(*&^%$#@!';
$string[] = '';
$string[] = 'accentué';
$string[] = '»á½µÎ½Ï‰Î½ Ï„á½° ';
$string[] = '???R??=8 ????? ++++¦??? ???2??????';
$string[] = 'hello¦ùó 5/5¡45-52ZÜ¿»'. "0x93". octdec('77'). decbin(26). "F???pp?? ??? ". '»á½µÎ½Ï‰Î½ Ï„á½° ';


$time = microtime(true);

//Count the successes
$true = array(1 => 0, 0 => 0);

foreach($string as $s) {
    $r = seems_utf8($s); //0.011
    $true[$r ? 1 : 0]++; //Tally valid vs. invalid results

    print_pre(mb_substr($s, 0, 30). ' is '. ($r ? 'UTF-8' : 'non-UTF-8'));


    if( ! $r ) {

     $e = mb_detect_encoding($s, "auto");

     print_pre('Encoding: '. $e);

     //Convert
     $s = iconv($e, 'UTF-8//TRANSLIT', $s);

     print_pre(mb_substr($s, 0, 30). ' is now '. (seems_utf8($s) ? 'valid' : 'not'). ' UTF-8');
    }

}

print_pre($true);
print_pre((microtime(TRUE) - $time). ' seconds');

function print_pre() { print '<pre>'; print_r(func_get_args()); print '</pre>'; }
A: 

I'm assuming that what you're doing is checking whether the iconv call is actually necessary before executing it?

If you don't expect non-ASCII characters to occur very often, then is_ascii seems like the most efficient approach: iconv would only need to be triggered when a character above 7 bits is encountered.

If high-bit characters are likely to appear in the checked string, then seems_utf8 might be more efficient, since you would need to call iconv far less often, unless there is also a high frequency of high-bit but non-UTF-8 characters.
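
Untested, but the gating could look something like this (reusing the is_ascii() helper from your question; to_utf8() is just an illustrative name):

function to_utf8($str) {
    //Cheap check first: pure 7-bit ASCII is already valid UTF-8.
    if (is_ascii($str)) {
        return $str;
    }

    //Only pay for detection + conversion when the cheap check fails.
    $enc = mb_detect_encoding($str, "auto");
    if ($enc === false) {
        return false; //Could not guess a source encoding
    }

    return iconv($enc, 'UTF-8//TRANSLIT', $str);
}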

Ben
I updated my question with some example code.
Xeoncross
+1  A: 

Making the choice between ASCII and UTF-8 based on performance is probably the wrong approach. The answer really depends on your use case. If your strings need to support internationalization, you should most likely go with UTF-8. If your site is English-only, you might go with ASCII, or maybe you'd still go with UTF-8. Whatever you choose, it should probably match the character encoding you set on the HTML form you serve to solicit the input from your user.
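
For example (the file and field names here are only placeholders):

<?php
//Declare UTF-8 both in the HTTP response header and on the form itself.
header('Content-Type: text/html; charset=UTF-8');
?>
<form method="post" action="submit.php" accept-charset="UTF-8">
    <input type="text" name="comment">
    <input type="submit" value="Send">
</form>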

Asaph
+1  A: 

I'm not sure how necessary parts of this approach are. If you ask the user for UTF-8 input and they give you "something else", throw it away and ask again.

The various character-set-detecting functions out there are universally (and, tragically, necessarily) imperfect. The ones in the mbstring library, as well as the ones in iconv, aren't even that advanced compared to some of the tools out there. mb_detect_encoding() basically iterates through a list of character sets and returns the first one that makes the string it has in hand look valid. In this day and age it's probable that several would match (which is why the ordering is exposed through mb_detect_order()).
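
A quick illustration of that "first plausible match wins" behaviour (the candidate list is just an example):

//"accentué" encoded as Latin-1; the lone 0xE9 byte is not valid UTF-8.
$bytes = "accentu\xE9";

//With strict checking, the UTF-8 candidate fails and ISO-8859-1 (which
//accepts any byte sequence) is reported instead.
var_dump(mb_detect_encoding($bytes, array('UTF-8', 'ISO-8859-1'), true));

//The default candidate list and its ordering come from mb_detect_order().
mb_detect_order(array('UTF-8', 'ISO-8859-1'));
var_dump(mb_detect_encoding($bytes));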

Ensure your pages are served with the correct HTTP & HTML character set declarations, and browsers should return data in the same encoding. To be extra specific, include the accept-charset attribute in your form tag. I've yet to discover a case where this was ignored that didn't represent an attack.

To check the encoding of a byte stream, you can simply use mb_check_encoding().

preinheimer
Yes, I am mostly worried about the attack case with this question. Unless I know that a string is valid ASCII or UTF-8, it might be a danger to some of my string-processing functions. But how can I know that it is invalid if I don't check it?
Xeoncross
A: 

If you are just trying to protect your inputs so they accept only UTF-8, I think you can just use mb_check_encoding. Something like this:

if ( ! mb_check_encoding($input, 'UTF-8')) {
  die('Non UTF-8 character found');
}

should be enough to reject any invalid input.
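
If you want to apply that to everything coming in, a rough sketch might be (illustrative only, and it assumes flat, non-nested form fields):

//Reject the whole request if any incoming field is not valid UTF-8.
foreach ($_POST as $key => $value) {
    if ( ! mb_check_encoding($value, 'UTF-8')) {
        header('HTTP/1.1 400 Bad Request');
        die('Non UTF-8 input in field: ' . htmlspecialchars($key, ENT_QUOTES, 'UTF-8'));
    }
}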

Arkh
Even if some non-UTF-8 data made it to my site, I wanted to still support it. Although 99% of the time it will just be an attack, perhaps someone on some weird device just can't send UTF-8.
Xeoncross