Say we have a UTF-8 string $s and we need to shorten it so it can be stored in N bytes. Blindly truncating it to N bytes could mess it up. But decoding it to find the character boundaries is a drag. Is there a tidy way?

[Edit 20100414] In addition to S.Mark’s answer: mb_strcut(), I recently found another function to do the job: grapheme_extract($s, $n, GRAPHEME_EXTR_MAXBYTES); from the intl extension. Since intl is an ICU wrapper, I have a lot of confidence in it.

A: 

No. There is no way to do this other than decoding. The coding is pretty mechanical, however. See the table in the Wikipedia article.

Edit: Michael Borgwardt shows us how to do it without decoding the whole string. Clever.

John Knoeller
+9  A: 

Edit: S.Mark's answer is actually better than mine - PHP has a (badly documented) builtin function that solves exactly this problem.

Original "back to the bits" answer follows:

  • Truncate to the desired byte count
  • If the last byte starts with 11 (binary), i.e. it is the lead byte of a multi-byte sequence, drop it as well
  • Otherwise, if the second-to-last byte starts with 111 (binary), i.e. it leads a 3- or 4-byte sequence, drop the last 2 bytes
  • Otherwise, if the third-to-last byte starts with 11110 (binary), drop the last 3 bytes

This ensures that you don't have an incomplete character dangling at the end, which is the main thing that can go wrong when truncating UTF-8.
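The byte-level idea above can be sketched in PHP roughly as follows. This is only an illustration under stated assumptions: the function name utf8_truncate_bytes is made up for this example, the input is assumed to be valid UTF-8 already, and the scalar type declarations assume PHP 7 or later. It looks at each of the last three bytes for a lead byte whose sequence was cut short, which also covers a 3- or 4-byte lead byte sitting at the very end.

```php
<?php
// Hypothetical helper (name is not from the thread): truncate valid UTF-8
// to at most $maxBytes bytes without leaving a partial sequence at the end.
function utf8_truncate_bytes(string $s, int $maxBytes): string
{
    if ($maxBytes <= 0) {
        return '';
    }
    if (strlen($s) <= $maxBytes) {
        return $s; // already fits, nothing is cut
    }

    $t = substr($s, 0, $maxBytes); // plain byte-wise cut
    $len = strlen($t);

    // Walk back over at most 3 bytes to find the last non-continuation byte.
    for ($back = 1; $back <= 3 && $back <= $len; $back++) {
        $byte = ord($t[$len - $back]);
        if (($byte & 0xC0) === 0x80) {
            continue; // 10xxxxxx: continuation byte, keep looking further back
        }
        // Work out how many bytes this sequence is supposed to have.
        if (($byte & 0xF8) === 0xF0) {
            $need = 4; // 11110xxx
        } elseif (($byte & 0xF0) === 0xE0) {
            $need = 3; // 1110xxxx
        } elseif (($byte & 0xE0) === 0xC0) {
            $need = 2; // 110xxxxx
        } else {
            $need = 1; // ASCII
        }
        if ($need > $back) {
            // The sequence is incomplete: drop its lead byte and what follows.
            $t = substr($t, 0, $len - $back);
        }
        break;
    }

    return $t;
}

echo bin2hex(utf8_truncate_bytes("\xc2\x80\xc2\x80", 3)), "\n"; // prints c280
```

Like the steps it is based on, this only guarantees that no partial byte sequence is left at the end; it does nothing about combining marks.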

Unfortunately (as Andrew reminds me in the comments) there are also cases where two separately encoded Unicode code points form a single character (basically, diacritics such as accents can be represented as a separate code point modifying the preceding letter).

Handling this kind of thing requires advanced Unicode-Fu which is not available in PHP and may not even be possible for all cases (there are some weird scripts out there!), but fortunately it's relatively rare, at least for Latin-based languages.

Michael Borgwardt
One thing you should beware of is "decomposed form", which means you can end up losing accents from the last letter in the resulting string if you use this scheme. See: http://en.wikipedia.org/wiki/UTF-8#Precomposition_and_Decomposition
Andrew Medico
Andrew, that's a valid point. But the question was about chopping without yielding _invalid_ UTF-8. Your point, OTOH, is part of a more difficult question: When truncating a string, where is a good place to put the cut? Good question. It depends. If the truncated string is a table index, I'd not worry about the accent problem you mention. For display with trailing …, your point is important but I might trim trailing whitespace too. Making a Tweetable string, max 140 chars (the Twitter limit is chars not bytes, right?), can get quite involved.
fsb
Yes, Twitter's limit is characters.
Michael Borgwardt
+1  A: 

I coded up this simple function for this purpose; it needs the mbstring extension, though.

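function str_truncate($string, $bytes = null)
{
    if (isset($bytes) === true)
    {
        // To speed things up: first keep at most $bytes characters,
        // which is never fewer than what fits in $bytes bytes.
        $string = mb_substr($string, 0, $bytes, 'UTF-8');

        // Then drop one character at a time until the byte length fits.
        while (strlen($string) > $bytes)
        {
            $string = mb_substr($string, 0, -1, 'UTF-8');
        }
    }

    return $string;
}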

While this code also works, S.Mark's answer is obviously the way to go.

Alix Axel
+4  A: 

I don't think you need to reinvent the wheel. You could just use mb_strcut; make sure you set the encoding to UTF-8 first.

mb_internal_encoding('UTF-8');
echo mb_strcut("\xc2\x80\xc2\x80", 0, 3); // from byte offset 0, cut at most 3 bytes

It returns

\xc2\x80

because in \xc2\x80\xc2 the trailing \xc2 is an incomplete sequence and is dropped.
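To make the byte-versus-character distinction concrete, here is a small side-by-side sketch. It assumes the mbstring extension is loaded and that the source file is saved as UTF-8; the wrapper name truncate_to_bytes is hypothetical and only there to pass the encoding explicitly rather than relying on mb_internal_encoding().

```php
<?php
// Hypothetical wrapper (not part of mbstring): cut $s to at most $maxBytes
// bytes, passing the encoding explicitly on every call.
function truncate_to_bytes(string $s, int $maxBytes): string
{
    return mb_strcut($s, 0, $maxBytes, 'UTF-8');
}

$s = 'áéíóú'; // five characters, two bytes each in UTF-8

$byChars = mb_substr($s, 0, 3, 'UTF-8'); // length counted in characters
$byBytes = truncate_to_bytes($s, 3);     // length counted in bytes

printf("mb_substr: %s (%d bytes)\n", $byChars, strlen($byChars)); // áéí (6 bytes)
printf("mb_strcut: %s (%d bytes)\n", $byBytes, strlen($byBytes)); // á (2 bytes)
```

With a limit of 3, mb_substr() keeps three characters and overshoots the byte budget, while mb_strcut() stops at the last character boundary that fits.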

S.Mark
I think that function measures the length in decoded characters, so it can't be used to cut to a specific size in bytes.
Michael Borgwardt
Could you give me an example of why it doesn't work for cutting to a specific size? In my example, it's supposed to cut 3 characters, but the last one is invalid, so it returns only 2 characters.
S.Mark
Because the goal is to trim the string to a certain number of *bytes*. The argument given is in terms of decoded code points, which may be 1, 2, 3, or 4 bytes (maybe more with decomposed accented characters) in UTF-8.
Andrew Medico
I still don't get it. You could pass `1,2,3,4` , instead of `3` in my example above.
S.Mark
I'll assume this is wrong, since I got -2 votes. Can anybody enlighten me with an example?
S.Mark
You could pass 1 as the length in your example and would get a one-character string that requires *two* bytes to encode in UTF-8, not one.
Michael Borgwardt
If you pass 1, \xc2 alone is invalid, so you will get a 0-length string. Is that wrong?
S.Mark
No, you will get \xc2\x80 because those two bytes form *one* character in UTF-8 - that's what the "mb" in all those mb_ functions means: multibyte,
Michael Borgwardt
I may not be good at the mb_ functions, but `mb_internal_encoding('UTF-8');echo mb_strcut("\xc2\x80\xc2\x80\xc2", 0, 1);` gives me a 0-length string for sure; tested.
S.Mark
Hm, I don't have a PHP implementation ready at the moment, but could it be because c280 is a control character? Try something like c380, that should definitely count as 1 character.
Michael Borgwardt
Tested, it's the same. BTW, \xc2 and \x80 are 2 bytes in length, not Unicode \uC280. And 0xc2 is '0b11000010' in binary, 0xc3 is '0b11000011' in binary; why should it be different?
S.Mark
There's something strange there that I'll have to look into tomorrow. I did mean the two-byte sequence C280 interpreted as UTF-8 rather than the Unicode code point. And C280 might be treated differently because the former is an (apparently unspecified) control character.
Michael Borgwardt
There is obviously `0xC2 0xA2` in the wiki example http://en.wikipedia.org/wiki/UTF-8#Description, so `0xC2 0x80` is also a valid and proper UTF-8 sequence.
S.Mark
OK, I tested it and have come to the conclusion that you are right and everyone else is wrong: mb_strcut() actually *does* use byte counts for length - usefully inconsistent with the other mb_ functions and horribly documented in the manual, but it is in fact the best answer to the question.
Michael Borgwardt
It works like it should: `mb_strcut('áéíóú', 0, 4, 'UTF-8'); // áé` and `strlen(mb_strcut('áéíóú', 0, 4, 'UTF-8')); // 4`.
Alix Axel
I pasted test code in another answer that convinced me I can use `mb_strcut()`
fsb
I think it's safe to say that the documentation of `mb_strcut()` in the PHP manual is a bit confusing.
fsb
"incoherent" is the word I'd use - it seems to have been written by someone with a rather tentative grasp of the English language.
Michael Borgwardt
Yes, Michael, I think the author is Japanese.
fsb
To be fair, it's a rather complex subject matter. I can speak Japanese well enough when it comes to everyday subjects, but I'd have a hard time explaining the function as well.
Michael Borgwardt
Thanks, Michael, for the clarification about the mbstring issue. And hi fsb, by "Japanese" do you mean me or the mb_strcut author? If me: yes, I speak both, though neither is my native language. Thanks to Alix Axel too.
S.Mark
+1  A: 

Here's a test for mb_strcut(). It doesn't prove that it does just what we're looking for but I find it pretty convincing.

<?php
ini_set('default_charset', 'UTF-8' );
$strs = array(
    'Iñtërnâtiônàlizætiøn',
    'החמאס: רוצים להשלים את עסקת שליט במהירות האפשרית',
    'ايران لا ترى تغييرا في الموقف الأمريكي',
    '独・米で死傷者を出した銃の乱射事件',
    '國會預算處公布驚人的赤字數據後',
    '이며 세계 경제 회복에 걸림돌이 되고 있다',
    'В дагестанском лесном массиве южнее села Какашура',
    'นายประสิทธิ์ รุ่งสะอาด ปลัดเทศบาล รักษาการแทนนายกเทศมนตรี ต.ท่าทองใหม่',
    'ભારતીય ટીમનો સુવર્ણ યુગ : કિવીઝમાં પણ કમાલ',
    'ཁམས་དཀར་མཛེས་ས་ཁུལ་དུ་རྒྱ་གཞུང་ལ་ཞི་བའི་ངོ་རྒོལ་',
    'Χιόνια, βροχές και θυελλώδεις άνεμοι συνθέτουν το',
    'Հայաստանում սկսվել է դատական համակարգի ձեւավորումը',
    'რუსეთი ასევე გეგმავს სამხედრო');
for ( $i = 10; $i <= 30; $i += 5 ) {
    foreach ($strs as $s) {
        $t = mb_strcut($s, 0, $i, 'UTF-8');
        print(
            sprintf('%3s%3s ', mb_strlen($t, 'UTF-8'), mb_strlen($t, 'latin1'))
            . ( mb_check_encoding($t, 'UTF-8') ? ' OK  ' : ' Bad ' )
            . $t . "\n");
    }
}
?>
fsb
A: 

In addition to S.Mark’s answer which was mb_strcut(), I recently found another function to do a similar job: grapheme_extract($s, $n, GRAPHEME_EXTR_MAXBYTES); from the intl extension.

The functionality is a bit different: the mb_strcut() documentation claims it cuts at the nearest UTF-8 character boundary, so it doesn't respect multi-character graphemes, while grapheme_extract(), OTOH, does. So depending on what you need, grapheme_extract() might be better (e.g. to display a string) or mb_strcut() might be better (e.g. for indexing). Anyway, just thought I'd mention it.
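A small comparison may make the difference easier to see. This is a sketch that assumes both the mbstring and intl extensions are loaded (it prints nothing otherwise); the test string is my own choice: "x" followed by a decomposed "é", i.e. "e" plus U+0301 COMBINING ACUTE ACCENT, for 4 bytes in total.

```php
<?php
// "x", then "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed "é").
// Byte layout: x (1 byte) + e (1 byte) + U+0301 (2 bytes) = 4 bytes.
$s = "xe\xcc\x81";

if (function_exists('mb_strcut') && function_exists('grapheme_extract')) {
    // Cuts at a code point boundary: keeps "e" but loses its accent.
    $byCodePoint = mb_strcut($s, 0, 2, 'UTF-8');

    // Cuts at a grapheme boundary: "e" + accent needs 3 bytes, so with a
    // 2-byte budget only "x" fits.
    $byGrapheme = grapheme_extract($s, 2, GRAPHEME_EXTR_MAXBYTES);

    printf("mb_strcut:        %s\n", bin2hex($byCodePoint)); // 7865
    printf("grapheme_extract: %s\n", bin2hex($byGrapheme));  // 78
}
```

Both results are valid UTF-8 and fit in 2 bytes; they differ only in whether a base letter may be separated from its combining mark.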

(And since intl is an ICU wrapper, I have a lot of confidence in it.)

fsb