ansaurus

Question

Truncate a UTF-8 string to fit a given byte count in PHP

Answer 1

A:

~~No. There is no way to do this other than decoding.~~ The coding is pretty mechanical, however. See the pretty table in the Wikipedia article.

Edit: Michael Borgwardt shows us how to do it without decoding the whole string. Clever.

John Knoeller 2009-12-28 00:49:30

Answer 2

+9 A:

Edit: S.Mark's answer is actually better than mine - PHP has a (badly documented) builtin function that solves exactly this problem.

Original "back to the bits" answer follows:

1. Truncate to the desired byte count.
2. If the last byte starts with 11 (binary), i.e. it is the lead byte of a multi-byte sequence, drop it as well.
3. Otherwise, if the second-to-last byte starts with 111 (binary), drop the last 2 bytes.
4. Otherwise, if the third-to-last byte starts with 1111 (binary), drop the last 3 bytes.

This ensures that you don't have an incomplete character dangling at the end, which is the main thing that can go wrong when truncating UTF-8.
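The steps above can be sketched in PHP roughly as follows. This is a minimal sketch that assumes the input is valid UTF-8 to begin with; the function name utf8_truncate_bytes is made up for illustration. It treats any lead byte found in the last three positions the same way, so a 3- or 4-byte lead byte that lands in the very last position is dropped too.

```php
<?php
// Hypothetical helper: truncate valid UTF-8 to at most $maxBytes bytes
// without leaving an incomplete multi-byte sequence at the end.
function utf8_truncate_bytes(string $s, int $maxBytes): string
{
    if ($maxBytes <= 0) {
        return '';
    }
    if (strlen($s) <= $maxBytes) {
        return $s; // already short enough, nothing to cut
    }

    $t = substr($s, 0, $maxBytes); // plain byte-wise cut
    $len = strlen($t);

    // Inspect the last 1..3 bytes, looking for the lead byte of the final sequence.
    for ($back = 1; $back <= 3 && $back <= $len; $back++) {
        $byte = ord($t[$len - $back]);
        if (($byte & 0xC0) === 0x80) {
            continue; // continuation byte (10xxxxxx): look one byte further back
        }
        if ($byte < 0x80) {
            break; // ASCII byte: the tail is complete
        }
        // Lead byte: 110xxxxx needs 2 bytes, 1110xxxx needs 3, 11110xxx needs 4.
        $need = ($byte & 0xE0) === 0xC0 ? 2 : (($byte & 0xF0) === 0xE0 ? 3 : 4);
        if ($need > $back) {
            $t = substr($t, 0, $len - $back); // sequence was cut short: drop it
        }
        break;
    }

    return $t;
}
```

For example, calling utf8_truncate_bytes("\xc2\x80\xc2\x80", 3) drops the dangling \xc2 and returns the first two bytes only.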

Unfortunately (as Andrew reminds me in the comments) there are also cases where two separately encoded Unicode code points form a single character (basically, diacritics such as accents can be represented as a separate code point modifying the preceding letter).

Handling this kind of thing requires advanced Unicode-Fu which is not available in PHP and may not even be possible for all cases (there are some weird scripts out there!), but fortunately it's relatively rare, at least for Latin-based languages.
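To illustrate the decomposed case, here is a small sketch using only plain byte functions; the variable names are arbitrary. It shows a cut that falls between a base letter and its combining accent: the result is still well-formed UTF-8, but the accent is gone.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT: displays as "é", 3 bytes in total.
$decomposed = "e\xCC\x81";
// The same letter as the single precomposed code point U+00E9: 2 bytes.
$precomposed = "\xC3\xA9";

// Cutting after 1 byte falls exactly between the two code points of the decomposed form.
$cut = substr($decomposed, 0, 1);

// preg_match() with the /u modifier returns false on malformed UTF-8,
// so a result of 1 means the cut string is still well-formed.
$isValidUtf8 = preg_match('//u', $cut) === 1;
```

Here $cut is a bare "e" and $isValidUtf8 is true, so a validity check alone will not notice the lost accent.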

Michael Borgwardt 2009-12-28 00:55:58

One thing you should beware of is "decomposed form", which means you can end up losing accents from the last letter in the resulting string if you use this scheme. See: http://en.wikipedia.org/wiki/UTF-8#Precomposition_and_Decomposition

Andrew Medico 2009-12-28 04:30:52

Andrew, that's a valid point. But the question was about chopping without yielding _invalid_ UTF-8. Your point, OTOH, is part of a more difficult question: when truncating a string, where is a good place to put the cut? Good question. It depends. If the truncated string is a table index, I wouldn't worry about the accent problem you mention. For display with a trailing …, your point is important, but I might trim trailing whitespace too. Making a Tweetable string, max 140 chars (the Twitter limit is chars, not bytes, right?), can get quite involved.

fsb 2009-12-28 14:39:29

Yes, Twitter's limit is characters.

Michael Borgwardt 2009-12-28 15:18:13

Answer 3

+1 A:

I coded up this simple function for this purpose; it requires the mbstring extension, though.

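function str_truncate($string, $bytes = null)
{
    if (isset($bytes) === true)
    {
        // A negative limit would make the loop below spin forever on an empty string.
        $bytes = max(0, (int) $bytes);

        // Coarse cut first to speed things up: $bytes characters occupy at least
        // $bytes bytes, so nothing that could still fit is lost here.
        $string = mb_substr($string, 0, $bytes, 'UTF-8');

        // Then drop one character at a time until the byte length fits.
        while (strlen($string) > $bytes)
        {
            $string = mb_substr($string, 0, -1, 'UTF-8');
        }
    }

    return $string;
}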

While this code also works, S.Mark's answer is obviously the way to go.

Alix Axel 2009-12-28 02:00:44

Answer 4

+4 A:

I don't think you need to reinvent the wheel. You could just use mb_strcut, making sure you set the encoding to UTF-8 first.

mb_internal_encoding('UTF-8');
echo mb_strcut("\xc2\x80\xc2\x80", 0, 3); // from byte offset 0, cut at most 3 bytes

It returns

\xc2\x80

because in \xc2\x80\xc2 the last byte is an incomplete sequence and is therefore invalid.
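To make the byte-versus-character distinction concrete, here is a minimal sketch that puts mb_substr (which counts characters) next to mb_strcut (which counts bytes). It assumes the mbstring extension is loaded; the helper names first_chars and first_bytes are invented for this example.

```php
<?php
// Keep at most $n characters (code points); the byte length may exceed $n.
function first_chars(string $s, int $n): string
{
    return mb_substr($s, 0, $n, 'UTF-8');
}

// Keep at most $n bytes, backing off to the previous character boundary.
function first_bytes(string $s, int $n): string
{
    return mb_strcut($s, 0, $n, 'UTF-8');
}

// "áéíóú" written as explicit bytes so the example does not depend on the
// encoding of the source file: five characters, two bytes each.
$s = "\xC3\xA1\xC3\xA9\xC3\xAD\xC3\xB3\xC3\xBA";
```

With this input, first_chars($s, 4) keeps four characters (8 bytes), whereas first_bytes($s, 4) keeps two characters (4 bytes) and first_bytes($s, 3) keeps just one (2 bytes).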

S.Mark 2009-12-28 02:18:01

I think that function measures the length in decoded characters, so it can't be used to cut to a specific size in bytes.

Michael Borgwardt 2009-12-28 03:31:52

Could you give me an example of why it doesn't work for cutting to a specific size? In my example it is supposed to cut at 3 characters, but the last one is invalid, so it returns only 2 characters.

S.Mark 2009-12-28 03:44:47

Because the goal is to trim the string to a certain number of *bytes*. The argument given is in terms of decoded code points, which may be 1, 2, 3, or 4 bytes (maybe more with decomposed accented characters) in UTF-8.

Andrew Medico 2009-12-28 04:34:33

I still don't get it. You could pass `1,2,3,4` instead of `3` in my example above.

S.Mark 2009-12-28 04:38:19

I will assume this is wrong, since I got two downvotes. Can anybody enlighten me with an example?

S.Mark 2009-12-28 04:45:46

You could pass 1 as the length in your example and would get a one-character string that requires *two* bytes to encode in UTF-8, not one.

Michael Borgwardt 2009-12-28 04:48:04

If you pass 1, \xc2 alone is invalid, so you will get a zero-length string. Is that wrong?

S.Mark 2009-12-28 04:49:50

No, you will get \xc2\x80 because those two bytes form *one* character in UTF-8 - that's what the "mb" in all those mb_ functions means: multibyte.

Michael Borgwardt 2009-12-28 04:54:26

I may not be good at the mb_ functions, but `mb_internal_encoding('UTF-8');echo mb_strcut("\xc2\x80\xc2\x80\xc2", 0, 1);` gives me a zero-length string for sure. Tested.

S.Mark 2009-12-28 04:57:04

Hm, I don't have a PHP implementation ready at the moment, but could it be because c280 is a control character? Try something like c380, that should definitely count as 1 character.

Michael Borgwardt 2009-12-28 05:08:24

Tested, it's the same. BTW, \xc2 and \x80 are two separate bytes, not the Unicode code point \uC280. And 0xc2 is '0b11000010' in binary while 0xc3 is '0b11000011', so why should they behave differently?

S.Mark 2009-12-28 05:14:51

There's something strange there that I'll have to look into tomorrow. I did mean the two-byte sequence C280 interpreted as UTF-8 rather than the Unicode code point. And C280 might be treated differently from C380 because it is an (apparently unspecified) control character.

Michael Borgwardt 2009-12-28 05:24:00

The wiki example at http://en.wikipedia.org/wiki/UTF-8#Description clearly uses `0xC2 0xA2`, so `0xC2 0x80` is also a valid and proper UTF-8 sequence.

S.Mark 2009-12-28 05:39:56

OK, I tested it and have come to the conclusion that you are right and everyone else is wrong: mb_strcut() actually *does* use byte counts for length - usefully inconsistent with the other mb_ functions and horribly documented in the manual, but it is in fact the best answer to the question.

Michael Borgwardt 2009-12-28 05:43:40

It works like it should: `mb_strcut('áéíóú', 0, 4, 'UTF-8'); // áé` and `strlen(mb_strcut('áéíóú', 0, 4, 'UTF-8')); // 4`.

Alix Axel 2009-12-28 05:54:49

I posted test code in another answer that convinced me I can use `mb_strcut()`.

fsb 2009-12-28 11:56:23

I think it's safe to say that the documentation of `mb_strcut()` in the PHP manual is a bit confusing.

fsb 2009-12-28 12:00:46

"incoherent" is the word I'd use - it seems to have been written by someone with a rather tentative grasp of the English language.

Michael Borgwardt 2009-12-28 14:08:42

Yes, Michael, I think the author is Japanese.

fsb 2009-12-28 14:19:57

To be fair, it's a rather complex subject matter. I can speak Japanese well enough when it comes to everyday subjects, but I'd have a hard time explaining the function as well.

Michael Borgwardt 2009-12-28 15:21:32

Thanks, Michael, for the clarification about the mbstring issue. And hi fsb: by "Japanese", do you mean me or the author of mb_strcut? If me: yes, I speak both, but neither is my native language. Thanks to Alix Axel, too.

S.Mark 2009-12-28 16:51:58

Answer 5

+1 A:

Here's a test for mb_strcut(). It doesn't prove that it does just what we're looking for but I find it pretty convincing.

<?php
ini_set('default_charset', 'UTF-8' );
$strs = array(
    'Iñtërnâtiônàlizætiøn',
    'החמאס: רוצים להשלים את עסקת שליט במהירות האפשרית',
    'ايران لا ترى تغييرا في الموقف الأمريكي',
    '独・米で死傷者を出した銃の乱射事件',
    '國會預算處公布驚人的赤字數據後',
    '이며 세계 경제 회복에 걸림돌이 되고 있다',
    'В дагестанском лесном массиве южнее села Какашура',
    'นายประสิทธิ์ รุ่งสะอาด ปลัดเทศบาล รักษาการแทนนายกเทศมนตรี ต.ท่าทองใหม่',
    'ભારતીય ટીમનો સુવર્ણ યુગ : કિવીઝમાં પણ કમાલ',
    'ཁམས་དཀར་མཛེས་ས་ཁུལ་དུ་རྒྱ་གཞུང་ལ་ཞི་བའི་ངོ་རྒོལ་',
    'Χιόνια, βροχές και θυελλώδεις άνεμοι συνθέτουν το',
    'Հայաստանում սկսվել է դատական համակարգի ձեւավորումը',
    'რუსეთი ასევე გეგმავს სამხედრო');
for ( $i = 10; $i <= 30; $i += 5 ) {
    foreach ($strs as $s) {
        $t = mb_strcut($s, 0, $i, 'UTF-8');
        print(
            sprintf('%3s%3s ', mb_strlen($t, 'UTF-8'), mb_strlen($t, 'latin1'))
            . ( mb_check_encoding($t, 'UTF-8') ? ' OK  ' : ' Bad ' )
            . $t . "\n");
    }
}
?>
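The same idea can be turned into a pass/fail check instead of output to read by eye. The following is only a sketch: the function name strcut_invariants_hold is made up, and it assumes the mbstring extension. For every byte limit from 1 up to the full length, it checks that the mb_strcut() result fits the limit, is valid UTF-8, and is a prefix of the original string.

```php
<?php
// Hypothetical checker: returns true if mb_strcut() behaves as a safe
// byte-limited truncation for every limit between 1 and strlen($s).
function strcut_invariants_hold(string $s): bool
{
    $total = strlen($s);
    for ($limit = 1; $limit <= $total; $limit++) {
        $t = mb_strcut($s, 0, $limit, 'UTF-8');

        if (strlen($t) > $limit) {
            return false; // result is longer than the byte limit
        }
        if (!mb_check_encoding($t, 'UTF-8')) {
            return false; // result ends in a broken sequence
        }
        if (substr($s, 0, strlen($t)) !== $t) {
            return false; // result is not a prefix of the input
        }
    }
    return true;
}
```

Looping strcut_invariants_hold() over the $strs array from the test script above would give a single boolean per string rather than a table to inspect.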

fsb 2009-12-28 11:57:08

Answer 6

A:

In addition to S.Mark's answer, which was mb_strcut(), I recently found another function that does a similar job: grapheme_extract($s, $n, GRAPHEME_EXTR_MAXBYTES); from the intl extension.

The functionality is a bit different: the mb_strcut() documentation claims it cuts at the nearest UTF-8 character boundary, so it doesn't respect multi-character graphemes, whereas grapheme_extract() does. So depending on what you need, grapheme_extract() might be better (e.g. to display a string) or mb_strcut() might be better (e.g. for indexing). Anyway, I just thought I'd mention it.

(And since intl is an ICU wrapper, I have a lot of confidence in it.)
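A small side-by-side sketch of the two functions, assuming both the mbstring and intl extensions are loaded. The function name compare_byte_cuts is invented for illustration, and the behaviour described after the block is what I would expect from the documentation rather than something guaranteed for every PHP version.

```php
<?php
// Hypothetical helper: cut $s to at most $maxBytes bytes with both functions.
function compare_byte_cuts(string $s, int $maxBytes): array
{
    return [
        // Cuts on a code point boundary; may separate a base letter
        // from a combining mark that follows it.
        'mb_strcut' => mb_strcut($s, 0, $maxBytes, 'UTF-8'),
        // Takes whole grapheme clusters only, as many as fit into $maxBytes bytes.
        'grapheme_extract' => grapheme_extract($s, $maxBytes, GRAPHEME_EXTR_MAXBYTES),
    ];
}

// "a" followed by a decomposed "é" ("e" + U+0301):
// 4 bytes, 3 code points, 2 graphemes.
$s = "ae\xCC\x81";
```

With a limit of 3 bytes, compare_byte_cuts($s, 3) should give "ae" for mb_strcut (the accent is dropped) and "a" for grapheme_extract (the whole accented letter is dropped because it does not fit).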

fsb 2010-04-14 17:20:20
