I have the following function from the php.net site to determine the number of bytes in an ASCII or UTF-8 string:

<?php 
/** 
 * Count the number of bytes of a given string. 
 * Input string is expected to be ASCII or UTF-8 encoded. 
 * Warning: the function doesn't return the number of chars 
 * in the string, but the number of bytes. 
 * 
 * @param string $str The string to compute number of bytes 
 * 
 * @return The length in bytes of the given string. 
 */ 
function strBytes($str) 
{ 
  // STRINGS ARE EXPECTED TO BE IN ASCII OR UTF-8 FORMAT 

  // Number of characters in string 
  $strlen_var = strlen($str); 

  // string bytes counter 
  $d = 0; 

 /* 
  * Iterate over every character in the string, 
  * escaping with a slash or encoding to UTF-8 where necessary 
  */ 
  for ($c = 0; $c < $strlen_var; ++$c) { 

      $ord_var_c = ord($str{$d}); 

      switch (true) { 
          case (($ord_var_c >= 0x20) && ($ord_var_c <= 0x7F)): 
              // characters U-00000000 - U-0000007F (same as ASCII) 
              $d++; 
              break; 

          case (($ord_var_c & 0xE0) == 0xC0): 
              // characters U-00000080 - U-000007FF, mask 110XXXXX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=2; 
              break; 

          case (($ord_var_c & 0xF0) == 0xE0): 
              // characters U-00000800 - U-0000FFFF, mask 1110XXXX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=3; 
              break; 

          case (($ord_var_c & 0xF8) == 0xF0): 
              // characters U-00010000 - U-001FFFFF, mask 11110XXX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=4; 
              break; 

          case (($ord_var_c & 0xFC) == 0xF8): 
              // characters U-00200000 - U-03FFFFFF, mask 111110XX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=5; 
              break; 

          case (($ord_var_c & 0xFE) == 0xFC): 
              // characters U-04000000 - U-7FFFFFFF, mask 1111110X 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=6; 
              break; 
          default: 
            $d++;    
      } 
  } 

  return $d; 
} 
?> 

However, when I try this with Russian text (e.g. По своей природе компьютеры могут работать лишь с числами. И для того, чтобы они могли хранить в памяти буквы или другие символы, каждому такому символу должно быть поставлено в соответствие число.), it doesn't seem to return the correct number of bytes.

The switch statement ends up in the default case. Any ideas why Russian characters would not work as expected? Or is there a better option for this?

I am asking because I need to shorten a UTF-8 string to a certain number of bytes: in my situation I can send at most 169 bytes of JSON data to the iPhone APNS (excluding the other packet data).

Reference: PHP strlen - Manual (Paolo Comment on 10-Jan-2007 03:58)

A: 

Try mb_strlen(); it's meant to be used with multibyte strings and handles UTF-8 correctly.

That being said, it is slower than strlen().

To count the bytes, you can use mb_strlen($utf8_string, 'latin1');

echo mb_strlen('åèö', 'latin1'); // 6
Xorlev
Yep, but a multi-byte character is counted as 1, so that doesn't give us the number of bytes...
henchman
A: 

If you wish to find the byte length of a multi-byte string when you are using mbstring.func_overload 2 and UTF-8 strings, then you can use the following:

mb_strlen($utf8_string, 'latin1');
henchman
Doesn't this just give the string length in the # of characters? I need to know the actual number of bytes that is being used. Within utf-8 a character can be more than one byte, correct?
Luke
According to the comments section of http://php.net/manual/en/function.mb-strlen.php (very bottom), it's widely agreed that this function, called in the way described, will count the BYTES. When you tell the function that your input string contains latin1 (ergo: ASCII) chars, it counts every byte as a character, even though the byte may not be a valid character in the ASCII sense. Could you try this out? Sorry, I don't have an mb-enabled environment...
henchman
+1  A: 

In PHP 5, mb_strlen should return the number of characters, and strlen should return the number of bytes.

For instance, this portion of code:

$string = 'По своей природе компьютеры могут работать лишь с числами. И для того, чтобы они могли хранить в памяти буквы или другие символы, каждому такому символу должно быть поставлено в соответствие число';
echo mb_strlen($string, 'UTF-8') . '<br />';
echo strlen($string);

Should give you the following output:

196
359


As a side note: this was one of the things that PHP 6 was meant to change. PHP 6 was planned to use Unicode by default, which would have made strlen return the number of characters rather than the number of bytes. PHP 6 was never released, though, and strlen still returns bytes in PHP 7 and 8.

Pascal MARTIN
Even with PHP5 that's not an assumption you can make. strlen() may or may not be overloaded by mb_strlen(). It's safer just to call mb_strlen($string, 'latin1');
Xorlev
The function I have provided in the question seems to work fine for UTF-8. I believe my problem lies somewhere else, in the iPhone PUSH APNS code. I seem to be able to PUSH around 160 bytes of Japanese, English text etc. However, I can only PUSH around 110 bytes of Cyrillic (Russian) characters.
Luke
I still believe that strlen and mb_strlen cannot be relied on to determine the actual bytes.
Luke
+1  A: 

strlen() returns the number of bytes.

Shortening a multibyte string to a certain number of bytes is a separate task. You will need to take care not to chop the string off in the middle of a multibyte sequence as you shorten it.
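
As a rough illustration of that boundary check, here is a minimal sketch. The helper name truncateUtf8Bytes is hypothetical, and it assumes the input is valid UTF-8 and that strlen() is not overloaded. It cuts at the byte limit and then backs up while the first dropped byte is a continuation byte (10xxxxxx).

```php
<?php
// Hypothetical helper: cut a UTF-8 string to at most $maxBytes bytes
// without splitting a multibyte sequence. Assumes $str is valid UTF-8.
function truncateUtf8Bytes(string $str, int $maxBytes): string
{
    if ($maxBytes <= 0) {
        return '';
    }
    if (strlen($str) <= $maxBytes) {
        return $str;
    }
    $cut = $maxBytes;
    // Continuation bytes look like 10xxxxxx. If the first dropped byte is one,
    // the cut falls inside a character, so move back to that character's lead byte.
    while ($cut > 0 && (ord($str[$cut]) & 0xC0) === 0x80) {
        $cut--;
    }
    return substr($str, 0, $cut);
}

// Each Cyrillic letter is 2 bytes, so a 5-byte limit keeps 2 letters.
echo truncateUtf8Bytes('Привет', 5);
```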

The other thing you need to handle is that when you put a string into JSON notation, it might need more bytes to represent it as JSON. For example, if your string contains a double quote character, it needs to be escaped, and the backslash character will add one byte. There are other characters that need to be escaped too. The point is, it can get larger. I assume the byte limit is on the total JSON payload, so you do need to account for the JSON syntax itself, as well as any escaping that JSON will impose on your string.

An unoptimized, kinda hacky way to do it is to chop the string, at say 5 bytes more than your limit, using substr(). Now use mb_strlen() to get the number of characters, and mb_substr() to remove the last character. Now encode it as JSON, and measure the bytes via strlen(). Enter a loop which keeps chopping off the last character using mb_substr(), encodes as JSON, and again measures the bytes using strlen(). The loop terminates when the number of bytes is acceptable.
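
A minimal sketch of that loop might look like the following. The helper name fitJsonBytes is hypothetical, and the sketch assumes the mbstring extension is loaded and the input is valid UTF-8. It deliberately skips the initial substr() pre-chop, since cutting by bytes can split a character and json_encode() returns false for invalid UTF-8. Note that json_encode() by default writes non-ASCII characters as \uXXXX escapes, so a 2-byte Cyrillic letter takes 6 bytes in the encoded payload, which may be part of why fewer Cyrillic characters fit.

```php
<?php
// Hypothetical helper: shorten $text until its JSON encoding fits $maxBytes.
// Assumes the mbstring extension is loaded and $text is valid UTF-8.
function fitJsonBytes(string $text, int $maxBytes): string
{
    // The size is measured on the encoded form, so the surrounding quotes,
    // escaped quotes and \uXXXX escapes are all included in the count.
    while ($text !== '' && strlen(json_encode($text)) > $maxBytes) {
        // Drop the last character (not the last byte).
        $text = mb_substr($text, 0, mb_strlen($text, 'UTF-8') - 1, 'UTF-8');
    }
    return $text;
}

$short = fitJsonBytes('Привет "мир"', 40);
echo strlen(json_encode($short)); // at most 40
```

If the receiver accepts raw UTF-8, passing JSON_UNESCAPED_UNICODE (PHP 5.4+) to json_encode() keeps non-ASCII characters unescaped and makes the payload smaller.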

chris
I already have a while loop that keeps chopping 1 character at a time using mb_substr until the byte count falls below the limit. strlen doesn't seem to return the same # of bytes as the function in my question. strlen() may or may not be overloaded by mb_strlen() as per other comments, so it shouldn't be relied on.
Luke
So don't overload strlen. If you don't control it, then there are other ways. E.g. while (isset($str[$i])) $i++; will do the trick. Or fwrite() it to a stream or something...
chris
+2  A: 

I am asking this as I need to shorten a utf-8 string to a certain number of bytes.

mb_strcut() does exactly this, though you might not be able to tell from the barely comprehensible documentation.
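
A short usage sketch, assuming the mbstring extension is loaded and the source file is saved as UTF-8 (the sample string and the 7-byte limit are only illustrative). The third argument is a length in bytes, and the cut is moved back so that no character is split.

```php
<?php
// Assumes the mbstring extension is loaded and this file is UTF-8 encoded.
$text = 'Привет, мир';   // 20 bytes in UTF-8: 9 Cyrillic letters, a comma, a space
$cut  = mb_strcut($text, 0, 7, 'UTF-8');

echo $cut . "\n";          // whole characters only
echo strlen($cut) . "\n";  // at most 7
```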

Michael Borgwardt
Thank you, using mb_strcut() is better than mb_substr() for my situation.
Luke
A: 

Count of Bytes <> String length!

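To get the byte count, you can use strlen (PHP 4 and 5). To get the length in characters of a UTF-8 encoded Unicode string, you can use mb_strlen (watch out for the function overloading that extension offers), or you can simply count all bytes that are not continuation bytes, i.e. bytes whose top two bits are not 10.

A byte with the 8th bit set belongs to a multibyte sequence: a lead byte (11xxxxxx) announces that at least one more byte follows, and each following continuation byte has the form 10xxxxxx.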
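
One way to do the byte-inspection count by hand, sketched under the assumption that the input is valid UTF-8 and strlen() is not overloaded: count every byte that is not a continuation byte (10xxxxxx), so each character is counted once, at its first byte. The helper name utf8CharCount is hypothetical.

```php
<?php
// Hypothetical helper: count characters in a valid UTF-8 string by counting
// every byte that is not a continuation byte (10xxxxxx).
function utf8CharCount(string $str): int
{
    $count = 0;
    $bytes = strlen($str); // byte length (strlen is assumed not to be overloaded)
    for ($i = 0; $i < $bytes; $i++) {
        if ((ord($str[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

echo utf8CharCount('åèö'); // 3 characters stored in 6 bytes
```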

Bernd Ott