Hi,

I get UTF-8 text from a database, and I want to show only the first $len characters (ending on a whole word). I've tried several options, but the function still doesn't work because of special characters (á, é, í, ó, etc.).

Thanks for the help!

function text_limit($text, $len, $end='...')
{ 

  mb_internal_encoding('UTF-8');
  if( (mb_strlen($text, 'UTF-8') > $len) ) { 

    $text = mb_substr($text, 0, $len, 'UTF-8');
    $text = mb_substr($text, 0, mb_strrpos($text," ", 'UTF-8'), 'UTF-8');

    ...
  }
}

Edit to add an example

If I truncate a text with 65 characters, it returns:

Un jardín de estilo neoclásico acorde con el …

If I change the special characters (í, á), then it returns:

Un jardin de estilo neoclasico acorde con el Palacio de …

I'm sure there is something strange with the encoding or the server, or php; but I can't figure it out! Thanks!

Final Solution

I'm using this UTF8 PHP library and everything works now...

+1  A: 

Use mb_substr(). The first argument is the string to cut, the second is the starting position, the third is the length, and the last is the encoding.

mb_substr ("String", 0, $len, 'utf-8');
Kelly Copley
this would return Str if $len was 3
Kelly Copley
mmm I'm already using that function...
fesja
woops, sorry looked over it fast and only saw strlen.
Kelly Copley
+2  A: 
mb_strrpos($text," ", 'UTF-8')

You are not passing enough args to mb_strrpos() (you have omitted the offset - 3rd param, the encoding is the 4th param), try:

mb_strrpos($text," ", 0, 'UTF-8')

With the 2nd line omitted, though, it looks OK. You say "I want to show only the first $len characters (finishing in a word)", so is the 2nd line there to make sure it finishes on a whole word?

EDIT: mb_substr() should be cutting at $len number of characters, not bytes. Are you sure the original text is actually UTF-8 and not some other encoding?
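For reference, here is a minimal sketch of the whole function with the offset passed to mb_strrpos() as suggested above. The body after the two mb_substr() calls is elided in the question, so the return handling and the "no space found" check are assumptions, not the asker's actual code. It also assumes the mbstring extension is loaded and the source file is saved as UTF-8.

```php
<?php
// Hypothetical completion of text_limit(); everything after the two
// mb_substr() calls is an assumption, since the question elides it.
function text_limit($text, $len, $end = '...')
{
    // Nothing to do when the text already fits.
    if (mb_strlen($text, 'UTF-8') <= $len) {
        return $text;
    }

    // Cut at $len characters (not bytes).
    $cut = mb_substr($text, 0, $len, 'UTF-8');

    // Find the last space, passing the offset (0) before the encoding.
    $lastSpace = mb_strrpos($cut, ' ', 0, 'UTF-8');

    // Drop the trailing partial word only if a space was found.
    if ($lastSpace !== false) {
        $cut = mb_substr($cut, 0, $lastSpace, 'UTF-8');
    }

    return $cut . $end;
}

// Sample taken from the question; cut at 50 characters.
echo text_limit('Un jardín de estilo neoclásico acorde con el Palacio de', 50), "\n";
// prints: Un jardín de estilo neoclásico acorde con el...
```

Like the original, this always drops the last word of the cut, even when the cut happens to land exactly on a word boundary.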

w3d
Thanks for that correction, but it doesn't work. That 2nd line deletes the last incomplete word (it searches for the last space and cuts the text at that position).
fesja
I'm using 'mb_check_encoding($string, 'UTF-8');' to check that the string is UTF-8 encoded. My databases are in UTF-8, and my symfony system has UTF-8 as its default charset. Any ideas on what to check? Thanks!
fesja
A: 

How about trying mb_strcut()? It takes the same parameters as mb_substr(), although it counts the start and length in bytes rather than characters.
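A small sketch of how the two functions differ on the same input, in case it helps decide between them. It assumes the mbstring extension is loaded and the source file is saved as UTF-8; the variable names are only for illustration.

```php
<?php
$text = 'áéíó'; // 4 characters, 8 bytes in UTF-8

// mb_substr() counts characters: this takes the first 3 characters.
$byChars = mb_substr($text, 0, 3, 'UTF-8');

// mb_strcut() counts bytes but does not split a character:
// 3 bytes cover 'á' (2 bytes) plus half of 'é', so only 'á' comes back.
$byBytes = mb_strcut($text, 0, 3, 'UTF-8');

echo $byChars, "\n"; // áéí
echo $byBytes, "\n"; // á
```

So with mb_strcut() the $len in the question would act as a byte budget, and accented text would come out shorter than unaccented text of the same length.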

Kelly Copley
A: 

OK, it has been baffling me that you can't get this to work, because it should work just fine. I think I have finally come up with the reason it isn't working for you.

What I think is going on is that you are outputting UTF-8 characters but your browser is displaying them in the wrong encoding.

You have a couple of options. First, if you are displaying any of this as part of an HTML page, check your meta tags to see whether they set the character encoding. If so, change it to this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Next, if you are outputting this directly to the browser, use the header function to set the character encoding like so:

header("Content-type: text/html; charset=utf-8");

An easy test:

<?php
    header("Content-type: text/html; charset=utf-8");
    $text = "áéíó";
    echo mb_substr($text, 0, 3, 'utf-8');
?>

Without this, your browser will default to another encoding and display the text improperly. Hopefully this helps you fix the issue; if not, I'll keep trying :)

Kelly Copley
The OP said that there's no problem with output until the function text_limit is used. Therefore meta tag is IMO set to UTF-8. BTW: Try to use edit instead of adding new and new answers ;-)
MartyIX
Thanks a lot Kelly, but that wasn't the problem, as MartyIX said. The solution: using the following UTF8 library, it just works now, don't ask me why: http://tarski.googlecode.com/svn/branches/1.6/library/feedparser/lib-utf8.php
fesja
A: 

This could be because your original solution truncated the string to 65 bytes. In an ASCII-only context that equals 65 characters, but it no longer does once UTF-8's multi-byte characters are involved. A string truncated to 65 bytes contains a variable number of characters, depending on how many bytes each character takes. It is also dangerous, because you could cut a character in half (splitting its bytes).
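To illustrate the byte/character mismatch described here (not necessarily the cause of the asker's problem, since the question already uses the mb_* functions), a short sketch. It assumes the mbstring extension is loaded and the source file is saved as UTF-8.

```php
<?php
$text = 'jardín'; // 6 characters; 'í' takes 2 bytes in UTF-8

echo strlen($text), "\n";             // 7 (bytes)
echo mb_strlen($text, 'UTF-8'), "\n"; // 6 (characters)

// A byte-based cut after 5 bytes splits 'í' and leaves invalid UTF-8.
$broken = substr($text, 0, 5);
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)

// A character-based cut keeps 'í' intact.
$ok = mb_substr($text, 0, 5, 'UTF-8');
var_dump(mb_check_encoding($ok, 'UTF-8'));     // bool(true)
```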

Delan Azabani