I'm attempting to remove accents from characters in a PHP string as the first step towards making the string usable in a URL.

I'm using the following code:

$input = "Fóø Bår";

setlocale(LC_ALL, "en_US.utf8");
$output = iconv("utf-8", "ascii//TRANSLIT", $input);

print($output);

The output I would expect would be something like this:

F'oo Bar

However, instead of the accented characters being transliterated they are replaced with question marks:

F?? B?r

Everything I can find online indicates that setting the locale will fix this problem, but I'm already doing that. I've already checked the following details:

  1. The locale I am setting is supported by the server (included in the list produced by locale -a)
  2. The source and target encodings (UTF-8 and ASCII) are supported by the server's version of iconv (included in the list produced by iconv -l)
  3. The input string is UTF-8 encoded (verified using PHP's mb_check_encoding function, as suggested in the answer by mercator)
  4. The call to setlocale is successful (it returns 'en_US.utf8' rather than FALSE)


The cause of the problem:

The server is using the wrong implementation of iconv. It has the glibc version instead of the required libiconv version.

"Note that the iconv function on some systems may not work as you expect. In such case, it'd be a good idea to install the GNU libiconv library. It will most likely end up with more consistent results."
(from the PHP manual's introduction to iconv)

Details about the iconv implementation that is used by PHP are included in the output of the phpinfo function.
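
As a quicker check than reading the full phpinfo output, the iconv extension also exposes the ICONV_IMPL and ICONV_VERSION constants. The sketch below is only an illustration: describe_iconv is a hypothetical helper name, and the example values in the comments are typical strings rather than an exhaustive list.

```php
<?php
// Hypothetical helper (not from the original post): reports which iconv
// implementation this PHP build was compiled against.
function describe_iconv(): ?array
{
    if (!extension_loaded('iconv')) {
        return null; // the iconv extension is not available at all
    }
    return [
        'impl'    => ICONV_IMPL,    // e.g. "glibc" or "libiconv"
        'version' => ICONV_VERSION, // version string of that implementation
    ];
}

$info = describe_iconv();
if ($info === null) {
    echo "iconv extension not loaded", PHP_EOL;
} else {
    echo "iconv implementation: {$info['impl']} ({$info['version']})", PHP_EOL;
}
```

If the implementation reported is not libiconv, //TRANSLIT may behave differently from what the question expects.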

(I'm not able to re-compile PHP with the correct iconv library on the server I'm working with for this project so the answer I've accepted below is the one that was most useful for removing accents without iconv support.)

+1  A: 

I think the problem here is that your encodings consider ä and å different symbols to 'a'. In fact, the PHP documentation for strtr offers a sample for removing accents the ugly way :(

http://ie2.php.net/strtr
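
A minimal sketch of the strtr approach, assuming the input is UTF-8 and the source file is saved as UTF-8. The function name remove_accents_strtr and the character map are illustrative assumptions; the map covers only a handful of characters and would need extending for real input.

```php
<?php
// Hypothetical helper: maps a small set of accented characters to ASCII.
function remove_accents_strtr(string $text): string
{
    $map = [
        'á' => 'a', 'à' => 'a', 'â' => 'a', 'ä' => 'a', 'å' => 'a',
        'é' => 'e', 'è' => 'e', 'ê' => 'e', 'ë' => 'e',
        'í' => 'i', 'ì' => 'i', 'î' => 'i', 'ï' => 'i',
        'ó' => 'o', 'ò' => 'o', 'ô' => 'o', 'ö' => 'o', 'ø' => 'o',
        'ú' => 'u', 'ù' => 'u', 'û' => 'u', 'ü' => 'u',
        'æ' => 'ae', 'ç' => 'c', 'ñ' => 'n', 'ß' => 'ss',
    ];
    // With an array argument, strtr() replaces whole substrings,
    // so multi-byte UTF-8 keys are matched as complete sequences.
    return strtr($text, $map);
}

echo remove_accents_strtr("Fóø Bår"), PHP_EOL; // prints "Foo Bar"
```

The array form of strtr is used here rather than the two-string form, because the two-string form translates byte by byte and would corrupt multi-byte UTF-8 characters.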

Jeremy Smyth
I think you should probably suggest mb_strstr() instead, as his input is UTF8
karim79
The //TRANSLIT in the iconv call is meant to convert to the nearest available alternative in the target encoding. This should include removing accents, or converting a single character into two, e.g. ñ might become n~
georgebrock
Since the server doesn't support iconv properly, it looks like I'll be doing it this way after all. Thanks Jeremy.
georgebrock
+4  A: 

You could use urlencode. It does not quite do what you want (remove accents), but it will give you a URL-usable string:

$output = urlencode ($input);

In Perl I could use a translate regex, but I cannot think of the PHP equivalent

$input =~ tr/áâàå/aaaa/;

etc...

You could do this using preg_replace:

// The "u" modifier makes PCRE treat the pattern and subject as UTF-8.
// No "|" is needed inside a character class (it would be matched literally).
$patterns[0] = '/[áâàåä]/u';
$patterns[1] = '/[ðéêèë]/u';
$patterns[2] = '/[íîìï]/u';
$patterns[3] = '/[óôòøõö]/u';
$patterns[4] = '/[úûùü]/u';
$patterns[5] = '/æ/u';
$patterns[6] = '/ç/u';
$patterns[7] = '/ß/u';
$replacements[0] = 'a';
$replacements[1] = 'e';
$replacements[2] = 'i';
$replacements[3] = 'o';
$replacements[4] = 'u';
$replacements[5] = 'ae';
$replacements[6] = 'c';
$replacements[7] = 'ss';

$output = preg_replace($patterns, $replacements, $input);

(Please note this was typed from a foggy, beer-ridden Friday afternoon memory, so it may not be 100% correct.)

Or you could make a hash table and do the replacement based on that.

Xetius
The PHP equivalent of tr/.../.../ is strtr.
streetpc
A: 

You can use this class for removing unwanted characters, but it still does not solve your problem.

openidsujoy
utf8_decode assumes ISO-8859-1 encoding and replaces everything else with "?", which is quite a poor solution (and some accented characters will remain anyway).
porneL
Yes, you are right.
openidsujoy
+1  A: 

I agree with georgebrock's comment.

If you find a way to get //TRANSLIT to work, you can build friendly URLs:

  1. use iconv with //TRANSLIT, e.g. ñ => n~
  2. remove non-alphanumeric, non-whitespace chars inside words: $url = preg_replace( '/(\w)[^\w\s](\w)/', '$1$2', $url );
  3. replace remaining separators (this assumes the string is already lowercase): $url = preg_replace( '/[^a-z0-9]+/', '-', $url );
  4. remove double/leading/trailing '-', e.g. $url = preg_replace( '/(?:(^|\-)\-+|\-$)/', '$1', $url );

If you can't get it to work, replace step 1 with strtr/character-based replacement, like Xetius' solution.
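
Steps 2 to 4 can be sketched as a single function. This is only an illustration under some assumptions: the input has already been transliterated to ASCII (step 1), slugify_ascii is a hypothetical name, the string is lowercased first so that the a-z pattern in step 3 applies, and trim() stands in for the regex in step 4 because step 3 already collapses runs of separators into one hyphen.

```php
<?php
// Hypothetical helper: turns an already-transliterated ASCII string into a URL slug.
function slugify_ascii(string $text): string
{
    // Lowercase first so that the [^a-z0-9] pattern below keeps all letters.
    $url = strtolower($text);
    // Step 2: drop a punctuation character sandwiched between two word characters.
    $url = preg_replace('/(\w)[^\w\s](\w)/', '$1$2', $url);
    // Step 3: replace every remaining run of non-alphanumerics with one hyphen.
    $url = preg_replace('/[^a-z0-9]+/', '-', $url);
    // Step 4: remove leading and trailing hyphens.
    return trim($url, '-');
}

echo slugify_ascii("F'oo Bar"), PHP_EOL;          // prints "foo-bar"
echo slugify_ascii("  Hello,  World! "), PHP_EOL; // prints "hello-world"
```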

streetpc
+1  A: 

I can't reproduce your problem. I get the expected result.

How exactly are you using mb_detect_encoding() to verify your string is in fact UTF-8?

If I simply call mb_detect_encoding($input) on both a UTF-8 and ISO-8859-1 encoded version of your string, both of them return "UTF-8", so that function isn't particularly reliable.

iconv() gives me a PHP "notice" when it gets the wrongly encoded string and only echoes "F", but that might just be because of different PHP/iconv settings/versions (?).

I suggest you try calling mb_check_encoding($input, "utf-8") first to verify that your string really is UTF-8. I think it probably isn't.
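
A small sketch of that check, assuming the mbstring extension is loaded. The helper name is_valid_utf8 is hypothetical, and the test strings are written as byte escapes so the result does not depend on how the source file itself is saved.

```php
<?php
// Hypothetical helper wrapping mb_check_encoding (requires the mbstring extension).
function is_valid_utf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

$utf8   = "F\xC3\xB3\xC3\xB8 B\xC3\xA5r"; // "Fóø Bår" encoded as UTF-8
$latin1 = "F\xF3\xF8 B\xE5r";             // the same text encoded as ISO-8859-1

var_dump(is_valid_utf8($utf8));   // bool(true)
var_dump(is_valid_utf8($latin1)); // bool(false)
```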

mercator
Thanks for the tip. mb_check_encoding($input, "utf-8") is returning TRUE. Also, I was already using error_reporting(E_ALL) so there shouldn't be any errors slipping past me.
georgebrock
Hmmm, I see your point. I tried it on another machine now and that returns "Fo? Bar". What PHP and iconv versions are you using?
mercator
I think it is the iconv version that is at fault - this server is using the glibc version instead of the libiconv version.
georgebrock
Thanks mercator, you were really helpful.
georgebrock
Thanks for your explanation as well. I didn't realise it wasn't just version numbers. The difference on my end was also due to the different iconv implementations.
mercator
A: 

One of the tricks I stumbled upon on the web was to use htmlentities and then strip the encoded characters:

$stripped = preg_replace('`&[^;]+;`','',htmlentities($string));

It's not perfect, but it works well in some cases.
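
Note that the one-liner above deletes the accented character entirely. A common variant, sketched here as an assumption rather than as the code from this answer, keeps the base letter by capturing the first letter of the entity name. The function name remove_accents_entities and the list of accent suffixes are illustrative; ligatures such as æ and ß are left untouched.

```php
<?php
// Hypothetical helper: strips accents by going through HTML named entities.
function remove_accents_entities(string $text): string
{
    // "é" becomes "&eacute;", "ø" becomes "&oslash;", and so on.
    $encoded = htmlentities($text, ENT_QUOTES, 'UTF-8');
    // Keep the base letter; drop the accent name and the entity delimiters.
    $stripped = preg_replace(
        '/&([a-zA-Z])(?:acute|grave|circ|uml|tilde|ring|cedil|slash|caron);/',
        '$1',
        $encoded
    );
    // Turn any remaining entities (such as &amp;) back into characters.
    return html_entity_decode($stripped, ENT_QUOTES, 'UTF-8');
}

echo remove_accents_entities("F\xC3\xB3\xC3\xB8 B\xC3\xA5r"), PHP_EOL; // prints "Foo Bar"
```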

But since you're writing about creating a URL string, urlencode and its counterpart urldecode may be better. Or, if you are creating a query string, use http_build_query.
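
For illustration, this is roughly what the two functions produce for the string from the question. The parameter names q and page are made up for the example, and the argument separator is passed explicitly so the output does not depend on the server's ini settings.

```php
<?php
$input = "F\xC3\xB3\xC3\xB8 B\xC3\xA5r"; // "Fóø Bår" as UTF-8 bytes

// urlencode() percent-encodes each non-ASCII byte and turns spaces into "+".
$encoded = urlencode($input);
echo $encoded, PHP_EOL; // prints "F%C3%B3%C3%B8+B%C3%A5r"

// http_build_query() applies the same encoding to every value in an array.
$query = http_build_query(['q' => $input, 'page' => 2], '', '&');
echo $query, PHP_EOL; // prints "q=F%C3%B3%C3%B8+B%C3%A5r&page=2"

// urldecode() reverses the encoding.
var_dump(urldecode($encoded) === $input); // bool(true)
```

The accents are preserved rather than removed, so this suits query strings better than human-readable URL slugs.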

A: 

I had the BOM characters showing. It was extremely frustrating, until I copied and pasted my PHP code into a new document and saved it over the original. Worked like a charm.

Nicholas Maietta