+3  Q: 

PHP remove accents

What is the most efficient way to remove accents from a string? For example:

"ÈâuÑ" becomes "Eaun"

Is there a simple, built-in way that I'm missing, or a regular expression?

+4  A: 

If you have iconv installed, try this (the example assumes your input string is in UTF-8):

echo iconv('UTF-8', 'ASCII//TRANSLIT', $string);

(iconv is a library to convert between all kinds of encodings; it's efficient and included with many PHP distributions by default. Most of all, it's definitely easier and more error-proof than trying to roll your own solution (did you know that there's a "Latin letter N with a curl"? Me neither.))
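A minimal sketch of the same call wrapped in a function, in case it helps. The name `to_ascii` is hypothetical, and the locale name `en_US.UTF-8` is an assumption that may not exist on your system. On some platforms (glibc in particular) the transliteration result is known to depend on the current locale, and the exact output differs between iconv implementations, so treat this as a starting point rather than a guaranteed result:

```php
<?php
// Hypothetical helper, not part of PHP: wraps iconv() and reports failure as null.
function to_ascii(string $utf8): ?string
{
    // //TRANSLIT asks iconv to approximate characters that ASCII cannot represent.
    // The @ hides the notice iconv() raises on invalid input; we check the result instead.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);

    // iconv() returns false on failure.
    return $ascii === false ? null : $ascii;
}

// Assumption: a UTF-8 locale with this name is installed. If it is not,
// setlocale() just returns false and the current locale stays in effect.
setlocale(LC_CTYPE, 'en_US.UTF-8');

echo to_ascii("ÈâuÑ") ?? 'conversion failed', "\n";
```

Whatever comes back is plain ASCII, but depending on the implementation it may contain stand-in characters such as `?` or accent symbols instead of clean base letters.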

Piskvor
+1 Beat me to it. This should work best. However, note that this tends to fail if there are invalid characters in the input (using `ASCII//TRANSLIT//IGNORE` should help) and as so often, if encountering problems, the User Contributed Notes are a good read. http://www.php.net/manual/en/function.iconv.php
Pekka
For some reason, sometimes I can't get this to work. See http://codepad.viper-7.com/SUufA4 But on another machine, I got "`E^au~N". Not what was desired, though.
Artefacto
Nice, simple and small and works...for me
Mark
This iconv approach has some conflicts, so I will ask a similar question
Mark
+3  A: 

You can use iconv to transliterate the characters to plain US-ASCII and then use a regular expression to remove non-alphabetic characters:

preg_replace('/[^a-z]/i', '', iconv("UTF-8", "US-ASCII//TRANSLIT", $text))

Another way would be using the Normalizer to normalize to the Normalization Form KD (NFKD) and then remove the mark characters:

preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD))
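As a rough sketch, the Normalizer variant can be wrapped like this. The function name `strip_marks` is hypothetical, and the code assumes the intl extension may or may not be loaded, so it checks for the `Normalizer` class first:

```php
<?php
// Hypothetical helper, not part of PHP: decompose, then drop the combining marks.
function strip_marks(string $utf8): ?string
{
    // Normalizer is provided by the intl extension, which is not always enabled.
    if (!class_exists('Normalizer')) {
        return null;
    }

    // NFKD splits "È" into "E" followed by a combining grave accent, and so on.
    $decomposed = Normalizer::normalize($utf8, Normalizer::FORM_KD);
    if ($decomposed === false) {
        return null; // input could not be normalized, e.g. invalid UTF-8
    }

    // \p{Mn} matches nonspacing marks, i.e. the combining accents left behind.
    return preg_replace('/\p{Mn}/u', '', $decomposed);
}

// Prints "EauN" when the intl extension is available.
echo strip_marks("ÈâuÑ") ?? 'intl extension not available', "\n";
```

Unlike the iconv variant, this leaves characters that have no decomposition (for example "ø" or "ß") untouched instead of replacing them.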
Gumbo
`ISO-8859-1`? Are you sure? Won't this leave at least ÄÖÜ in place (as their 8859-1 counterparts)?
Pekka
What’s the reason for the down vote?
Gumbo
Downvote isn't mine. However, the OP is not asking to remove non-alphabetic characters, is he?
Pekka
It was mine. Reverted now that you fixed it.
Artefacto
@Pekka: The transliteration of `ÈâuÑ` using `iconv` gives "`E^au~N". That’s why the following cleanup is used.
Gumbo
@Gumbo I see. I'm sorry, we have had this discussion in a duplicate somewhere already :) +1 for the most complete solution, then, that should be made the accepted one. *Update:* If I had any votes left
Pekka
Can you explain why NFKD?
Artefacto
By the way, what you say and your code don't match once again. FORM_D makes more sense.
Artefacto
@Artefacto: Thanks for the remark; fixed it. And take a look at figure 6 in http://unicode.org/reports/tr15/#Norm_Forms.
Gumbo
@Gumbo OK, I guess it's a matter of preference, though strictly speaking that normalization does more than just separate out the marks. See also the other question of the OP. I took some, erm, inspiration from you (basically only replaced the [a-z] regex you then had with \p{M} and left Normalizer::FORM_D).
Artefacto