How do I replace:
- "ã" with "a"
- "é" with "e"
in PHP? Is this possible? I've read somewhere I could do some math with the ascii value of the base character and the ascii value of the accent, but I can't find any references now.
thanks
How do I replace:
in PHP? Is this possible? I've read somewhere I could do some math with the ascii value of the base character and the ascii value of the accent, but I can't find any references now.
thanks
Check out the Normalizer class to do this. The documentation is good, so I'll just link it instead of repeating things here:
http://www.php.net/manual/en/class.normalizer.php
Specifically, the normalize member of that class:
http://www.php.net/manual/en/normalizer.normalize.php
Note that Unicode normalization has several forms, and you seem to want Normalization Form KD (NFKD) Compatibility Decomposition, though you should read the documentation to make sure.
You shouldn't try to roll your own function for this: There's way too many things that can go wrong, and using the provided function is a much better idea.
The below function was obtained from: http://php.net/manual/en/function.chr.php
Add the function, then simply call it like so:
echo normalize_special_characters($outputvariable); // outputs converted variable
/*==================================
Replaces special characters with non-special equivalents
==================================*/
function normalize_special_characters( $str )
{
# Quotes cleanup
$str = ereg_replace( chr(ord("`")), "'", $str ); # `
$str = ereg_replace( chr(ord("´")), "'", $str ); # ´
$str = ereg_replace( chr(ord("„")), ",", $str ); # „
$str = ereg_replace( chr(ord("`")), "'", $str ); # `
$str = ereg_replace( chr(ord("´")), "'", $str ); # ´
$str = ereg_replace( chr(ord("“")), "\"", $str ); # “
$str = ereg_replace( chr(ord("”")), "\"", $str ); # ”
$str = ereg_replace( chr(ord("´")), "'", $str ); # ´
$unwanted_array = array( 'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );
$str = strtr( $str, $unwanted_array );
# Bullets, dashes, and trademarks
$str = ereg_replace( chr(149), "•", $str ); # bullet •
$str = ereg_replace( chr(150), "–", $str ); # en dash
$str = ereg_replace( chr(151), "—", $str ); # em dash
$str = ereg_replace( chr(153), "™", $str ); # trademark
$str = ereg_replace( chr(169), "©", $str ); # copyright mark
$str = ereg_replace( chr(174), "®", $str ); # registration mark
return $str;
}
People often use str_replace
or strtr
and a big list of character to convert "from" and "to" -- even if that doesn't look quite pretty...
Another solution, I suppose, might be using something like iconv
with the option //TRANSLIT
-- but doesn't always work, from what I remember...
Also, if you are using PHP 5.3, the new Normalizer
class might be interesting ;-)
If you don't have access to the Normalizer class or just don't wish to use it you can use the following function to replace most (all?) of the common accentuations.
function Unaccent($string)
{
return preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_QUOTES, 'UTF-8'));
}
Especially when matching texts against each-other or against keywords, it is helpful to normalize the texts before. The following function removes all diacritics (marks like accents) from a given UTF8-encoded texts and returns ASCii-text.
Be sure to have the PHP-Normalizer-extension (intl and icu) installed.
Tipp: You may also want to map the text to lower case before execute matching procedures ...
<?php
function normalizeUtf8String( $s)
{
// Normalizer-class missing!
if (! class_exists("Normalizer", $autoload = false))
return $original_string;
// maps German (umlauts) and other European characters onto two characters before just removing diacritics
$s = preg_replace( '@\x{00c4}@u' , "AE", $s ); // umlaut Ä => AE
$s = preg_replace( '@\x{00d6}@u' , "OE", $s ); // umlaut Ö => OE
$s = preg_replace( '@\x{00dc}@u' , "UE", $s ); // umlaut Ü => UE
$s = preg_replace( '@\x{00e4}@u' , "ae", $s ); // umlaut ä => ae
$s = preg_replace( '@\x{00f6}@u' , "oe", $s ); // umlaut ö => oe
$s = preg_replace( '@\x{00fc}@u' , "ue", $s ); // umlaut ü => ue
$s = preg_replace( '@\x{00f1}@u' , "ny", $s ); // ñ => ny
$s = preg_replace( '@\x{00ff}@u' , "yu", $s ); // ÿ => yu
// maps special characters (characters with diacritics) on their base-character followed by the diacritical mark
// exmaple: Ú => U´, á => a`
$s = Normalizer::normalize( $s, Normalizer::FORM_D );
$s = preg_replace( '@\pM@u' , "", $s ); // removes diacritics
$s = preg_replace( '@\x{00df}@u' , "ss", $s ); // maps German ß onto ss
$s = preg_replace( '@\x{00c6}@u' , "AE", $s ); // Æ => AE
$s = preg_replace( '@\x{00e6}@u' , "ae", $s ); // æ => ae
$s = preg_replace( '@\x{0132}@u' , "IJ", $s ); // ? => IJ
$s = preg_replace( '@\x{0133}@u' , "ij", $s ); // ? => ij
$s = preg_replace( '@\x{0152}@u' , "OE", $s ); // Œ => OE
$s = preg_replace( '@\x{0153}@u' , "oe", $s ); // œ => oe
$s = preg_replace( '@\x{00d0}@u' , "D", $s ); // Ð => D
$s = preg_replace( '@\x{0110}@u' , "D", $s ); // Ð => D
$s = preg_replace( '@\x{00f0}@u' , "d", $s ); // ð => d
$s = preg_replace( '@\x{0111}@u' , "d", $s ); // d => d
$s = preg_replace( '@\x{0126}@u' , "H", $s ); // H => H
$s = preg_replace( '@\x{0127}@u' , "h", $s ); // h => h
$s = preg_replace( '@\x{0131}@u' , "i", $s ); // i => i
$s = preg_replace( '@\x{0138}@u' , "k", $s ); // ? => k
$s = preg_replace( '@\x{013f}@u' , "L", $s ); // ? => L
$s = preg_replace( '@\x{0141}@u' , "L", $s ); // L => L
$s = preg_replace( '@\x{0140}@u' , "l", $s ); // ? => l
$s = preg_replace( '@\x{0142}@u' , "l", $s ); // l => l
$s = preg_replace( '@\x{014a}@u' , "N", $s ); // ? => N
$s = preg_replace( '@\x{0149}@u' , "n", $s ); // ? => n
$s = preg_replace( '@\x{014b}@u' , "n", $s ); // ? => n
$s = preg_replace( '@\x{00d8}@u' , "O", $s ); // Ø => O
$s = preg_replace( '@\x{00f8}@u' , "o", $s ); // ø => o
$s = preg_replace( '@\x{017f}@u' , "s", $s ); // ? => s
$s = preg_replace( '@\x{00de}@u' , "T", $s ); // Þ => T
$s = preg_replace( '@\x{0166}@u' , "T", $s ); // T => T
$s = preg_replace( '@\x{00fe}@u' , "t", $s ); // þ => t
$s = preg_replace( '@\x{0167}@u' , "t", $s ); // t => t
// remove all non-ASCii characters
$s = preg_replace( '@[^\0-\x80]@u' , "", $s );
// possible errors in UTF8-regular-expressions
if (empty($s))
return $original_string;
else
return $s;
}
?>
The above function is mainly based on the following article: http://ahinea.com/en/tech/accented-translate.html
use PEAR I18N_UnicodeNormalizer-1.0.0
include('…');
echo preg_replace(
'/(\P{L})/ui', // replace all except members of Unicode class "letters", case insensitive
'', // with nothing
I18N_UnicodeNormalizer::toNFKD('ÅÉÏÔÙåéïôù') // ù → u + `
);
→ AEIOUaeiou