I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should UTF-8 and %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode()
will correctly spit out %C3%BA
for ú
, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode()
also gives the same results, but it should only be used for query components.)
If you mean htmlentities()
... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset
argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars()
. This gives smaller output, and it doesn't matter(*) if you forget to include the $charset
argument, since all it changes is a couple of characters like <
and &
.
(Actually it could matter for some East Asian multibyte character sets where <
could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace()
doesn't work either).
If you wrote str_replace(..., 'ú', ...)
in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write '\xc3\xba'
to avoid the problem.
Running SET NAMES utf8
on the database before querying
Use mysql_set_charset()
in preference.