ansaurus

Question

PHP: Fixing encoding issues with database content - removing accents from characters

Answer 1

+1 A:

To transform a UTF-8 string into a URL-safe string you should use:

$str = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $str);

The IGNORE part tells iconv() to silently drop any character it can't convert instead of failing (PHP's iconv() doesn't throw an exception; it emits a notice and returns false), and the TRANSLIT part converts a UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).

The next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within a URL, either with preg_replace() or urlencode().
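The two steps above could be combined roughly like this. It is only a sketch: the function name slugify() and the exact set of characters kept are my own assumptions, and what TRANSLIT produces for a given character depends on the iconv library and locale of the server, so test it on your own setup.

```php
<?php
// Hypothetical helper: the name slugify() and the character whitelist
// are illustrative assumptions, not something the answer prescribes.
function slugify(string $text): string
{
    // Try to transliterate to ASCII. The @ hides the notice iconv() emits
    // on failure; the TRANSLIT result depends on the iconv library and locale.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $text);
    if ($ascii === false) {
        // Fallback: simply drop every byte outside printable ASCII.
        $ascii = preg_replace('/[^\x20-\x7E]/', '', $text);
    }
    $ascii = strtolower(trim($ascii));
    // Whitespace runs become a single underscore.
    $ascii = preg_replace('/\s+/', '_', $ascii);
    // Drop whatever is still not a letter, digit, underscore or hyphen.
    return preg_replace('/[^a-z0-9_-]/', '', $ascii);
}

echo slugify('Hello World'), "\n"; // hello_world
```

Whatever the input, the result contains only a-z, 0-9, underscores and hyphens, so it can go into a URL path without further encoding.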

As for the database stuff, you really should have sorted out all these settings before INSERTing UTF-8 content. Changing the charset of an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry: the database will return, byte for byte, exactly what you stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
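To see why the label and the bytes are separate things, here is a small illustration without any database involved. It assumes the stored value is 'ú' encoded as UTF-8 and only shows what happens when those same bytes are read under a different label.

```php
<?php
// "ú" as stored when the application sent UTF-8: two bytes, C3 BA.
$stored = "\xC3\xBA";

// Relabelling does not touch the bytes; they are still C3 BA.
// Reading them as if they were ISO-8859-1 and converting to UTF-8
// turns each byte into its own character ("Ã" and "º").
$misread = iconv('ISO-8859-1', 'UTF-8', $stored);

echo bin2hex($stored), "\n";   // c3ba
echo bin2hex($misread), "\n";  // c383c2ba
```

The first value is what you get back if you keep treating the data as UTF-8; the second is the familiar 'Ãº' garbage you get when something along the way decides the bytes are Latin-1.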

djn 2010-08-07 19:22:04

I fixed the issue by re-saving the file itself in a proper UTF-8 encoding and everything worked, but thanks for the tips - I wasn't sure if changing the database encoding would have any impact on the stuff already in there, thank you for the clarification.

Matt Andrews 2010-08-08 10:47:43

Answer 2

+1 A:

I'm trying to make a URL-safe version of a string.

Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses that include non-ASCII characters, e.g.:

http://en.wikipedia.org/wiki/Medúlla

This is a valid IRI. For inclusion in a URI, you should encode it as UTF-8 and then %-encode the resulting bytes:

http://en.wikipedia.org/wiki/Med%C3%BAlla

Either way, most browsers (though not always IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.

the conversion function doesn't see the ú character

What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
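A quick illustration of both points, assuming the title is held as UTF-8. The 'ú' is written as byte escapes here so the example doesn't depend on how the source file itself was saved.

```php
<?php
$title = "Med\xC3\xBAlla"; // "Medúlla" in UTF-8, written as escapes

// Path component: rawurlencode() percent-encodes the UTF-8 bytes.
$url = 'http://en.wikipedia.org/wiki/' . rawurlencode($title);
echo $url, "\n"; // http://en.wikipedia.org/wiki/Med%C3%BAlla

// The two functions differ on spaces, which is why urlencode()
// belongs in query strings rather than paths.
echo rawurlencode('a b'), "\n"; // a%20b
echo urlencode('a b'), "\n";    // a+b
```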

If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1 (the default before PHP 5.4), and consequently screw up all your non-ASCII characters.

Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.

(*) Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.
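For comparison, here is what the two functions do with the same UTF-8 input when the charset is passed explicitly. The sample string is made up for the example, and the accented letters are written as byte escapes.

```php
<?php
$text = "<b>caf\xC3\xA9 & t\xC3\xBA</b>"; // "<b>café & tú</b>" in UTF-8

// Only the HTML-special characters are escaped; é and ú pass through as-is.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Every character that has a named entity is replaced, so the output grows.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";  // &lt;b&gt;café &amp; tú&lt;/b&gt;
echo $entities, "\n"; // &lt;b&gt;caf&eacute; &amp; t&uacute;&lt;/b&gt;
```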

(even a simple str_replace() doesn't work either).

If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.

It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write "\xc3\xba" (in double quotes, since PHP doesn't interpret \x escapes in single-quoted strings) to avoid the problem.
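A small sketch of the escape approach, assuming the data being handled is UTF-8. Note that PHP only interprets \x escapes inside double-quoted strings; in single quotes the same text is eight literal characters and never matches.

```php
<?php
$name = "Med\xC3\xBAlla"; // "Medúlla" as UTF-8 bytes

// Double quotes: "\xC3\xBA" is the two-byte UTF-8 sequence for ú.
$replaced = str_replace("\xC3\xBA", 'u', $name);

// Single quotes: '\xC3\xBA' is a backslash, an x, and so on. No match.
$untouched = str_replace('\xC3\xBA', 'u', $name);

echo $replaced, "\n";          // Medulla
echo strlen("\xC3\xBA"), "\n"; // 2
echo strlen('\xC3\xBA'), "\n"; // 8
```

This works regardless of the encoding the .php file was saved in, because the source contains only ASCII.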

Running SET NAMES utf8 on the database before querying

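Use mysql_set_charset() in preference (or mysqli_set_charset() with the mysqli extension), so that the client library also knows which encoding is in use.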

bobince 2010-08-07 20:39:56

You were right about the text editor saving in ANSI - I resaved (in a better text editor...) in UTF-8 and everything worked. Thank you!

Matt Andrews 2010-08-08 10:46:54
