I have a MySQL table with 120,000 rows stored in UTF-8. One field, the product name, contains text with many accents. I need to fill a second field with the same name after converting it to a URL-friendly form (ASCII).

Since PHP doesn't directly handle UTF-8, I'm using:

$value = iconv ('UTF-8', 'ISO-8859-1', $value);

to convert the name to ISO-8859-1, followed by a massive strtr statement to replace each accented character with its unaccented equivalent (à becomes a, for example).

However, the original text names were entered with smart quotes, and iconv chokes whenever it comes across one -- I get:

Unknown error type: [8]

iconv() [function.iconv]: Detected an illegal character in input string

To get rid of the smart quotes before using iconv, I have tried using three statements like:

$value = str_replace('’', "'", $value);

(’ is how the three bytes of a UTF-8 smart single quote, E2 80 99, display when read as Windows-1252)

Because the text file is so long, these str_replace calls cause the script to time out every single time.

  1. What is the fastest way to strip out the smart quotes (or any invalid characters) from a UTF-8 string, prior to running iconv?

  2. Or, is there an easier solution to this whole problem? What is the fastest way to convert a name with many accents, in UTF-8, to a name with no accents, spelled correctly, in ASCII?

+2  A: 

What do you mean by "link-friendly"? Since the text between <a>...</a> tags can be anything, the only reading that makes sense to me is "URL-friendly", similar to SO's URLs where everything is converted to [a-z-].

If that's what you're going for, you'll need a transliteration library, not a character set conversion library. (I've had no luck getting iconv() to do the work in the past, but I haven't tried in a while.) There's a beta PHP extension translit that probably does the job.

If you can't add extensions to your PHP install, you'll have to look for a PHP library that does the same thing. I haven't used it, but the PHP UTF-8 library implements a utf8_to_ascii function that I assume does something like what you need.
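If neither an extension nor a library is an option, the general idea can be sketched by hand. This is only a minimal illustration, not what translit or utf8_to_ascii actually do: url_friendly() is a made-up name, the map is deliberately tiny, and it assumes the source file itself is saved as UTF-8. Anything not in the map simply ends up as a hyphen.

```php
<?php
// Hypothetical helper; the name and the short map are illustrative only.
// Assumes this source file is saved as UTF-8.
function url_friendly(string $name): string
{
    // strtr() with an array performs all replacements in a single pass
    // over the string, matching UTF-8 byte sequences directly.
    static $map = [
        'à' => 'a', 'â' => 'a', 'ä' => 'a',
        'é' => 'e', 'è' => 'e', 'ê' => 'e', 'ë' => 'e',
        'î' => 'i', 'ï' => 'i', 'ô' => 'o', 'ö' => 'o',
        'ù' => 'u', 'û' => 'u', 'ü' => 'u', 'ç' => 'c',
        "\u{2018}" => "'", "\u{2019}" => "'",   // smart single quotes
        "\u{201C}" => '"', "\u{201D}" => '"',   // smart double quotes
    ];
    $ascii = strtolower(strtr($name, $map));

    // Collapse every run of bytes outside [a-z0-9] into one hyphen,
    // then drop hyphens left at either end.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $ascii);
    return trim($slug, '-');
}

echo url_friendly("Crème brûlée \u{2019}maison\u{2019}"), "\n";
```

Because the map works on the UTF-8 bytes directly, there is no intermediate conversion to ISO-8859-1, so smart quotes never reach iconv at all.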

(Also, if iconv() is failing like you said, it means that your input isn't actually valid UTF-8, so no amount of replacing valid UTF-8 with anything else will help the problem. EDIT: I may take that back: if ephemient's answer is correct, the iconv error you're seeing may very well be because there's no direct representation of the character in the destination character set. So, never mind.)

chazomaticus
I changed the question to read url-friendly. I can't add extensions to PHP. I checked out the translit library you suggest, but it was about 35% slower than my original solution.
Andrew Swift
A: 

Have you considered using MySQL's REPLACE string function to change the offending strings into apostrophes, or whatever? You may be able to put together the "string to be replaced" part e.g. by using CONCAT on CHAR calls...
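As a rough sketch of what that could look like from PHP: the helper below builds a CHAR(...) expression from the raw bytes of the string to be replaced. The function name mysql_char_expr() and the table and column names (products, name) are placeholders I made up, and since CHAR() accepts several byte values at once, a separate CONCAT may not even be needed.

```php
<?php
// Hypothetical helper: turn the raw bytes of a string into a MySQL
// CHAR(...) expression, e.g. CHAR(226,128,153 USING utf8).
function mysql_char_expr(string $bytes, string $charset = 'utf8'): string
{
    // unpack('C*') returns each byte as an unsigned integer.
    $codes = unpack('C*', $bytes);
    return 'CHAR(' . implode(',', $codes) . ' USING ' . $charset . ')';
}

// "\xE2\x80\x99" is the UTF-8 encoding of the right single smart quote.
$needle = mysql_char_expr("\xE2\x80\x99");

// Table and column names are placeholders. '''' is the SQL literal for
// a one-character string holding a plain apostrophe.
$sql = "UPDATE products SET name = REPLACE(name, $needle, '''')";
echo $sql, "\n";
```

Running one such UPDATE per offending character lets the database do the replacement over all rows, instead of looping over them in PHP.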

Alex Martelli
I started out using str_replace to replace the offending strings, but it slowed the script down too much ($value = str_replace('’', "'", $value); where ’ is the ASCII representation of the offending smart single quote). Can you clarify what you mean by CONCAT on CHAR calls?
Andrew Swift
I suggested doing the REPLACE in SQL, and using CONCAT(CHAR(...), ...) to compose the substring you're trying to replace, byte by byte.
Alex Martelli
+4  A: 

Glibc (and GNU libiconv) support the //TRANSLIT and //IGNORE suffixes.

Thus, on Linux, this works just fine:

$ echo $'\xe2\x80\x99'
’
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1
iconv: illegal input sequence at position 0
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1//translit
'

I'm not sure what iconv is in use by PHP, but the documentation implies that //TRANSLIT and //IGNORE will work there too.
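In PHP that would presumably look something like the sketch below. I haven't verified it against every iconv build: to_ascii() is just a name I picked, the output of //TRANSLIT differs between implementations (and with glibc it may depend on the LC_CTYPE locale, so accented letters can come out as ? unless a UTF-8 locale is set), and the fallback branch is my own addition for builds that reject the suffix.

```php
<?php
// Hypothetical wrapper; to_ascii() is not a PHP built-in.
function to_ascii(string $value): string
{
    // //TRANSLIT asks iconv to approximate characters that have no
    // representation in the target charset. Treat the result as
    // best-effort: it varies by iconv implementation and locale.
    $out = @iconv('UTF-8', 'ASCII//TRANSLIT', $value);

    if ($out === false) {
        // Fallback when the suffix is unsupported or the input is not
        // valid UTF-8: drop every non-ASCII byte.
        $out = preg_replace('/[^\x00-\x7F]/', '', $value);
    }
    return $out;
}

echo to_ascii("l\u{2019}\u{e9}t\u{e9}"), "\n";
```

Going straight to ASCII rather than ISO-8859-1 also means the separate accent-stripping step is no longer needed when the transliteration works.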

ephemient