I'm porting a PHP Web application I wrote from MySQL 5 to SQLite 3. The text encoding for both is UTF-8 (for all fields, tables, and databases). I'm having trouble transferring a geo database with special characters.

mb_detect_encoding() reports UTF-8 for the data returned by both databases.

For example,

Raw output:

MySQL (correct): Dārāb, Iran
SQLite (incorrect): DÄrÄb, Iran

JSON-encoded:

MySQL (correct): D\u0101r\u0101b, Iran
SQLite (incorrect): D\u00c4\u0081r\u00c4\u0081b, Iran

What fixes the problem:

$sqlite_output = utf8_encode($sqlite_output);
$sqlite_output = utf8_decode($sqlite_output);

I imagine there's a way of repairing the SQLite database. Thank you in advance.

A: 

You're probably going to have to transfer the data again from MySQL to SQLite. I don't think you can predictably revert to the proper encoding: it seems the UTF-8 input was interpreted as non-UTF-8 (or vice versa) when the data first arrived in SQLite, so it was not stored in a proper format.

So try the transfer again, making sure every step in the chain from MySQL to SQLite is aware of the UTF-8 encoding.
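
The escaped output in the question (\u00c4\u0081 where \u0101 is expected) is consistent with the UTF-8 bytes having been read as ISO-8859-1 and then encoded to UTF-8 a second time. The following is a minimal sketch under that assumption, not a confirmed diagnosis of the asker's import; the variable names are made up for illustration. It reproduces the damage, shows why encoding detection does not catch it, and reverses one layer with mb_convert_encoding().

```php
<?php
// "Dārāb, Iran" as raw UTF-8 bytes (ā is U+0101, encoded as C4 81).
$correct = "D\xC4\x81r\xC4\x81b, Iran";

// Simulate the suspected damage: treat those bytes as ISO-8859-1
// and convert them to UTF-8 a second time.
$doubleEncoded = mb_convert_encoding($correct, 'UTF-8', 'ISO-8859-1');

// Both strings are valid UTF-8, so a validity check passes for each of them.
$bothValid = mb_check_encoding($correct, 'UTF-8')
    && mb_check_encoding($doubleEncoded, 'UTF-8');

// Escaped form of the damaged string, for comparison with the question.
$escaped = json_encode($doubleEncoded);

// Undo one layer of encoding (assumed repair step; try it on a copy first).
$repaired = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');
```

This only works if every affected value was damaged in the same way. Rows that were stored correctly would be corrupted by the reverse step, which is why re-importing is the more predictable route.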

Alexander Sagen
A: 

Well, thanks for the advice and comments. Unfortunately, no matter which configuration I chose, it wouldn't take. I ended up simply instantiating two PDO objects and inserting one row at a time in a while loop. (I used mysqldump's --no-data option to get the structure and modified that by hand.)

It took about 10 minutes to insert ~10,000 rows equal to 9.4MB of data on my 256MB CentOS box. (So if you're on a shared environment, be wary of the maximum execution time.) The SQLite database now returns proper Unicode data.
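
For anyone repeating this, here is a rough sketch of what such a row-by-row copy could look like. It is not the code I actually ran: copyTable() and its parameters are hypothetical names, it assumes the target table already exists in SQLite, and it assumes both PDO objects use PDO::ERRMODE_EXCEPTION. It also wraps the inserts in a single transaction, which SQLite handles much faster than one commit per row.

```php
<?php
// Hypothetical helper: copy every row of $table from $src to $dst.
// $table and $columns must be trusted identifiers, because they are
// interpolated into the SQL text rather than bound as parameters.
function copyTable(PDO $src, PDO $dst, string $table, array $columns): int
{
    $cols  = implode(', ', $columns);
    $marks = implode(', ', array_fill(0, count($columns), '?'));

    $select = $src->query("SELECT $cols FROM $table");
    $insert = $dst->prepare("INSERT INTO $table ($cols) VALUES ($marks)");

    // One transaction for the whole copy, so SQLite does not commit per row.
    $dst->beginTransaction();
    $copied = 0;
    while ($row = $select->fetch(PDO::FETCH_NUM)) {
        $insert->execute($row);
        $copied++;
    }
    $dst->commit();

    return $copied;
}
```

The source would be a MySQL PDO object whose connection charset is set to UTF-8 (for example through the charset option in the DSN), and the target a PDO object opened on the SQLite file. Host, database name, and credentials are whatever your setup uses.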

Note to self: It's sometimes easier to code a workaround than to find the recommended solution.

Jacob
A: 

The default PHP distribution builds libsqlite in ISO-8859-1 encoding mode. However, this is a misnomer; rather than handling ISO-8859-1, it operates according to your current locale settings for string comparisons and sort ordering. So, rather than ISO-8859-1, you should think of it as being '8-bit' instead.

e-sushi