I'm writing a php script to export MySQL database rows into a .txt file formatted for Adobe InDesign's internal markup.

Exports work, but when I encounter special characters like é or umlauts, I get weird symbols (e.g. ChloÃ« Hanslip instead of Chloë Hanslip). Rather than run a search and replace for every possible weird character, I need a better method.

I've checked that when the text hits the database, it's saved properly - in the database I see the special characters. My export code basically runs some regular expressions to put in the InDesign code tags, and I'm left with the weird symbols. If I just output the text to the browser (rather than prompt for a text file download), it displays properly. When I save the file I use this code:

header("Content-disposition: attachment; filename=test.txt");

header("Content-Type: text/plain; charset=utf-8");

I've tried various combinations of utf8_encode() and iconv() to no avail. Can anybody point me in the right direction here?

+2  A: 

Before exporting, you can use the SET NAMES command to change the encoding used for the connection, e.g.:

SET NAMES utf8;

You can configure this in your MySQL backup software.

Svisstack
+1  A: 

Just call mysql_set_charset('utf8'); in PHP after connecting to the database.

maid450
+4  A: 

InDesign wouldn't be able to use any encoding specified in the header. (It wouldn't even see it, as it's not kept when you save to disc in Windows.) Instead you have to explicitly tell it the encoding in a special tag of its own at the start of the file, such as:

<ANSI-WIN>

Unfortunately, it does not use standard encoding names and there is no tag that InDesign understands that corresponds to UTF-8 encoding at all. The only encoding tag you can use that will allow you to include any character you like is:

<UNICODE-WIN>

which corresponds to UTF-16 (little-endian with BOM), with Windows CRLF line endings. (The only other line ending option is MAC, which you don't want at all as it's old-school pre-OSX Macs where the line ending character was CR.)

So, given a UTF-8 string $s including UTF-8 byte sequences you've pulled out of the database and plain (Unix-Linux-OSX-web-style) LF newlines, you'd write it like this:

$s= "<UNICODE-WIN>\r\n".str_replace("\n", "\r\n", $s);
echo iconv('UTF-8', 'UTF-16', $s);

(Make sure not to output any whitespace before or after, because it'll break the UTF-16 encoding.)
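A slightly more explicit variant, as a rough sketch only: the byte order that iconv produces for the bare 'UTF-16' name can vary between iconv implementations, so this version asks for 'UTF-16LE' and prepends the little-endian BOM by hand. The function name to_indesign_tagged_text is made up for illustration, and the line-ending normalisation is an extra assumption (that the input may already contain some CRLFs).

```php
<?php
// Hypothetical helper (name invented for this sketch): wraps a UTF-8 string
// as <UNICODE-WIN> tagged text, i.e. UTF-16 little-endian with BOM and CRLF.
function to_indesign_tagged_text($utf8)
{
    // Normalise CRLF and lone CR to LF first, so existing CRLFs are not doubled.
    $utf8 = str_replace(array("\r\n", "\r"), "\n", $utf8);

    // Add the encoding tag and switch to Windows line endings.
    $utf8 = "<UNICODE-WIN>\r\n" . str_replace("\n", "\r\n", $utf8);

    // Convert with an explicit byte order; iconv returns false on invalid input.
    $utf16 = iconv('UTF-8', 'UTF-16LE', $utf8);
    if ($utf16 === false) {
        throw new RuntimeException('Input is not valid UTF-8');
    }

    // 'UTF-16LE' emits no BOM, so prepend FF FE manually.
    return "\xFF\xFE" . $utf16;
}
```

You would then echo to_indesign_tagged_text($s) after sending the Content-disposition header, with nothing else written to the output.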

bobince
Thank you for your answer. The InDesign files I'm generating all begin with <ASCII-WIN> - I didn't know about those other options, thank you. However, my problem occurs _before_ the code reaches this point - if I open the text file in Notepad it displays the odd characters - ideally they should be correct before exporting, if that makes sense. I tried the `iconv()` code but it reported an invalid character error - possibly the umlaut?!
Matt Andrews
You can't really trust Notepad: it doesn't know what the encoding is either, it's guessing. If you see “ChloÃ«” in Notepad chances are you've output it correctly in UTF-8 but Notepad is guessing that it's code page 1252 (the system default code page or “ANSI” on Western machines). Get a better text editor (e.g. Notepad++), or if you want to be absolutely sure what you've got, view it in a hex editor (e.g. XVI32) that will show you every damn byte.
bobince
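If no hex editor is to hand, the same check can be roughed out in PHP itself. This is only a sketch: hex_bytes is a made-up helper name, and the sample string is written with byte escapes so the result doesn't depend on how the source file was saved. It dumps the bytes of a UTF-8 "Chloë" and then shows what an editor would display if it guessed code page 1252.

```php
<?php
// Hypothetical helper: space-separated hex dump of a string's raw bytes.
function hex_bytes($s)
{
    return implode(' ', str_split(bin2hex($s), 2));
}

// "Chloë" as UTF-8: ë (U+00EB) is the two bytes C3 AB.
$utf8 = "Chlo\xC3\xAB";
echo hex_bytes($utf8), "\n"; // 43 68 6c 6f c3 ab

// Reading those same bytes as code page 1252 turns C3 into Ã and AB into «.
$misread = iconv('CP1252', 'UTF-8', $utf8);
echo $misread, "\n"; // ChloÃ«
```

So seeing c3 ab where the ë should be means the output really is UTF-8, and the Ã« is only the viewer's wrong guess.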
If `iconv('UTF-8', 'UTF-16', $s)` says invalid character then you've got *something* in there that's not a UTF-8 sequence. However, judging by “ChloÃ«”, I do think it's likely you're getting UTF-8 in general. Maybe you're adding the database string to a string you've made in PHP that has non-UTF-8 sequences, because e.g. you've written `"ä"` in the PHP source and saved it from Notepad as ANSI instead of UTF-8-no-BOM. (Again: better text editor.) If you can't track it down, try cleaning the UTF-8 string using a valid-UTF-8-regexp before use. Or I think `mb_convert_encoding` may ignore the errors?
bobince
Feel like I'm getting somewhere now - you're right, I'm adding other strings to it. Basically, I have a 'template' file which has the skeleton of my InDesign code, with the relevant database fields slotted in. I tried using `mb_convert_encoding` first, then converting to UTF-16 - this resulted in a file that my text editor displayed solely as squares, but Notepad++ (good rec!) displayed it okay. It also had the umlaut in place, hooray! Having problems getting InDesign to recognise it though, so having a play now.
Matt Andrews
Ah, yeah, I think InDesign will export as `ANSI-WIN` by default, so if you're using a template file based on that you'll have non-UTF-8 sequences. Unfortunately you can't use PHP itself to template a UTF-16 file here because PHP is only compatible with ASCII-superset encodings. (In general, as an encoding that isn't a superset of ASCII, UTF-16 is a poor and unusual choice for text files, but it seems that's the only possibility InDesign has for handling non-ASCII characters consistently.)
bobince
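One way to deal with a mixed template, sketched under assumptions: check each piece for UTF-8 validity and convert only the pieces that fail, treating them as code page 1252 (which is what an `ANSI-WIN` template would normally contain on a Western machine). The names is_valid_utf8 and ensure_utf8 are invented here, and the template text is just a placeholder. The `preg_match('//u', ...)` call is used as a validity check because the `u` modifier makes PCRE reject invalid UTF-8 subjects.

```php
<?php
// Hypothetical helper: true if $s is a well-formed UTF-8 byte sequence.
function is_valid_utf8($s)
{
    // With the /u modifier, preg_match returns false for invalid UTF-8.
    return preg_match('//u', $s) === 1;
}

// Hypothetical helper: returns $s as UTF-8, ASSUMING anything that is not
// already valid UTF-8 is Windows code page 1252.
function ensure_utf8($s)
{
    return is_valid_utf8($s) ? $s : iconv('CP1252', 'UTF-8', $s);
}

// Placeholder template saved as ANSI: \xE9 is é in code page 1252.
$template = "Interpr\xE9t\xE9 par %s";
// Database value already in UTF-8.
$name = "Chlo\xC3\xAB Hanslip";

// Normalise both pieces before merging, then convert the result to UTF-16.
$merged = sprintf(ensure_utf8($template), ensure_utf8($name));
```

The guess can be wrong: a short code page 1252 string may happen to be valid UTF-8 as well, so re-saving the template itself as UTF-8 is the more reliable fix where that's possible.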
+1  A: 

Looks like an ISO-8859-1 string is sent as UTF-8...

Make sure your table and fields are in UTF-8 and connect to the database in UTF-8 too. If your table and fields are in UTF-8 and you don't specify the MySQL charset, MySQL will convert the data on the fly to ISO-8859-1 (latin1) - that's the default configuration for all the hosts I've used so far...

This is how I do it (backward compatible with PHP 5.2.2 and earlier):

$conn = mysql_connect('localhost', 'user', 'pass');
//Check both the connection and the database selection
if ($conn === false || !mysql_select_db('dbname', $conn))
{
    //Handle database connection error here
}

if (function_exists('mysql_set_charset'))
    mysql_set_charset('utf8', $conn); //PHP 5.2.3+ only
else
{
    if (mysql_query("SET character_set_results = 'utf8', character_set_client = 'utf8', character_set_connection = 'utf8', character_set_database = 'utf8', character_set_server = 'utf8'", $conn) === false)
    {
        //Unable to set database charset! Handle error here...
    }
}
AlexV
I tried this but to no avail :( I made sure the database AND the fields are set to UTF-8, and I'm connecting to it in UTF-8 too. This has me stumped.
Matt Andrews