I am having great problems solving this one:

I have a MySQL database with encoding latin1_swedish_ci and a table that stores names and addresses.

I am trying to output a UTF-8 XML file, but I am having problems with the following string:

The string Otivägen is being output as OtivÃ¤gen when I open the file in vim. Also, when I open it in IE I get:

"An invalid character was found in text content. Error processing resource"

I have the following code:

function fixEncoding($in_str)
{
    $cur_encoding = mb_detect_encoding($in_str) ;
    if($cur_encoding == "UTF-8" && mb_check_encoding($in_str,"UTF-8"))
        return $in_str;
    else
        return utf8_encode($in_str);
}

header("Content-type: text/plain;charset=utf-8");
$mystring = "Otivägen"; // this is actually obtained from the database

$myxml = "<myxml>
....
     <node>".$mystring."</node>
....
</myxml>
";
$myxml = fixEncoding($myxml);

The actual XML output is below:

<?xml version="1.0" encoding="UTF-8" ?>
<myxml>
    ....
    <node>OtivÃ¤gen</node>
    ....
</myxml>

Any ideas how I can output the file so that in vim it reads Otivägen and not OtivÃ¤gen?

EDIT:

I called mysql_client_encoding() and got latin1.
I then called mysql_set_charset()
and ran mysql_client_encoding() again, which returned utf8, but I still have the same output issue.

Edit 2

I have logged into the command line and run the query SELECT address1 FROM address WHERE id = 1000;

SELECT address1 FROM address WHERE id = 1000;
Current database: ftpuser_db

+-------------+
|   address1  |
+-------------+
| Otivägen 32 |
+-------------+
1 row in set (0.06 sec)

Thanks in advance!

+1  A: 

Is your MySQL connection encoding properly set to UTF-8?

Check mysql_set_charset() and mysql_client_encoding() for more details.

Wookai
I did mysql_client_encoding() and got latin1, then I did mysql_set_charset() and ran mysql_client_encoding() again and got utf8, but still the same issue.
Lizard
Have you tried applying `fixEncoding()` separately on each `$myString`, instead of once on the whole `$myXml`?
Wookai
Yup, I tried it on `$mystring`, but this didn't change anything.
Lizard
What does `mb_detect_encoding()` give you on `$myString`?
Wookai
It actually outputs UTF-8.
Lizard
A: 

latin1_swedish_ci is a collation, not a charset. Since collations are supposed to match their charset, it suggests that the table is using latin1, but it's not a guarantee.

Strictly speaking, the charset of the tables is irrelevant here, since MySQL can convert input/output. That's what the connection charset (mysql_set_charset) is for. However, for that to work properly, the data needs to be encoded properly in the database. I would begin by checking that the strings are correct in the database. The simplest thing is to log in on the command line and select a row which has non-ASCII characters in it. Does it look OK?

$mystring = "Otivägen"; // this is actually obtained from the database

Watch out. The encoding of the data in $mystring will now depend on the encoding of the php file. That may or may not be the same as the data in the database.
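
One way to take the terminal and the editor out of the picture is to look at the raw bytes of the value. This is only a minimal sketch: it assumes the mbstring extension is loaded, and describeBytes() is a hypothetical helper, not something from the question. The test strings use byte escapes so the result does not depend on the encoding of the PHP file.

```php
<?php
// Hypothetical helper: show the raw bytes of a string as hex and report
// whether they form valid UTF-8, independent of terminal or editor settings.
function describeBytes(string $s): string
{
    $hex = strtoupper(bin2hex($s));
    $validity = mb_check_encoding($s, 'UTF-8') ? 'valid UTF-8' : 'not valid UTF-8';
    // chunk_split() adds a trailing separator, so trim it off.
    return trim(chunk_split($hex, 2, ' ')) . ' (' . $validity . ')';
}

$latin1 = "Otiv\xE4gen";     // "ä" stored as the single Latin-1 byte E4
$utf8   = "Otiv\xC3\xA4gen"; // "ä" stored as the two UTF-8 bytes C3 A4

echo describeBytes($latin1), "\n";
echo describeBytes($utf8), "\n";
```

Running the same helper on the value fetched from the database shows which of the two forms the connection actually delivers.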

troelskn
I have logged into the command line and run the query `SELECT ad_address1 FROM address WHERE id = 1000;` and everything was output as expected. So what should I be looking for now?
Lizard
A: 

Before output, run the query SET NAMES utf8.

After output, you can go back and run SET NAMES latin1.

Look here, I've got the same problem

Dan
Sorry this didn't work either :( no difference in output.
Lizard
A: 

It seems you are "double encoding" Otivägen. You get this behaviour if Otivägen is already UTF-8 and you run utf8_encode() on it again. Example:

$str = "Otivägen"; // already a UTF-8 string
echo utf8_encode($str); // outputs OtivÃ¤gen

I'm not sure where the actual "double encoding" occurs, but it may be due to settings in your editor. My theory: let's say you are running Aptana Studio and its character set is set to ISO-8859-1. (In Aptana, you can check this by right-clicking a file and choosing "Properties". To set the default character encoding for all projects, choose Preferences from the Aptana main menu -> General -> Workspace.) If that's the case, the PHP source file containing $myxml and its string <myxml><node>... is detected as ISO-8859-1, but the $mystring received from the database is UTF-8. Your fixEncoding function would then take the else branch, since $myxml as a whole is seen as ISO-8859-1 and not UTF-8. This double-encodes the value from the database, and may be the cause of your problem.

Check the encoding of your actual source file in your editor, and verify that it is set to UTF-8. Alternatively, experiment with applying or removing fixEncoding/utf8_encode/utf8_decode on $myxml. Observe the results and see what needs to be done to get the value Otivägen right.
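
To make the double encoding visible at byte level, here is a minimal sketch. It assumes the mbstring extension is available (the question already uses it). latin1ToUtf8() and toUtf8Once() are hypothetical names; the first performs the same Latin-1 to UTF-8 conversion that utf8_encode() does, and the second is one possible per-value replacement for fixEncoding(), not a definitive fix.

```php
<?php
// Convert a Latin-1 (ISO-8859-1) string to UTF-8.
function latin1ToUtf8(string $s): string
{
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical helper: convert a single value only when its bytes
// are not already valid UTF-8, so existing UTF-8 is left alone.
function toUtf8Once(string $s): string
{
    return mb_check_encoding($s, 'UTF-8') ? $s : latin1ToUtf8($s);
}

$utf8   = "Otiv\xC3\xA4gen";     // already UTF-8: "ä" is C3 A4
$double = latin1ToUtf8($utf8);    // converted a second time

echo bin2hex($utf8), "\n";        // "ä" appears as c3a4
echo bin2hex($double), "\n";      // "ä" has become c383c2a4
echo bin2hex(toUtf8Once($utf8)), "\n";          // unchanged
echo bin2hex(toUtf8Once("Otiv\xE4gen")), "\n";  // Latin-1 input is converted
```

Applying the check to each database value, rather than to the assembled $myxml, keeps the encoding of the PHP source file out of the decision. It is still a heuristic: a Latin-1 string whose bytes happen to form valid UTF-8 would be left unconverted.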

Mads Mobæk
Cool thanks will try this out
Lizard
+1  A: 

I think you did everything correctly, except that your terminal is in Latin-1.

The UTF-8 sequence for ä is C3 A4, which is Ã¤ if displayed as Latin-1.
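
To tell a display problem apart from data that really was encoded twice, it helps to check the bytes rather than the glyphs. The sketch below assumes the mbstring extension is loaded; looksDoubleEncoded() is a hypothetical heuristic, not an established function.

```php
<?php
// Print each byte of UTF-8 "ä" next to the glyph a Latin-1 viewer would show.
foreach (str_split("\xC3\xA4") as $byte) {
    printf("%02X -> %s\n", ord($byte), mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1'));
}

// Hypothetical heuristic: a string looks double-encoded if it is valid UTF-8,
// uses only code points up to U+00FF, and decoding it once to Latin-1 bytes
// yields non-ASCII bytes that are themselves valid UTF-8.
function looksDoubleEncoded(string $s): bool
{
    if (!mb_check_encoding($s, 'UTF-8') || !preg_match('/^[\x{00}-\x{FF}]*\z/u', $s)) {
        return false;
    }
    $decoded = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
    return preg_match('/[\x80-\xFF]/', $decoded) === 1
        && mb_check_encoding($decoded, 'UTF-8');
}

var_dump(looksDoubleEncoded("Otiv\xC3\xA4gen"));         // correct UTF-8
var_dump(looksDoubleEncoded("Otiv\xC3\x83\xC2\xA4gen")); // encoded twice
```

If the file contents pass this check as not double-encoded, the data is fine and only the viewer (terminal or vim) is reading it as Latin-1.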

ZZ Coder
A: 

Oh boy. UTF8 issues can be a real pain and they get almost impossible to solve when something is doing re-encodings for you.

You really need to start at one end and make sure every process is UTF8. That will stop anything along the way from interpreting the data wrongly and 'converting' it for you. But significantly, it will also let you much more easily spot when something has already mis-encoded text for you (yes, I've had that problem).

And if you have UTF8 data in tables that aren't set to UTF8 and might be mis-encoded, you need to do the tables last, after the data has been re-encoded. Otherwise you will damage your data irretrievably. I've had that problem, too.

First steps:

  • Check your terminal is UTF8 compliant. Gnome-terminal is. Kterm is. ETerm is not.
  • Check your LANG setting in your shell. It should probably have .UTF-8 on the end of its value.
  • Check that vim is picking up the UTF8 setting correctly. You can check with :set encoding

This will mean that your files will be edited in UTF8.
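
The LANG check from the list above can also be done from PHP on the command line. This is a minimal sketch; langIsUtf8() is a hypothetical helper, and it assumes LANG values of the usual form such as en_US.UTF-8 or sv_SE.utf8, optionally followed by a modifier like @euro.

```php
<?php
// Hypothetical helper: true if a LANG value names a UTF-8 codeset.
function langIsUtf8(string $lang): bool
{
    return preg_match('/\.utf-?8(@.*)?$/i', $lang) === 1;
}

// getenv() returns false when the variable is not set.
$lang = getenv('LANG');
$lang = ($lang === false) ? '' : $lang;

echo 'LANG=', $lang, ' ', langIsUtf8($lang) ? '(UTF-8)' : '(not UTF-8)', "\n";
```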

Now we check MySQL.

In the MySQL CLI, do show variables like 'character_set%';. The results will probably be something like:

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     | 
| character_set_connection | latin1                     | 
| character_set_database   | latin1                     | 
| character_set_filesystem | binary                     | 
| character_set_results    | latin1                     | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+

What you're aiming for is to change all those latin1 values (or whatever you're seeing) to utf8.

set names utf8; changes three of them (character_set_client, character_set_connection and character_set_results) for the current connection only, so you might need to run it on every new connection to your database. This was the solution I had to adopt in a previous application. The other settings are changed in the my.cnf file, for which I need to direct you to the documentation. It is unlikely you will need to set them all.
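
To see at a glance which settings are still not utf8, the rows from show variables can be filtered. This is a minimal sketch: nonUtf8Variables() is a hypothetical helper, and the array is typed in by hand from the table above rather than fetched from a live connection.

```php
<?php
// Hypothetical helper: return the names of character set variables whose
// value is not a utf8 charset. The filesystem charset and the charsets
// directory are skipped because they are not expected to be utf8.
function nonUtf8Variables(array $vars): array
{
    $ignored = ['character_set_filesystem', 'character_sets_dir'];
    $remaining = [];
    foreach ($vars as $name => $value) {
        if (in_array($name, $ignored, true)) {
            continue;
        }
        // Accept "utf8" and variants that start with it, such as "utf8mb4".
        if (stripos($value, 'utf8') !== 0) {
            $remaining[] = $name;
        }
    }
    return $remaining;
}

// Values copied from the example table above.
$vars = [
    'character_set_client'     => 'latin1',
    'character_set_connection' => 'latin1',
    'character_set_database'   => 'latin1',
    'character_set_filesystem' => 'binary',
    'character_set_results'    => 'latin1',
    'character_set_server'     => 'latin1',
    'character_set_system'     => 'utf8',
    'character_sets_dir'       => '/usr/share/mysql/charsets/',
];

echo implode("\n", nonUtf8Variables($vars)), "\n";
```

Running the same filter again after changing the connection settings shows what is left for my.cnf.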

I see you're already setting the output headers, so that's good.

Now you can look at the data from the database and see why it's "wrong".

staticsan