Hi All,

I generated a SQL script from a C# application on Windows 7. The name entries contain UTF-8 characters. It works fine on a Windows machine, where I use a Python script to populate the database. The same script now fails on Linux, complaining about those special characters.

Something similar happened when I generated an XML file containing UTF characters on Windows 7: it fails to display in browsers (IE, Firefox).

I used to generate such scripts on Windows XP and they worked perfectly everywhere.

A: 

Assuming you're using Python, make sure you are using Unicode strings.

For example:

s = "Hello world"          # Regular String
u = u"Hello Unicode world" # Unicode String

Edit:
Here's an example of reading from a UTF-8 file from the linked site:

import codecs
fileObj = codecs.open( "someFile", "r", "utf-8" )
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
advait
+1  A: 

Please give a small example of a script with "utf8 characters" in the "name entries". Are you sure that they are UTF-8 and not some Windows encoding like `cp1252`? What makes you sure? Try this in Python at the command prompt:

... python -c "print repr(open('small_script.sql', 'rb').read())"

The interesting parts of the output are where it uses \xhh (where h is any hex digit) to represent non-ASCII characters e.g. \xc3\xa2 is the UTF-8 encoding of the small a with circumflex accent. Show us a representative sample of such output. Also tell us the exact error message(s) that you get from that sample script.
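As a rough way to test the same thing programmatically, here is a minimal sketch that tries to decode the raw bytes as UTF-8. The helper name `looks_like_utf8` and the sample byte strings are made up for illustration. A failed decode shows the data is not UTF-8; a successful decode is only a strong hint, since pure ASCII also decodes cleanly.

```python
def looks_like_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical samples: the same word with a-circumflex stored two ways.
utf8_sample = b"ch\xc3\xa2teau"   # \xc3\xa2 is the UTF-8 form
single_byte_sample = b"ch\xe2teau"  # \xe2 is the cp1252 / Latin-1 form

print(looks_like_utf8(utf8_sample))
print(looks_like_utf8(single_byte_sample))
```

For a real check you would pass in the bytes read from the script file opened in `'rb'` mode, as in the command above.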

Update: It appears that you have data encoded in cp1252 or similar (Latin1 aka ISO-8859-1 is as rare as hen's teeth on Windows). To get that into UTF-8 using Python, you'd do `fixed_data = data.decode('cp1252').encode('utf8')`. I can't help you with C# -- you may like to ask a separate question about that.
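The decode/encode step could be wrapped up for a whole file roughly as follows. This is only a sketch: the function names `recode` and `recode_file` are hypothetical, and it assumes the source really is cp1252, which you should confirm from the `repr` output first.

```python
import io

def recode(data, source="cp1252", target="utf-8"):
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)

def recode_file(src_path, dst_path, source="cp1252", target="utf-8"):
    """Read src_path as raw bytes, recode them, and write the result to dst_path."""
    with io.open(src_path, "rb") as src:
        data = src.read()
    with io.open(dst_path, "wb") as dst:
        dst.write(recode(data, source, target))

# \xe2 is a-circumflex in cp1252; UTF-8 represents it as \xc3\xa2.
print(repr(recode(b"ch\xe2teau")))
```

Writing to a separate output path keeps the original file intact in case the guess about the source encoding turns out to be wrong.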

John Machin
Here is the output produced from the command above: `INSERT INTO customer (id, name)VALUES (2,'Mic M\xfcnchen');\r\n`. The actual name is "Mic München". Thank you.
That means it's not UTF-8. In UTF-8, `ü` would be `\xc3\xbc`. `\xfc` means it's in latin-1 or cp1252 or some other encoding (quite a few single-byte encodings use `\xfc` for that character).
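To see the difference directly, this small sketch encodes the same name in the three encodings mentioned and prints the resulting bytes. The variable names are illustrative only.

```python
# u"\u00fc" is the letter u with diaeresis, written as an escape so the
# snippet does not depend on the encoding of the source file itself.
name = u"M\u00fcnchen"

encoded = {}
for encoding in ("utf-8", "latin-1", "cp1252"):
    encoded[encoding] = name.encode(encoding)
    print("%-8s %r" % (encoding, encoded[encoding]))
```

Only the UTF-8 result contains the two-byte sequence `\xc3\xbc`; the two single-byte encodings both produce `\xfc`.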
Thomas Wouters
Exactly, I found it. The file seems to be in cp1252. Now how do I force the C# application to save the names in UTF-8? Or how do I change the Windows 7 encoding from cp1252 to UTF-8?
I found a way to force files created by C# applications to be UTF-8: pass Encoding.UTF8 as the encoding parameter when constructing the StreamWriter.