Hi,

I can access the database either from a .NET program (using ODBC) or through a database management tool (written in Java).

If I write an 'é' character to the database from the .NET program, it appears as 'Õ' (capital O with tilde) in the DB management tool.

If I write an 'é' character to the database from the DB management tool, it appears as 'Å' (capital A with ring above) in the .NET program.

I am not trying to actually solve the problem (i.e. having both programs show the same thing), although that would be nice. I am merely trying to guess which character sets each is using to interpret the data, so that I can do the conversion myself if I dump data using .NET and re-input it using the tool.

So, which combination of 2 character sets would give the character mismatches described above?

Thanks for your help.

EDIT: using Sybase ASE 12.5

EDIT: basically the question is: do you know of a character encoding in which the byte 0xE9 represents 'Õ' (capital O with tilde) or 'Å' (capital A with ring above)? (This assumes one of the two programs is using Latin-1, where 'é' is stored as 0xE9, which I think is pretty likely.)
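One way to hunt for such an encoding is to decode the single byte 0xE9 under a list of candidate codecs and see which ones yield the characters in question. The following is only a minimal Python sketch: the candidate list is an arbitrary selection of single-byte codecs that ship with Python, not anything reported by Sybase, and the helper name `codecs_decoding_to` is made up for illustration.

```python
# Candidate single-byte codecs to try. This list is an illustrative
# assumption; extend it with whatever encodings seem plausible.
CANDIDATES = ["latin-1", "cp1252", "cp437", "cp850", "mac_roman", "hp_roman8"]


def codecs_decoding_to(byte_value, wanted, candidates=CANDIDATES):
    """Return the candidate codecs under which one byte decodes to `wanted`."""
    matches = []
    for name in candidates:
        try:
            decoded = bytes([byte_value]).decode(name)
        except UnicodeDecodeError:
            continue  # this byte is undefined in that codec
        if decoded == wanted:
            matches.append(name)
    return matches


# 'é' as Latin-1 would store it: the single byte 0xE9.
e_acute_byte = "é".encode("latin-1")[0]

print(codecs_decoding_to(e_acute_byte, "Õ"))
print(codecs_decoding_to(e_acute_byte, "Å"))
```

The same helper can be pointed at any other byte and character pair that shows up mangled.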

EDIT: Paul's solution does it. The answer about the charset is: hp-roman8
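For anyone who wants to check the hp-roman8 answer or do the conversion by hand, here is a small Python sketch. It assumes the .NET side reads and writes Latin-1 and the Java tool reads and writes HP Roman-8 (Python's `hp_roman8` codec); the function names `as_seen_by` and `repair` are hypothetical helpers, not part of any Sybase or .NET API.

```python
def as_seen_by(text, writer_codec, reader_codec):
    """Encode text with the writer's charset, then decode it with the reader's."""
    return text.encode(writer_codec).decode(reader_codec)


def repair(garbled, writer_codec, reader_codec):
    """Undo the mismatch: recover the bytes the reader saw, decode as the writer."""
    return garbled.encode(reader_codec).decode(writer_codec)


# Written by .NET (assumed Latin-1), displayed by the tool (assumed HP Roman-8).
print(as_seen_by("é", "latin-1", "hp_roman8"))

# Written by the tool (assumed HP Roman-8), displayed by .NET (assumed Latin-1).
print(as_seen_by("é", "hp_roman8", "latin-1"))

# Turning the garbled character back into what was originally written.
print(repair("Õ", "latin-1", "hp_roman8"))
```

Under these assumptions both mismatches from the question fall out of the same pair of charsets: 'é' is 0xE9 in Latin-1 and 0xC5 in HP Roman-8, while 0xE9 is 'Õ' in HP Roman-8 and 0xC5 is 'Å' in Latin-1.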

A: 

I would guess they are using different text encoding schemes. Read this.

(If you don't know about character encodings and Unicode, please read my article on the subject first.)

As stated at the start of the article, strings are always in Unicode encoding. The idea of "a Big-5 string" or "a string in UTF-8 encoding" is a mistake (as far as .NET is concerned) and usually indicates a lack of understanding of either encodings or the way .NET handles strings. It's very important to understand this - treating a string as if it represented some valid text in a non-Unicode encoding is almost always a mistake.

Now, the Unicode coded character set (one of the flaws of Unicode is that the one term is used for various things, including a coded character set and a character encoding scheme) contains more than 65536 characters. This means that a single char (System.Char) cannot cover every character. This leads to the use of surrogates where characters above U+FFFF are represented in strings as two characters. Essentially, string uses the UTF-16 character encoding form. Most developers may well not need to know much about this, but it's worth at least being aware of it.
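To make the surrogate point concrete, here is a short sketch in Python rather than .NET (so it illustrates UTF-16 itself, not `System.String`). The helper `utf16_code_units` is a made-up name; it simply splits the UTF-16 encoding of a string into 16-bit units.

```python
def utf16_code_units(text):
    """Return the 16-bit UTF-16 code units of `text` as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


# 'é' (U+00E9) is below U+FFFF and fits in a single code unit.
print([hex(unit) for unit in utf16_code_units("é")])

# U+1F600 is above U+FFFF and needs a surrogate pair (two code units).
print([hex(unit) for unit in utf16_code_units("\U0001F600")])
```

The first call yields one unit and the second yields two, which is why a single 16-bit char cannot cover every character.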

kenny
I do not see how that has anything to do with the question asked. I know what Unicode is and how it works, but the data sent by the database is obviously being sent as bytes, which get interpreted differently by the .NET framework and the DB tool when converting them back to characters.
Laurent
+1  A: 

Sybase automatically tries to convert between character sets when the server and the client use different ones. If you turn the automatic charset conversion off using

set char_convert off

do you still get the same 'Õ' and 'Å' characters?

Paul Owens
Brilliant! It worked! I'm still curious as to what on earth Sybase thought it was converting the text to, but that solves the problem! Thanks!
Laurent