views:

1527

answers:

2

I have a C#.Net application that accesses data from a commercial application backed by an Oracle 10 db. A couple of fields in the commercial app's database (declared as varchar2(n)) contain special characters. The "smart quote" apostrophe, for example. The commercial client app displays these characters correctly, but my application is displaying them as an inverted question mark. The Oracle character set is "WE8ISO8859P1".

My application reads the commercial database using System.Data.OracleClient.OracleDataAdapter, converted into a table via DataSet.Tables. The tablerows are converted into objects, and the fields in question are stored as strings.

If I examine (in the debugger) the data in the DataSet immediately after reading it from the db, and the special characters are already displayed incorrectly. I can't figure out how to examine the data as hex bytes to see what's really there, nor am I certain what I should be looking for.

I have also noted that Toad displays the characters as inverted question marks as well.

One aspect of our application writes these records to a separate table in our own database; when that occurs the special characters get modified, and subsequently display as boxes instead of upside-down question marks.

I can provide further information if needed. Thank you for any and all help!

+2  A: 

Certain characters in the WE8ISO8859P1 character set have a different binary representation than the same character in UTF8.

What I suggest are 2 possible ways

1) Try using Oracle native data providers for .NET (ODP.NET). May be there is a bug/feature in Microsoft's library System.Data.OracleClient that this adapter do not automatically support converting WE8ISO8859P1 to unicode. Here is a link to ODP.NET

I hope that there will be a support for this encoding in ODP (but to say true I never checked this, it is only a suggestion)

2) Workaround: in Dataset, you should create a binary field (mapped to the original table field) and a String field (not mapped to the database). When you load data to the dataset, iterate for each row and perfrom convertion from binary array to string.

Code should be something like this

Encoding e = Encoding.GetEncoding("iso-8859-1");
foreach(DataRow row in dataset.Tables["MyTable"])
{
    if (!row.IsNull("MyByteArrayField"))
        row["MyStringField"] = e.GetString((row["MyByteArrayField"] as byte[]));
}
Bogdan_Ch
Bogdan, thanks for your suggestions! I tried both paths, and while I don't have a solution yet, I do have more information. We are reading the strings as non-Unicode, and the special characters in question are coming through as three-byte characters. Depending on how we translate the string, the three bytes get interpreted any number of ways.The DB link to the remote database went down yesterday afternoon, and we're still working to fix that. When I have access again, I plan to try to read the string as unicode and attempt to translate the special character again.
RobLinx
@RobLinx thanks for accepting my answer, I appreciate this :) However I need to inform other possible reader of this question that resolution was different from mine (and you pointed it out in your comments). Perhaps you can answer this question yourself and accept your own answer and earn a "Self Learner" badge if other users will vote :).
Bogdan_Ch
A: 

Postscript for anyone browsing this thread:

Bogdan was very helpful in getting me to the "answer" (such as it is) but as he points out, you might not have identical circumstances.

  1. We communicated with the team responsible for using the commercial software. They had been copying/pasting from Word and Excel, which is how the special characters had been getting inserted.

  2. The problem occurred in the translation of the character between the remote database and our database. Host database uses character set WE8ISO8859P1, where ours uses WE8MSWIN1252. Due to corporate-level concerns, modifying either character set is not feasible right now.

  3. I used SYS.UTL_RAW.CAST_TO_RAW(fieldname) to convert the source field to search for 'BF' (the hex code for an inverted question mark in our character set). This at least let me identify the problem record and character. HOWEVER, many different special characters on the remote records would/could be translated to BF. For example, Word's hyphens are not simple "dash" characters, and also get translated to the inverted question mark.

  4. dump(fieldname) somehow converts to decimal character codes BEFORE the translation, UNLESS I also used the SYS.UTL_RAW.CAST_TO_RAW in the same query. This caused amazing headaches. dump() by itself could be useful in identifying specific pretranslated characters from the source db.

Best solution would be to use the same character set on both dbs. Since that's not possible for us, we have manually replaced all occurrences of the special character on the source (remote) db with non-special equivalents (regular apostrophe or hyphen). However, since the commercial software doesn't correct or flag special characters, we may run into this problem in the future. So, our update application will scan for the inverted question mark and send a notification to the system owner with the ID of the bad record. This, like so many other corporate situations, will have to do. ;-)

Thanks again, Bogdan!

RobLinx