The problem falls into two parts:

Problem Step 1. Access 97 db containing XML strings that are encoded in UTF-8.

The problem boils down to this: the Access 97 db contains XML strings that are encoded in UTF-8. So I created a patch tool that converts the XML strings from UTF-8 to Unicode. To convert a UTF-8 string to Unicode, I used the function MultiByteToWideChar(CP_UTF8, 0, PChar(OriginalName), -1, @newName, Size); (where newName is an array declared as "newName : Array[0..2048] of WideChar;").

This function works well in most cases; I have checked it with Spanish and Arabic characters. But with Greek and Chinese characters it chokes.

For some Greek characters like "Ευγ. ΚαÏαβιά" (as stored in Access 97), the resulting string contains null characters in between, and when it is stored to a WideString the characters get clipped.

For some Chinese characters like "?¢»?µ?" (as stored in Access 97), the result is totally absurd, like "?¢»?µ?".

Problem Step 2. Access 97 db text strings: the application GUI takes Unicode input and saves it in Access 97.

First I checked with Arabic and Spanish characters, and it seemed that no explicit character encoding is required. But again the problem comes with Greek and Chinese characters.

I tried the same function mentioned above for the text conversion (is that correct?), and the result was again disappointing. The Spanish characters, which are OK without conversion, are either lost or converted to regular ASCII letters.

The Greek and Chinese characters show similar behaviour to that mentioned in step 1.

Please guide me. Am I taking the right approach? Is there some other way around this? Right now I am confused and full of questions :)

+3  A: 

There is no special requirement for working with Greek characters. The real problem is that the characters were stored in an encoding that Access doesn't recognize in the first place. When the application stored the UTF-8 values in the database, it tried to convert every single byte to the equivalent byte in the database's codepage. Every character that had no correspondence in that encoding was replaced with "?". That may mean that the Greek text is OK, while the Chinese text may be gone.
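A rough illustration of that lossy step, written in Python only because it makes the byte-level behaviour easy to see (the application in the question is Delphi). The helper name `store_in_codepage` is made up for this sketch, and code page 1253 is just an assumed database codepage:

```python
def store_in_codepage(text: str, codepage: str) -> bytes:
    """Encode text into a single-byte codepage (assumed here to be what the
    database uses). Characters the codepage lacks are replaced with b'?'."""
    return text.encode(codepage, errors="replace")


greek = "Καραβιά"
chinese = "中文"

# Greek letters exist in the Greek codepage, so the round trip is lossless.
greek_bytes = store_in_codepage(greek, "cp1253")
print(greek_bytes.decode("cp1253"))

# Chinese characters have no slot in cp1253: each one is stored as '?',
# and the original character cannot be reconstructed from that byte.
print(store_in_codepage(chinese, "cp1253"))
```

Once a character has been replaced this way, no later conversion can bring it back; only the characters that had a slot in the codepage survive.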

In order to convert the data to something readable, you have to know the codepage it is stored in. Using this you can get the actual bytes and then convert them to Unicode.
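As a sketch of that idea (Python, purely for illustration): if UTF-8 bytes were read as though they belonged to a single-byte codepage, re-encoding the garbled text with that same codepage gives the original bytes back, and those can then be decoded as UTF-8. The function name `recover_utf8` and the choice of Windows-1252 are assumptions made for this example; the real codepage of the database has to be established first.

```python
def recover_utf8(garbled: str, codepage: str = "cp1252") -> str:
    """Undo a 'UTF-8 bytes read as <codepage>' mix-up.

    The codepage default is an assumption for this sketch, not a fact
    about the database in the question.
    """
    raw = garbled.encode(codepage)   # back to the bytes that were stored
    return raw.decode("utf-8")       # interpret those bytes as UTF-8


original = "Ευγ."
# Simulate the damage: UTF-8 bytes misread as Windows-1252 text.
garbled = original.encode("utf-8").decode("cp1252")

print(garbled)
print(recover_utf8(garbled))
```

This only works while every byte survived the trip. If a byte had no mapping in the codepage, or a character was already replaced with "?", either call may raise an error or the result will be incomplete.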

Panagiotis Kanavos
Actually, the application does use code pages, i.e. as soon as the user selects a specific language, the respective code page is used to encode the text. The problem is that it is stored in Access 97. I am not sure whether this encoding info is saved or lost when storing.
Nains
I was referring to the codepage used in the database - unless you mean that the application stores strings using different encodings in the same field. What codepage are you using for the Greek characters?
Panagiotis Kanavos
Well, the application uses Windows code page 1253 to interpret the Greek characters to and from Access 97. And you are suggesting I look at the code page the database is using. OK, I got your point, and I am looking into this further. Thanks.
Nains
@Panagiotis Kanavos: Let's say the database uses the UTF-8 code page and I am storing large CJK (Chinese, Japanese, Korean) character strings from the application. The result will be a wrong encoding, such as this Chinese string: "???½¹«?·-?î¾®¶?". Now my question: is there any way to retrieve these characters successfully?
Nains
Finally I got the point :)
Nains