Hi,

I'm trying to get some legacy code to display Chinese characters properly. One character encoding I'm trying to work with starts with a 0x7F and is 4 bytes long (including the 0x7F byte). Does anyone know what kind of encoding this is and where I can find information for it? Thanks..

UPDATE: I've also had to work with some Japanese encoding that starts every character with 0xE3 and is three bytes long. It displays properly on my computer if I choose the Japanese locale in Windows, but it doesn't display properly in our application. And if any locale other than Japanese is selected, I cannot even view the filenames properly. So I'm guessing this encoding is not Unicode. Does anyone know what it is? Is it ANSI? Is it Shift JIS?

For the Chinese one, I've tested it with Unicode and UTF-8 characters and I'm getting the same pattern: 0x7F followed by three bytes. Are Unicode and UTF-8 the same?

A: 

It might be a valid Unicode encoding, such as UTF-8 or a UTF-16 surrogate pair.
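
For illustration, here's what a UTF-16 surrogate pair looks like on the wire (a quick Python check; 𝄞, U+1D11E, is just an arbitrary character outside the BMP):

# U+1D11E is encoded as the surrogate pair D834 DD1E: two 16-bit units, four bytes.
print("𝄞".encode("utf-16-be").hex())   # d834dd1e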

1800 INFORMATION
+1  A: 

You might want to look at the Chinese character encoding page on Wikipedia. The only encoding there that I can see that is always 4 bytes is UTF-32.

GB 18030 is the current standard Chinese character set, but it can be 1 to 4 bytes long.
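
To see GB 18030's variable length in practice, here is a quick Python check (Python ships a gb18030 codec; the sample characters are arbitrary):

# ASCII, a BMP ideograph, and a non-BMP character: 1, 2 and 4 bytes respectively.
for ch in ("A", "京", "𝄞"):
    print(hex(ord(ch)), len(ch.encode("gb18030")), "bytes")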

MaxVT
Cool, thanks, I'll check this out.. :)
krebstar
+6  A: 

One character encoding I'm trying to work with starts with a 0x7F and is 4 bytes long

What are the other bytes? Do you have any Latin text in this encoding?

If it's “0x7f 0x... 0x00 0x00” you are looking at UTF-32LE. It could also be two UTF-16 (either LE or BE) characters.

Most East Asian encodings use 0x80-0xFF as lead bytes for non-ASCII characters; there is none I know of that would use a leading 0x7F as anything other than an ASCII delete.
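
For comparison, here is a quick Python sketch of what the plausible candidates actually look like on the wire, using 京 (U+4EAC) as a sample character; note that none of them produces a 0x7F lead byte:

for enc in ("utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(enc, "京".encode(enc).hex())
# utf-16-le ac4e, utf-16-be 4eac, utf-32-le ac4e0000, utf-32-be 00004eac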

ETA:

are there supposed to be Byte Order Marks?

There doesn't need to be a BOM if there is an out-of-band way of signalling that the encoding is ‘UTF-32LE’ (possibly one that is lost before it gets to you).

I've also had to work with some Japanese encoding that starts every character with a 0xE3 and is three bytes long.

That's surely UTF-8. Sequence 0xE3 0x... 0x... would result in a character between U+3000 and U+4000, which is where the hiragana/katakana live.
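
A quick Python check (the two kana are arbitrary samples):

# Hiragana あ (U+3042) and katakana ア (U+30A2) both start with 0xE3 in UTF-8.
for ch in "あア":
    print(hex(ord(ch)), ch.encode("utf-8").hex())
# 0x3042 e38182, 0x30a2 e382a2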

It displays on my computer properly if I choose the Japanese locale in Windows, however, it doesn't display properly in our application.

Then chances are your application is one of the regrettable horde of non-Unicode-compliant apps, still using ‘A’(*) versions of the Win32 interfaces instead of the ‘W’-suffixed ones. Whether you can read in the string according to its real encoding is moot: a non-Unicode-compliant app will never be able to display an East Asian ideograph on a Western locale.

(*: named for “ANSI”, which is Windows's misleading term for “whatever the system codepage is set to at the moment”. That's why changing your locale affected it.)

ETA(2):

OK, cracked it. It's not any standardised encoding I've met before, but it's relatively easy to decipher if you assume the premise that Unicode code points are being encoded.

0x00-0x7E: plain ASCII
0x7F A B C: Unicode character

The character encoded by an escape can be calculated by taking the indexes of A, B and C in a key string and adding them together:

A*0x1000 + B*0x40 + C

That is, it's a base-64 character set, but it's not the usual Base64 standard. A little experimentation gives a key string of:

.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz

The ‘.’ and ‘_’ characters are guesses, since none of the characters you posted uses them. We'd need more data to find out the exact string.

So, for example:

0x7F 3 u g
A=4 B=58 C=44
4*0x1000 + 58*0x40 + 44 = 0x4EAC
U+4EAC = 京

ETA(3):

Yeah, it should be easy to create a native Unicode string by sucking out each code point manually and joining as a character. Not quite sure what's available on whatever platform you're using, but any Unicode-capable platform should be able to make a string from codepoints simply (and hopefully without having to manually re-encode to UTF-16LE bytes).
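
Here's a minimal Python sketch of that conversion, assuming the key string guessed above (the function and variable names are mine, and the ‘.’ and ‘_’ slots remain guesses):

KEY = ".0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

def decode_custom(data):
    # Plain ASCII (0x00-0x7E) passes through; 0x7F A B C encodes the
    # code point A*0x1000 + B*0x40 + C, with A, B, C as indexes into KEY.
    out = []
    i = 0
    while i < len(data):
        if data[i] == 0x7F:
            a, b, c = (KEY.index(chr(x)) for x in data[i+1:i+4])
            out.append(chr(a * 0x1000 + b * 0x40 + c))
            i += 4
        else:
            out.append(chr(data[i]))
            i += 1
    return "".join(out)

def encode_custom(text):
    # The inverse: ASCII passes through, everything else becomes 0x7F A B C.
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x7F:
            out.append(cp)
        else:
            out += bytes([0x7F, ord(KEY[cp // 0x1000]),
                          ord(KEY[(cp // 0x40) % 0x40]), ord(KEY[cp % 0x40])])
    return bytes(out)

# The byte dump from the comments round-trips to the expected characters.
sample = bytes([0x7F, 0x33, 0x75, 0x67, 0x7F, 0x38, 0x68, 0x6A, 0x7F, 0x37, 0x45, 0x52])
assert decode_custom(sample) == "京魯菜"
assert encode_custom("京魯菜") == sample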

I figured it must be Unicode codepoints by noticing that the three example characters had first escape-characters in the same general range, and in the same numerical order as their Unicode codepoints. The other two characters seemed to change randomly, so it was very likely a big-endian encoding of the code point, and probably a base-64 encoding as 6 is as many bits as you can get out of readable ASCII.

Standard Base64 itself starts with letters, which would put something starting with a number too far up to be in the Basic Multilingual Plane. So I started guessing with ‘0123456789ABCDEFG...’ which would be the other obvious choice of key string. That got resulting numbers that were close to the code points for the given characters, but a bit too low. Inserting an extra character at the start of the key string (so digit ‘0’ doesn't map to number 0) got one of the characters right and the other two very close; the one that was right had no lower-case letters, so to change only the lower-case letters I inserted another character between the upper and lower cases. This came up with the right numbers.

It's not guaranteed that this is actually right, but (apart from the arbitrary choice of inserted characters) it's very likely to be it.

bobince
It looks like that UTF-32 you've mentioned, but I can't see byte order marks anywhere.. Are there supposed to be any byte order marks?
krebstar
Damn, I'm really confused now. For the Chinese one, there is no Latin text in the encoding. It looks like 0x7F 0x.. 0x.. 0x.. for every "letter". If I add "Latin" text (by which I mean ASCII), it is simply added as an ASCII byte (<= 0x7F).
krebstar
Crap, my bad. Windows does display the text properly.
krebstar
Can you give more complete examples of the 0x7F-ridden text? I know of no encoding in existence that combines ASCII single-bytes with 0x7F for a 4-byte escape sequence. Maybe the content has already been mauled through a bad charset conversion or something?
bobince
[0x3] 0x7f '' char [0x4] 0x33 '3' char [0x5] 0x75 'u' char [0x6] 0x67 'g' char [0x7] 0x7f '' char [0x8] 0x38 '8' char [0x9] 0x68 'h' char [0xa] 0x6a 'j' char [0xb] 0x7f '' char [0xc] 0x37 '7' char [0xd] 0x45 'E' char [0xe] 0x52 'R' char. Translates to: 京魯菜
krebstar
Sorry, I didn't notice you'd already replied.. Anyway, that's the byte sequence of the characters and what it comes out to.. I need to know what encoding it is in.. :(
krebstar
I don't think it's been mauled, because the app can display it properly..
krebstar
Oh wow, you are the greatest..! So now I just need to write a function to convert UTF-8 to this format?
krebstar
Just curious, how'd you figure it out? You know, multiplying A by 0x1000 and B by 0x40? Where'd you get these numbers?
krebstar
+1 for sheer effort
Orion Edwards
Damn thanks man, I owe you one! :D
krebstar
Hmm, I just tested this out because the vendor's solution had some drawbacks.. I don't get it, how is '3' (ASCII 0x33) = A = 0x4? I don't get 0x4EAC from ' (0x33 * 0x1000) + (0x75 * 0x40) + 0x67 ' ... Maybe I'm doing something wrong here..
krebstar
‘3’ counts as 4 because it's the character at index 4 in the key string. Similarly, ‘u’ is at index 58 in the key string and ‘g’ is at index 44 (i.e. it is the forty-fifth character). This gives (4*0x1000) + (58*0x40) + 44 = 0x4EAC.
bobince
A: 

Yes, the Chinese one is UTF-8, an implementation (encoding) of Unicode. UTF-8 is 1 byte long for ASCII characters and up to 4 bytes for others.
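
A quick Python illustration of those lengths (the sample characters are arbitrary):

# 1, 2, 3 and 4 bytes respectively in UTF-8.
for ch in ("A", "é", "京", "𝄞"):
    print(hex(ord(ch)), len(ch.encode("utf-8")), "bytes")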

GogaRieger
+1  A: 

Try chardet. It does a good job of guessing the character encoding of a string of bytes.
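
For example (a minimal sketch; "mystery.txt" stands in for whatever file holds the raw bytes):

import chardet

with open("mystery.txt", "rb") as f:
    raw = f.read()
guess = chardet.detect(raw)   # returns a dict with 'encoding' and 'confidence'
print(guess)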

Are Unicode and UTF-8 the same?

No. UTF-8 is just one way to represent Unicode characters as a sequence of bytes. Unicode is the full standard, assigning numeric and human-readable identifiers to each character, as well as lots of metadata about the characters.

Matt Good