ansaurus

Question

C++ encode string to Unicode - ICU library

Answer 1

+2 A:

This doesn't resemble an ISO 2022 encoding. The high bits are supposed to be zero. The escape sequence looks somewhat recognizable, but it should start with ESC (0x1b), not 0xb0. No idea what those byte values really mean.

Hans Passant 2010-09-15 20:34:40

You're right. The escape sequence was wrong. Stupid mistake. However, I thought the input string was correct for ISO 2022. The standard supports 8-bit encoding - that's why you have the GL and GR areas. Also, looking at http://en.wikipedia.org/wiki/ISO/IEC_2022 for ISO-2022-JP, given that escape sequence it should bind the http://en.wikipedia.org/wiki/JIS_X_0201 character set, which does map the higher bytes. I'm using this reference for ISO-2022: http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-035.pdf (section 8 describes 8-bit codes).

Budric 2010-09-15 20:56:26

It is a complete bugger of an encoding, about the worst I've ever seen. It is extraordinarily sensitive to decoder state, do make sure you use *real* data from a known-good source. One way to get that if you don't have good data is to *encode* what you expect to see first, then push that back to the decoder.

Hans Passant 2010-09-15 21:10:25

I agree completely. An absolute nightmare to deal with. I will try to make sure my input is good.

Budric 2010-09-15 21:18:25

What is it about ISO that has the suck dial at eleven *all the farking time*. Completely dominated by proprietary powers, somebody needs to put them out of business. Go Ecma! Oh, and the really cool people in the Unicode group. High time for a Nobel there. Ahem, `</editorial>`

Hans Passant 2010-09-15 21:38:30

Answer 2

+1 A:

(This question looks familiar, Hi again.)

A minor, minor nit: You want to check the error status with if(U_FAILURE(status)) (or conversely, U_SUCCESS(status)).

Steven R. Loomis 2010-09-16 01:03:54

Answer 3

+1 A:

I couldn't get the conversion to work for the JIS X 0201 character set in ISO-2022-JP encoding. And I couldn't generate a "valid" one using any tools at my disposal - tried Java (ICU and non-ICU implementations of ISO 2022) and C++.

So I basically just wrote a function to do a code lookup and convert to Unicode using the JIS X 0201 table on Wikipedia.

EDIT: As I started filling out the bug report I wanted to include the RFC for ISO-2022-JP. Then I found this line in the RFC: "The Kana set of JIS X 0201 is not used in ISO-2022-JP messages." So it appears that the standard doesn't actually define the upper bits. ISO-2022-JP-3 WILL map the upper bits, but to the lower plane. So I would have to take each byte >= 128, subtract 0x80 from it and pass it through the ISO-2022-JP-3 converter, then take the other bytes < 128 and pass them through the ISO-2022-JP converter, to cover the full JIS X 0201 character set. Well, it's a lot easier to just do it myself.

So strictly speaking I would say it's not a bug. It's a huge headache though.

P.S. the whole messed up stream that I'm trying to decode comes from DICOM. See pdf page 107 to see what they consider acceptable.

Budric 2010-09-20 14:52:16
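The answer above does not show the lookup function itself. A minimal sketch of what such a per-byte lookup might look like is given here, assuming the usual JIS X 0201 layout: the lower half matches ASCII except 0x5C (yen sign) and 0x7E (overline), and the half-width katakana at 0xA1..0xDF map linearly onto U+FF61..U+FF9F. The function name `jisX0201ToUnicode` and the choice to return U+FFFD for unassigned bytes are illustrative assumptions, not the poster's actual code.

```cpp
#include <cstdint>

// Hypothetical helper: maps a single JIS X 0201 byte to a Unicode code point.
// Unassigned bytes (0x80..0xA0 and 0xE0..0xFF) yield U+FFFD, the replacement
// character; that policy is an assumption made for this sketch.
std::uint32_t jisX0201ToUnicode(std::uint8_t b) {
    if (b == 0x5C) return 0x00A5;              // yen sign in place of backslash
    if (b == 0x7E) return 0x203E;              // overline in place of tilde
    if (b < 0x80) return b;                    // rest of the lower half is ASCII
    if (b >= 0xA1 && b <= 0xDF) {
        return 0xFF61u + (b - 0xA1u);          // half-width katakana block
    }
    return 0xFFFD;                             // not assigned in JIS X 0201
}
```

Calling this on the bytes 0xA6 and 0xA7 gives U+FF66 and U+FF67, the same code points quoted later in the thread. Escape sequences and multi-byte sets such as JIS X 0208 are outside the scope of this sketch and would still need a real converter.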

Very interesting. Did you try converting from 2022 to Unicode using Java (non ICU)?

Steven R. Loomis 2010-09-20 21:03:50

I tried: Charset iso2022JP = Charset.forName("ISO-2022-JP"); CharBuffer result = iso2022JP.decode(ByteBuffer.wrap(bytes)); where the Charset was sun.nio.cs.ext.ISO2022_JP and the results are the same as from ICU. It's basically not mapping the chars above 128.

Budric 2010-09-21 15:45:51

Budric could you file a bug on ICU at http://bugs.icu-project.org/trac/newticket and mention these findings? Really strange that neither of these support this in the encoding.

Steven R. Loomis 2010-09-21 19:04:16

Well...since the other encoder does the same thing it may not be a bug. See edit above.

Budric 2010-09-21 19:27:28

It's not a bug. This input sequence: "\x1B\x28\x4A" "ABC\x1b\x28\x49\x26\x27"

Produces: 0041 0042 0043 FF66 FF67

I added "<ESC>(I" to shift into the Katakana range, and stripped off the high bits (A6/A7 -> 26/27).

The Wikipedia page you linked states "For example, the most significant bit of each byte does not carry any meaning; this allows ISO/IEC 2022 (like ISO/IEC 646) to be easily transmitted through 7-bit communication channels." Ken Lunde's CJKV book states the same.

JIS X 0201 is used, but only the lower 128 values of it.

Steven R. Loomis 2010-09-22 21:37:33
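The manual rewrite described in the comment above (insert "<ESC>(I", strip the high bit) could be automated as a pre-processing pass before handing the bytes to a converter. The sketch below is one way to do that, under loud assumptions: the function name `foldKanaTo7Bit` is made up for illustration, the surrounding single-byte text is assumed to be JIS X 0201 Roman (so the pass switches back with "<ESC>(J"), and no multi-byte set is active when a high byte appears.

```cpp
#include <string>

// Hypothetical pre-processing pass: rewrites 8-bit JIS X 0201 katakana bytes
// (0xA1..0xDF) as 7-bit bytes bracketed by the escape sequences ESC ( I and
// ESC ( J. Assumes the rest of the stream is JIS X 0201 Roman.
std::string foldKanaTo7Bit(const std::string& in) {
    const std::string toKana  = "\x1B(I";   // designate JIS X 0201 Katakana
    const std::string toRoman = "\x1B(J";   // designate JIS X 0201 Roman
    std::string out;
    bool inKana = false;
    for (unsigned char c : in) {
        const bool high = (c >= 0xA1 && c <= 0xDF);
        if (high && !inKana) {
            out += toKana;                   // entering a katakana run
            inKana = true;
        } else if (!high && inKana) {
            out += toRoman;                  // leaving a katakana run
            inKana = false;
        }
        // Strip the high bit of katakana bytes; copy everything else as-is.
        out += static_cast<char>(high ? (c & 0x7F) : c);
    }
    if (inKana) out += toRoman;              // leave the stream in Roman state
    return out;
}
```

For the example in this thread, the input "\x1B\x28\x4A" "ABC\xA6\xA7" becomes "\x1B\x28\x4A" "ABC\x1B\x28\x49\x26\x27" followed by a trailing "\x1B\x28\x4A". The result is only useful with a converter that accepts the ESC ( I designation; whether a given converter does is something to verify against its documentation.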

I looked at the PDF but didn't see any text with the high bit set after the ESC(J …

Steven R. Loomis 2010-09-22 21:45:13
