Hi,

I need to convert byte strings encoded in ISO-2022-JP and ISO-2022-JP-2 (and other variations of ISO-2022) into Unicode. I am trying to use ICU, but the following code doesn't work.

std::string input = "\x1B\x28\x4A" "ABC\xA6\xA7";    //the first 3 chars are escape sequence to use JIS_X201 character set in GL/GR
UErrorCode status = U_ZERO_ERROR;
UConverter *conv;
// set up the converter
conv = ucnv_open("ISO-2022-JP", &status);
if (status != U_ZERO_ERROR) return false;   //couldn't find character set

UChar * convDest = new UChar[2*input.length()]; //ucnv_toUChars will use up to 2*length

// convert to Unicode
int resultLen = (int)ucnv_toUChars(conv, convDest, 2*input.length(), input.c_str(), input.length(), &status);

The result contains '?' characters for every byte I put in above the ASCII range, yet the status reports no error. What am I doing wrong?

On top of that, I had trouble compiling version 4.4 of the library because the MSVC 9 project would not convert to an MSVC 10 project.

I am also aware of the open source libiconv library, but I couldn't compile that one on Windows. If anyone has advice on a different library, that's also welcome.

Thanks.

EDIT The escape sequence I originally used was wrong. So now ICU takes the string, strips out the escape sequence - which is a step in the right direction. But the result still contains '?' chars.

EDIT2 The reason I couldn't convert to MSVC 10 project was because x64 platform wasn't installed (it isn't by default). Alternatively I could open all the projects in text editor and remove all mention of x64 target.

+2  A: 

This doesn't resemble an ISO 2022 encoding. The high bits are supposed to be zero. The escape sequence looks somewhat recognizable, but it should start with ESC, which is 0x1b, not 0xb0. No idea what those byte values really mean.

Hans Passant
You're right. The escape sequence was wrong. Stupid mistake. However, I thought the input string was correct for ISO 2022. The standard supports 8-bit encoding; that's why you have the GL and GR planes. Also, looking at http://en.wikipedia.org/wiki/ISO/IEC_2022 for ISO-2022-JP, that escape sequence should bind the http://en.wikipedia.org/wiki/JIS_X_0201 character set, which does map the higher bytes. I'm using this reference for ISO-2022: http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-035.pdf (section 8 describes 8-bit codes).
Budric
It is a complete bugger of an encoding, about the worst I've ever seen. It is extraordinarily sensitive to decoder state, do make sure you use *real* data from a known-good source. One way to get that if you don't have good data is to *encode* what you expect to see first, then push that back to the decoder.
Hans Passant
I agree completely. An absolute nightmare to deal with. I will try to make sure my input is good.
Budric
What is it about ISO that has the suck dial at eleven *all the farking time*. Completely dominated by proprietary powers, somebody needs to put them out of business. Go Ecma! Oh, and the really cool people in the Unicode group. High time for a Nobel there. Ahem, `</editorial>`
Hans Passant
+1  A: 

(This question looks familiar, Hi again.)

A minor, minor nit: You want to check the error status with if(U_FAILURE(status)) (or conversely, U_SUCCESS(status)).
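To illustrate why the nit matters, here is a minimal sketch. It does not use ICU itself; `DemoStatus`, `demoFailure`, and `naiveFailure` are hypothetical stand-ins that only mirror ICU's convention that warning codes are negative, `U_ZERO_ERROR` is zero, and real errors are positive. In real code, include the ICU headers and use `U_FAILURE(status)` directly.

```cpp
#include <cstdint>

// Hypothetical stand-in for UErrorCode. The numeric values are illustrative;
// only the sign convention (warning < 0, success == 0, error > 0) is the point.
enum DemoStatus : std::int32_t {
    DEMO_WARNING = -1,   // call succeeded, but with a note attached
    DEMO_ZERO_ERROR = 0, // plain success
    DEMO_ERROR = 1       // call failed
};

// Same shape as ICU's U_FAILURE: only positive codes count as failures.
inline bool demoFailure(DemoStatus s) { return s > DEMO_ZERO_ERROR; }

// The check used in the question: anything that is not exactly zero is fatal.
inline bool naiveFailure(DemoStatus s) { return s != DEMO_ZERO_ERROR; }
```

With these definitions, a warning status makes `naiveFailure` bail out even though the call succeeded, while `demoFailure` lets it through.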

Steven R. Loomis
+1  A: 

I couldn't get the conversion to work for JIS_X201 character set in ISO-2022-JP encoding. And I couldn't generate a "valid" one using any tools at my disposal - tried Java (ICU and non ICU implementation of ISO2022) and C++.

So I basically just wrote a function that does a code lookup and converts to Unicode, using a table from Wikipedia.
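A lookup of that kind might look roughly like the sketch below. It is not the code actually used here; `decodeJisX0201` is a hypothetical name, and the sketch assumes the escape sequences have already been stripped so that only 8-bit JIS X 0201 payload bytes remain. The mapping follows the usual JIS X 0201 layout: ASCII-like in the lower half except 0x5C (yen sign) and 0x7E (overline), and halfwidth katakana at 0xA1..0xDF.

```cpp
#include <string>

// Hypothetical helper: decode 8-bit JIS X 0201 bytes into UTF-16 code units.
// Assumes any ISO 2022 escape sequences were removed beforehand.
std::u16string decodeJisX0201(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes) {
        if (b == 0x5C) {
            out.push_back(u'\u00A5');                 // yen sign replaces backslash
        } else if (b == 0x7E) {
            out.push_back(u'\u203E');                 // overline replaces tilde
        } else if (b < 0x80) {
            out.push_back(static_cast<char16_t>(b));  // same as ASCII
        } else if (b >= 0xA1 && b <= 0xDF) {
            // Halfwidth katakana block: 0xA1 maps to U+FF61, 0xDF to U+FF9F.
            out.push_back(static_cast<char16_t>(0xFF61 + (b - 0xA1)));
        } else {
            out.push_back(u'\uFFFD');                 // byte is undefined in JIS X 0201
        }
    }
    return out;
}
```

For the payload from the question, `decodeJisX0201("ABC\xA6\xA7")` yields the code units 0041 0042 0043 FF66 FF67.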

EDIT As I started filling out the bug report I wanted to include the RFC for ISO-2022-JP. Then I found this line in the RFC: "The Kana set of JIS X 0201 is not used in ISO-2022-JP messages." So it appears that the standard doesn't actually define the upper bits. ISO-2022-JP-3 WILL map the upper bits, but to the lower plane. So to get the full JIS_X201 character set I would have to take each byte of 128 or above, subtract 0x80 from it, and pass it through an ISO-2022-JP-3 converter, and pass the bytes below 128 through the ISO-2022-JP converter. Well, it's a lot easier to just do it myself.

So strictly speaking I would say it's not a bug. It's a huge headache though.

P.S. the whole messed up stream that I'm trying to decode comes from DICOM. See pdf page 107 to see what they consider acceptable.

Budric
Very interesting. Did you try converting from 2022 to Unicode using Java (non ICU)?
Steven R. Loomis
I tried: Charset iso2022JP = Charset.forName("ISO-2022-JP"); CharBuffer result = iso2022JP.decode(ByteBuffer.wrap(bytes)); where the Charset was sun.nio.cs.ext.ISO2022_JP and the results are the same as from ICU. It's basically not mapping the chars above 128.
Budric
Budric could you file a bug on ICU at http://bugs.icu-project.org/trac/newticket and mention these findings? Really strange that neither of these support this in the encoding.
Steven R. Loomis
Well...since the other encoder does the same thing it may not be a bug. See edit above.
Budric
It's not a bug. This input sequence: "\x1B\x28\x4A" "ABC\x1b\x28\x49\x26\x27" produces: 0041 0042 0043 FF66 FF67. I added "<ESC>(I" to shift into the Katakana range, and stripped off the high bits (A6/A7 -> 26/27). The Wikipedia page you linked states "For example, the most significant bit of each byte does not carry any meaning; this allows ISO/IEC 2022 (like ISO/IEC 646) to be easily transmitted through 7-bit communication channels." Ken Lunde's CJKV book states the same. JIS X 0201 is used, but only the lower 128 values of it.
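The rewrite described in this comment (insert "<ESC>(I" before katakana bytes and strip their high bit) could be automated along these lines. This is only a sketch under assumptions: `toSevenBitJis` is a hypothetical helper, it expects payload bytes without any escape sequences of their own, and it always starts in and returns to the "<ESC>(J" Roman set. The result would then be handed to the ISO-2022-JP converter as usual.

```cpp
#include <string>

// Hypothetical helper: rewrite 8-bit JIS X 0201 payload bytes into a 7-bit
// stream that switches sets with ESC ( J (Roman) and ESC ( I (katakana).
std::string toSevenBitJis(const std::string& bytes) {
    const std::string toRoman = "\x1B(J";
    const std::string toKana = "\x1B(I";
    std::string out = toRoman;   // start in the Roman set, as in the question
    bool inKana = false;
    for (unsigned char b : bytes) {
        const bool kana = (b >= 0xA1 && b <= 0xDF);
        if (kana != inKana) {    // emit an escape only when the set changes
            out += kana ? toKana : toRoman;
            inKana = kana;
        }
        // Katakana bytes lose their high bit; everything else passes through
        // unchanged (including high bytes outside 0xA1..0xDF, which stay invalid).
        out.push_back(static_cast<char>(kana ? (b & 0x7F) : b));
    }
    if (inKana) out += toRoman;  // leave the stream back in the Roman set
    return out;
}
```

For the payload "ABC\xA6\xA7" this produces the same bytes as the sequence quoted above, plus a trailing "<ESC>(J".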
Steven R. Loomis
I looked at the PDF but didn't see any text with the high bit set after the ESC(J …
Steven R. Loomis