How can I convert a JIS X 0208 encoded string into Unicode in C++? A VC++ specific answer would be helpful.

The bigger problem I am having trouble with is that there are so many encodings for Japanese characters. JIS itself has many versions, and then there is Shift-JIS. It would be great if someone could point me towards a good explanation of these in English.

I looked through the code page identifiers in MSDN. The list does include Japanese (JIS 0208-1990 and 0121-1990), but I am wondering what the difference is between JIS 0208 and JIS X 0208.

+1  A: 

The ICU project contains many functions for converting from and to Unicode. It'll work on most OS's, including Windows. It'll handle conversions to/from pretty much all the codepages out there.

From what I can see, JIS X 0208 and JIS 0208 appear to be two variations of the name for the same thing, i.e. the actual codepage is the same.

Here's the Wikipedia article on JIS 0208. Hopefully it'll answer some of your questions, as it goes into more depth on the history of JIS and its different versions.

Glen
A: 

JIS X 0208 seems to be outdated and superseded by JIS X 0213.

Shift JIS is an encoding of JIS X 0208, i.e. an algorithm that converts the character set's two-byte character codes into an 8-bit byte representation.
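To make the "algorithm" part concrete, here is a minimal sketch of the usual arithmetic that maps one JIS X 0208 code (two 7-bit bytes, each in the range 0x21..0x7E) onto its two Shift JIS bytes. The names `SjisBytes` and `jisx0208_to_sjis` are made up for this illustration, and the sketch covers only the plain JIS X 0208 range, not vendor extensions such as those in code page 932.

```cpp
#include <cstdint>

// Hypothetical helper type: the lead and trail byte of one Shift JIS character.
struct SjisBytes {
    std::uint8_t lead;
    std::uint8_t trail;
};

// Map a JIS X 0208 code (j1, j2), each byte in 0x21..0x7E, to Shift JIS.
// Returns false if either input byte is outside that range.
bool jisx0208_to_sjis(std::uint8_t j1, std::uint8_t j2, SjisBytes& out)
{
    if (j1 < 0x21 || j1 > 0x7E || j2 < 0x21 || j2 > 0x7E) {
        return false;
    }

    // Two JIS rows share one Shift JIS lead byte.
    int lead = ((j1 - 0x21) >> 1) + 0x81;
    if (lead > 0x9F) {
        lead += 0x40;  // skip 0xA0..0xDF, used for single-byte katakana
    }

    int trail;
    if (j1 & 1) {
        // Odd JIS row: trail bytes start at 0x40 and skip 0x7F.
        trail = j2 + 0x1F;
        if (trail >= 0x7F) {
            ++trail;
        }
    } else {
        // Even JIS row: trail bytes start at 0x9F.
        trail = j2 + 0x7E;
    }

    out.lead = static_cast<std::uint8_t>(lead);
    out.trail = static_cast<std::uint8_t>(trail);
    return true;
}
```

For example, HIRAGANA LETTER A has the JIS X 0208 code 0x24 0x22, and the function above turns it into the Shift JIS bytes 0x82 0xA0.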

I found this mapping table from JIS to Unicode and this C converter from JIS X 0208 to Unicode.

Hope this helps.

devio
0213 is not a simple update to obsolete 0208. Its extensions clash with real-world-deployed supersets of 0208 such as Windows code page 932. Because of this (and because systems that want to Do It Right are moving to Unicode not anything JIS-related), there is little take-up of 0213 and its variant ‘Shift-JIS-2004’. If you meet ‘Shift-JIS’ in the real world it is probably really code page 932.
bobince
A: 

The X refers to the type of standard. All JIS standards have some classification, so "JIS 0208" is really just used as an abbreviation for "JIS X 0208".

Michael Madsen
+1  A: 

“JIS X 0208” is the name of a character set specification (i.e., it defines a set of abstract characters and assigns a number to each). The specification does not define how to encode the characters (i.e., how to represent them as a byte array). (There are three major encodings for JIS X 0208: ISO-2022-JP, EUC-JP and Shift_JIS.)
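To illustrate the difference between the character set and its encodings, here is a rough sketch that takes one JIS X 0208 code (two 7-bit bytes) and writes it out in two of those encodings. The function names are invented for this example; Shift_JIS is left out because it needs a more involved remapping of the two bytes.

```cpp
#include <cstdint>
#include <string>

// EUC-JP: each 7-bit JIS X 0208 byte is stored with its high bit set.
std::string encode_euc_jp(std::uint8_t j1, std::uint8_t j2)
{
    std::string out;
    out += static_cast<char>(j1 | 0x80);
    out += static_cast<char>(j2 | 0x80);
    return out;
}

// ISO-2022-JP: the 7-bit bytes are stored unchanged, bracketed by escape
// sequences. ESC $ B switches to JIS X 0208, ESC ( B switches back to ASCII.
std::string encode_iso_2022_jp(std::uint8_t j1, std::uint8_t j2)
{
    std::string out = "\x1B$B";
    out += static_cast<char>(j1);
    out += static_cast<char>(j2);
    out += "\x1B(B";
    return out;
}
```

For HIRAGANA LETTER A (JIS X 0208 code 0x24 0x22), the EUC-JP form is the two bytes 0xA4 0xA2, while the ISO-2022-JP form is eight bytes long: the three-byte escape, 0x24 0x22, and the three-byte escape back to ASCII. Same character number, different byte sequences.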

So “JIS X 0208 encoded string” is ambiguous. If you mean a “CP932 (the most widely used variant of Shift_JIS) encoded string”, you can use the MultiByteToWideChar() Win32 API with 932 as the first (code page) argument.
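A minimal VC++ sketch of that call might look like the following. It assumes the input really is CP932; the wrapper name `cp932_to_utf16` is hypothetical, and the `_WIN32` guard is only there so the snippet is skipped on other platforms. The function is called twice: once to ask for the required buffer size, and once to do the conversion.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <stdexcept>
#include <string>

// Convert a CP932 (Windows Shift_JIS) byte string to UTF-16.
// Throws std::runtime_error if the input contains invalid byte sequences.
std::wstring cp932_to_utf16(const std::string& input)
{
    if (input.empty()) {
        return std::wstring();
    }
    const int in_len = static_cast<int>(input.size());

    // First call: output size 0 means "tell me how many wchar_t are needed".
    const int needed = MultiByteToWideChar(932, MB_ERR_INVALID_CHARS,
                                           input.data(), in_len, nullptr, 0);
    if (needed == 0) {
        throw std::runtime_error("input is not valid CP932");
    }

    // Second call: convert into a buffer of exactly that size.
    std::wstring out(static_cast<std::size_t>(needed), L'\0');
    const int written = MultiByteToWideChar(932, MB_ERR_INVALID_CHARS,
                                            input.data(), in_len,
                                            &out[0], needed);
    if (written == 0) {
        throw std::runtime_error("CP932 conversion failed");
    }
    out.resize(static_cast<std::size_t>(written));
    return out;
}
#endif
```

On Windows, `cp932_to_utf16("\x82\xA0")` should give a one-character string containing U+3042 (HIRAGANA LETTER A). Because the input length is passed explicitly rather than as -1, the result does not include a terminating null.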

JIS 0208 and JIS X 0208 are most likely the same thing (the latter is the correct name of the specification).

“0121-1990” in MSDN must be a typo for “0212-1990”. JIS X 0212 is also a character set specification; it contains rarely used characters (mostly kanji).

habe
Thanks a lot for the clarification.
Shailesh Kumar