ansaurus

Question

Newline control characters in multi-byte character sets

Answer 1

+4 A:

All of those character sets are identical to ASCII for the first 128 code points--that is, they only use one byte to encode ASCII characters, including CR (0x0D) and LF (0x0A). You shouldn't have any problem.
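
As a quick sanity check of the claim above, here is a minimal sketch that encodes CR LF with Python's built-in codecs. The codec names (`shift_jis`, `euc_jp`, `iso2022_jp`) are Python's names for these encodings, and `newline_bytes` is a helper name invented for this illustration.

```python
# Encodings from the question, under their Python codec names.
ENCODINGS = ["shift_jis", "utf-8", "euc_jp", "iso2022_jp"]


def newline_bytes(encoding: str) -> bytes:
    """Return the bytes that the two characters CR LF encode to."""
    return "\r\n".encode(encoding)


for enc in ENCODINGS:
    # Print the hex form so the single-byte 0x0D and 0x0A are easy to see.
    print(enc, newline_bytes(enc).hex())
```

If the answer holds, every line should show the same two bytes, 0x0D followed by 0x0A.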

Alan Moore 2009-04-07 06:37:00

I am worried that even though ASCII stays the same, the second byte of a multi-byte character could also look like ASCII. Or are the extra bytes all from the "upper half"?

Thilo 2009-04-07 07:03:42

At least for UTF-8, that seems to be the case: every "second" byte looks like '10xx xxxx'.

Thilo 2009-04-07 07:06:02

In Shift-JIS, the second byte won't necessarily have the high-order bit set, but it looks like the minimum value it can have is 0x40. In EUC-JP, the second and third bytes will always be 0x80 or higher.
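
One way to check these byte ranges empirically is a brute-force scan: encode every non-ASCII character in the Basic Multilingual Plane and collect the byte values that show up. This is only a sketch that relies on the mapping tables of Python's own codecs, which may differ slightly from other implementations; `bytes_of_non_ascii_chars` is a hypothetical helper name.

```python
def bytes_of_non_ascii_chars(encoding: str) -> set[int]:
    """Collect every byte value produced when encoding non-ASCII BMP characters."""
    seen: set[int] = set()
    for cp in range(0x80, 0x10000):
        try:
            seen.update(chr(cp).encode(encoding))
        except UnicodeEncodeError:
            # Character (or surrogate) has no representation in this encoding.
            continue
    return seen


for enc in ("shift_jis", "euc_jp", "utf-8"):
    used = bytes_of_non_ascii_chars(enc)
    # True here would mean a CR or LF byte hides inside some encoded character.
    print(enc, "uses CR or LF inside characters:", bool(used & {0x0A, 0x0D}))
```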

Alan Moore 2009-04-07 07:45:44

Seems I am safe then, except for ISO-2022-JP ...

Thilo 2009-04-07 07:51:26

Who downvoted me? If it was because of ISO-2022-JP, that wasn't part of the question when I answered it, but (as the other two responders have pointed out) it's no more a problem than the other encodings.

Alan Moore 2009-04-09 04:43:28

Never mind, there's an anonymous downvoter in every crowd :-( Actually, there's no problem with CR or LF in any encoding that I'm aware of, outside the EBCDIC (IBM mainframe computers) world -- and you don't want to go there :-) ISO-2022-JP switches happily between the JIS character set and ASCII, so there is definitely no problem with CR/LF.
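
In practice this means raw bytes can be split on CR and LF before decoding. The sketch below does that for the four encodings discussed here. It assumes the data was produced by a well-behaved encoder; in particular, for ISO-2022-JP it assumes the stream switches back to ASCII before each line break, so every line can be decoded on its own. `split_encoded_lines` is a name made up for this example.

```python
def split_encoded_lines(data: bytes, encoding: str) -> list[str]:
    """Split raw bytes on CR, LF, or CRLF first, then decode each line."""
    # bytes.splitlines() only looks at the byte values 0x0A and 0x0D.
    return [line.decode(encoding) for line in data.splitlines()]


text = "日本語\r\nテスト\nabc"
for enc in ("shift_jis", "utf-8", "euc_jp", "iso2022_jp"):
    lines = split_encoded_lines(text.encode(enc), enc)
    # Splitting the bytes should give the same lines as splitting the text.
    assert lines == ["日本語", "テスト", "abc"], (enc, lines)
```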

John Machin 2010-04-05 02:42:17

Answer 2

+1 A:

ISO-2022-JP uses escape sequences (rather than Shift-In/Shift-Out) to assign different meanings to the 94 printable ASCII characters, leaving the control characters, including CR and LF, untouched.

MSalters 2009-04-07 14:11:43

Answer 3

+1 A:

None of the 4 encodings that you mention (Shift-JIS, UTF-8, EUC-JP, ISO-2022-JP) uses the CR or LF byte inside Japanese characters. For UTF-8 and EUC-JP, there is no overlap whatsoever between low ASCII characters and the bytes inside Japanese characters. For Shift-JIS and ISO-2022-JP there is overlap, but not in the range where you find CR and LF.

For ISO-2022-JP,
First-byte range: 0x21 - 0x7E
Second-byte range: 0x21 - 0x7E

And the escape sequences that switch back and forth between the various character sets are built from these bytes:

0x1B, 0x28, 0x24, 0x40, 0x42, and 0x4A

As you can see, none of the characters used to encode Japanese characters in ISO-2022-JP overlap with CR or LF.
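
To see this on real output, the following sketch encodes a string with Python's `iso2022_jp` codec, strips the escape sequences built from the bytes listed above, and looks at what remains. Which escape sequences a given encoder actually emits is an assumption here (the four below are the ones I would expect for plain ISO-2022-JP), and `payload_bytes` is a hypothetical helper.

```python
# Escape sequences assumed for plain ISO-2022-JP:
# ESC $ @ and ESC $ B select JIS X 0208, ESC ( B selects ASCII,
# ESC ( J selects JIS X 0201 Roman.
ESCAPES = (b"\x1b$@", b"\x1b$B", b"\x1b(B", b"\x1b(J")


def payload_bytes(text: str) -> bytes:
    """Encode text as ISO-2022-JP and drop the escape sequences."""
    data = text.encode("iso2022_jp")
    for esc in ESCAPES:
        data = data.replace(esc, b"")
    return data


payload = payload_bytes("日本語")
# Each remaining byte should fall in the printable range 0x21 - 0x7E.
print([hex(b) for b in payload])
```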

For Shift-JIS,
First-byte range: 0x81 - 0x9F, 0xE0 - 0xEF
Second-byte range: 0x40 - 0x7E, 0x80 - 0xFC
Half-width katakana: 0xA1 - 0xDF

Again, there is no overlap with CR and LF.
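
The ranges above are enough to walk a Shift-JIS byte string character by character. The sketch below is a simplified scanner based only on the first-byte ranges given here (vendor extensions such as CP932 use additional lead bytes and are not handled); `sjis_char_lengths` is a name invented for the example.

```python
def sjis_char_lengths(data: bytes) -> list[int]:
    """Walk Shift-JIS bytes and return the byte length of each character."""
    lengths = []
    i = 0
    while i < len(data):
        b = data[i]
        # A lead byte announces a two-byte character; everything else,
        # including CR, LF, and half-width katakana, is a single byte.
        step = 2 if (0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF) else 1
        lengths.append(step)
        i += step
    return lengths


# Three kanji, one half-width katakana, and a line feed.
sample = "日本語ｱ\n".encode("shift_jis")
print(sjis_char_lengths(sample))
```

Because 0x0A and 0x0D fall outside both the first-byte and second-byte ranges, a scanner like this never mistakes part of a character for a line break.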

保田ジェフリー 2009-04-08 19:47:18
