views: 69
answers: 4
Quick & dirty Q: Can I safely assume that a byte within a UTF-8, UTF-16 or UTF-32 codepoint (character) will never match an ASCII whitespace character (unless the codepoint actually represents one)?

I'll explain:

Say that I have a UTF-8 encoded string. This string contains some characters that take more than one byte to store. I need to find out if any of the characters in this string are ASCII whitespace characters (space, horizontal tab, vertical tab, carriage return, line feed, etc. - Unicode defines some more whitespace characters, but forget about them).

So what I do is loop through the string and check whether any of the bytes match the bytes that define whitespace characters. Take, e.g., 0D (hex) for carriage return. Note that we are talking bytes here, not characters.

Will this work? Will there be UTF-8 codepoints where the first byte will be 0D and the second byte something else - and this codepoint does not represent a carriage return? Maybe the other way around? Will there be codepoints where the first byte is something weird, and the second (or third, or fourth) byte is 0D - and this codepoint does not represent a carriage return?

UTF-8 is backwards compatible with ASCII, so I really hope that it will work for UTF-8. From what I know of it, it might, but I don't know the details well enough to say for sure.

As for UTF-16 and UTF-32 I doubt it'll work at all, but I barely know anything about the details of these, so feel free to surprise me there...


The reason for this whacky question is that I have code checking for whitespace that works for ASCII, and I need to know if it may break on Unicode. I have no choice but to check byte-for-byte, for a bunch of reasons. I'm hoping that the backwards compatibility with ASCII might give me at least UTF-8 support for free.

+4  A: 

For UTF-8, yes, you can. All non-ASCII characters are represented by bytes with the high bit set, and all ASCII characters have the high bit unset.

Just to be clear, every byte in the encoding of a non-ASCII character has the high bit set; this is by design.
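This property is exactly what makes the asker's byte-wise scan safe for UTF-8. A minimal sketch of such a scan (the function name is my own, not from the question): since every byte of a multi-byte UTF-8 sequence is 0x80 or above, a byte equal to an ASCII whitespace value can only be a genuine one-byte whitespace character.

```cpp
#include <cassert>
#include <cstddef>

// Byte-wise scan for ASCII whitespace in a UTF-8 buffer. Safe because
// every byte of a multi-byte UTF-8 sequence has its high bit set, so it
// can never compare equal to an ASCII whitespace value (all < 0x80).
bool contains_ascii_whitespace(const unsigned char* s, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        unsigned char b = s[i];
        if (b == ' ' || b == '\t' || b == '\n' ||
            b == '\v' || b == '\f' || b == '\r') {
            return true;
        }
    }
    return false;
}
```

For example, U+20AC (the euro sign) contains 0x20 in its code point value, but its UTF-8 encoding is E2 82 AC, so the scan correctly reports no whitespace.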

You should never operate on UTF-16 or UTF-32 at the byte level. This almost certainly won't work. In fact lots of things will break, since every second byte is likely to be '\0' (unless you typically work in another language).

Marcelo Cantos
For checking whitespace, every second byte being null does not really matter, but I guess I'll just drop it. UTF-8 is the one most widely used anyway, and is much better than nothing. Thank you. Answer accepted (you were first, so I guess that'd be most fair, though the other answers are good too). Good news.
oystein
The null byte issue causes other problems, because many ASCII-based functions interpret a null byte as the end-of-string marker. And you won't avoid your particular problem: sometimes the high byte will happen to be 0x20, which coincides with the space character.
Marcelo Cantos
Okay, so UTF-8 it is then :)
oystein
+2  A: 

Yes, but see caveat below about the pitfalls of processing non-byte-oriented streams in this way.

For UTF-8, continuation bytes always start with the bits 10, making them greater than 0x7f, so there's no chance they could be mistaken for an ASCII space.

You can see this in the following table:

Range              Encoding  Binary value
-----------------  --------  --------------------------
U+000000-U+00007f  0xxxxxxx  0xxxxxxx

U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                   10xxxxxx

U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                   10yyyyxx
                   10xxxxxx

U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                   10zzyyyy
                   10yyyyxx
                   10xxxxxx

You can also see that the non-continuation bytes for code points outside the ASCII range also have the high bit set, so they can never be mistaken for a space either.
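The table can be checked mechanically. Here is a hand-rolled encoder following the bit layout above (illustrative only; it does not reject surrogates or out-of-range input), which can be used to confirm that every byte of a multi-byte sequence is 0x80 or greater:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode a single code point to UTF-8 per the table above.
// Illustrative sketch: no validation of surrogates or range.
std::vector<unsigned char> utf8_encode(std::uint32_t cp) {
    std::vector<unsigned char> out;
    if (cp < 0x80) {
        out.push_back(static_cast<unsigned char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<unsigned char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<unsigned char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<unsigned char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

Encoding the boundary code points of each multi-byte range (U+0080, U+07FF, U+FFFF, U+10FFFF) and checking every output byte against 0x80 confirms the claim.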

See wikipedia UTF-8 for more detail.

UTF-16 and UTF-32 shouldn't be processed byte-by-byte in the first place. You should always process the unit itself, either a 16-bit or 32-bit value. If you do that, you're covered as well. If you process these byte-by-byte, there is a danger you'll find a 0x20 byte that is not a space (e.g., the second byte of a 16-bit UTF-16 value).
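A unit-wise UTF-16 scan along the lines the answer recommends might look like this (sketch; the function name is my own). Compared against whole 16-bit units, neither surrogates nor BMP code points like U+200D can be mistaken for whitespace, even though their individual bytes include 0x20 or 0x0D:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scan UTF-16 data as whole 16-bit code units, never as raw bytes.
// An ASCII whitespace character is always a single unit with that value.
bool utf16_contains_ascii_whitespace(const std::uint16_t* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        std::uint16_t u = s[i];
        if (u == 0x20 || u == 0x09 || u == 0x0A ||
            u == 0x0B || u == 0x0C || u == 0x0D) {
            return true;
        }
    }
    return false;
}
```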

For UTF-16, since the extended characters in that encoding are formed from a surrogate pair whose individual values are in the range 0xd800 through 0xdfff, there's no danger that these surrogate pair components could be mistaken for spaces either.

See wikipedia UTF-16 for more detail.

Finally, UTF-32 (wikipedia link here) is big enough to represent all of the Unicode code points so no special encoding is required.

paxdiablo
But for UTF-16, the non-extended range would still consist of several bytes, could not one of these be 0D? Same for UTF-32?
oystein
@oystein, yes, that's why I said you shouldn't process them byte-by-byte - clarified.
paxdiablo
Sorry, I don't have a choice, but thanks for clarifying.
oystein
@oystein, no problems, the bottom line is that what you're proposing is safe for UTF-8 but not for the other two encodings. But I'm not sure I understand your reluctance, most C compilers would have a native 16-bit and 32-bit data type that you could use, with very little sacrificed in speed. However, you know more about your requirements and constraints than I do, so I won't try to second-guess you.
paxdiablo
@paxdiablo: There's no doubt that a surrogate pair can't be confused with whitespace but that's irrelevant when scanning byte by byte; 0x20 can be found as the low-order byte of a surrogate. In general all code points U+20xx and U+xx20 will be caught by a scan for a space likewise U+0Axx and U+xx0A will cause a line feed to be detected incorrectly. The byte-by-byte scan for whitespace is utterly useless for UTF-16 and UTF-32. BTW, GB18030 is a UTF, and the bytewise scan will work :-)
John Machin
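The false positives described in the comment above are easy to demonstrate. U+200D (zero-width joiner, not whitespace) serializes big-endian as the bytes 20 0D, which a naive byte scanner reads as both a space and a carriage return:

```cpp
#include <cassert>
#include <cstddef>

// Helper for the demonstration: does any byte in the buffer equal target?
bool bytes_contain(const unsigned char* p, std::size_t n,
                   unsigned char target) {
    for (std::size_t i = 0; i < n; ++i) {
        if (p[i] == target) return true;
    }
    return false;
}
```

Running this on the big-endian bytes of U+200D, `bytes_contain` reports hits for both 0x20 and 0x0D: two spurious whitespace detections from a single non-whitespace character.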
+4  A: 

In correctly encoded UTF-8, all ASCII characters will be encoded as one byte each, and the numeric value of each byte will be equal to the Unicode and ASCII code points. Furthermore, any non-ASCII character will be encoded using only bytes that have the eighth bit set. Therefore, a byte value of 0D will always represent a carriage return, never the second or third byte of a multibyte UTF-8 sequence.

However, sometimes the UTF-8 decoding rules are abused to store ASCII characters in other ways. For example, if you take the two-byte sequence C0 A0 and UTF-8-decode it, you get the one-byte value 20, which is a space. (Any time you find the byte C0 or C1, it's the first byte of a two-byte encoding of an ASCII character.) I've seen this done to encode strings that were originally assumed to be single words, but later requirements grew to allow the value to have spaces. In order to not break existing code (which used stuff like strtok and sscanf to recognize space-delimited fields), the value was encoded using this bastardized UTF-8 instead of real UTF-8.
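A sketch of what's going on (helper names are mine): a decoder that blindly applies the two-byte rule turns C0 A0 into 0x20, even though strict UTF-8 forbids this overlong form, and the lead bytes C0 and C1 can only ever introduce such overlong encodings of ASCII.

```cpp
#include <cassert>

// Decode a two-byte sequence 110xxxxx 10yyyyyy with NO overlong check,
// the way the "bastardized UTF-8" scheme described above relies on.
unsigned decode_two_byte(unsigned char b1, unsigned char b2) {
    return ((b1 & 0x1Fu) << 6) | (b2 & 0x3Fu);
}

// C0 and C1 can only encode values 0x00-0x7F, which strict UTF-8
// requires to be encoded as single bytes; so these leads are always
// overlong and a conforming decoder must reject them.
bool is_overlong_two_byte_lead(unsigned char b) {
    return b == 0xC0 || b == 0xC1;
}
```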

You probably don't need to worry about that, though. If the input to your program uses that format, then your code probably isn't meant to detect the specially encoded whitespace at that point anyway, so it's safe for you to ignore it.

Rob Kennedy
Yes, that will be the user's responsibility - not mine.
oystein
A: 

It is strongly suggested that you not work at the byte level when dealing with Unicode. The two major platforms (Java and .NET) support Unicode natively and also provide mechanisms for determining these kinds of things. For example, in Java you can use the Character class's isSpace()/isSpaceChar()/isWhitespace() methods for your use case.

Pangea
Bah, Java :) I'm afraid I'm in the dark oblivion of some pretty nasty low-level C++ code here, so I'm kind of on my own. If I had other options I'd probably grab them - fast.
oystein
then you should be using the i18n library http://icu-project.org/apiref/icu4c/
Pangea
@Pangea: If Java truly supported Unicode natively, you'd think its `char` could (always) hold one. But it can't, so it's all just a terrible kludge.
tchrist