views: 69
answers: 4
Quick & dirty Q: Can I safely assume that a byte within a UTF-8, UTF-16 or UTF-32 codepoint (character) will never match an ASCII whitespace character (unless the codepoint actually represents one)?

I'll explain:

Say that I have a UTF-8 encoded string. This string contains some characters that take more than one byte to store. I need to find out if any of the characters in this string are ASCII whitespace characters (space, horizontal tab, vertical tab, carriage return, line feed, etc. - Unicode defines some more whitespace characters, but forget about them).

So what I do is loop through the string and check whether any of the bytes match the bytes that define whitespace characters. Take, e.g., 0D (hex) for carriage return. Note that we are talking bytes here, not characters.

Will this work? Will there be UTF-8 codepoints where the first byte will be 0D and the second byte something else - and this codepoint does not represent a carriage return? Maybe the other way around? Will there be codepoints where the first byte is something weird, and the second (or third, or fourth) byte is 0D - and this codepoint does not represent a carriage return?

UTF-8 is backwards compatible with ASCII, so I really hope that it will work for UTF-8. From what I know of it, it might, but I don't know the details well enough to say for sure.

As for UTF-16 and UTF-32 I doubt it'll work at all, but I barely know anything about the details of these, so feel free to surprise me there...


The reason for this whacky question is that I have code checking for whitespace that works for ASCII, and I need to know if it may break on Unicode. I have no choice but to check byte-for-byte, for a bunch of reasons. I'm hoping that the backwards compatibility with ASCII might give me at least UTF-8 support for free.

+4  A: 

For UTF-8, yes, you can. All non-ASCII characters are represented by bytes with the high bit set, and all ASCII characters have the high bit unset.

Just to be clear, every byte in the encoding of a non-ASCII character has the high bit set; this is by design.
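This property is exactly what makes the asker's byte-wise scan safe for UTF-8. A minimal sketch of such a scan (the function name is my own, not from the question): since every byte of a multi-byte UTF-8 sequence is 0x80 or above, a byte equal to an ASCII whitespace value can only be a genuine one-byte whitespace character.

```cpp
#include <cassert>
#include <cstddef>

// Byte-wise scan for ASCII whitespace in a UTF-8 buffer. Safe because
// every byte of a multi-byte UTF-8 sequence has its high bit set, so it
// can never compare equal to an ASCII whitespace value (all < 0x80).
bool contains_ascii_whitespace(const unsigned char* s, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        unsigned char b = s[i];
        if (b == ' ' || b == '\t' || b == '\n' ||
            b == '\v' || b == '\f' || b == '\r') {
            return true;
        }
    }
    return false;
}
```

For example, U+20AC (the euro sign) contains 0x20 in its code point value, but its UTF-8 encoding is E2 82 AC, so the scan correctly reports no whitespace.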

You should never operate on UTF-16 or UTF-32 at the byte level. This almost certainly won't work. In fact lots of things will break, since every second byte is likely to be '\0' (unless you typically work in another language).

Marcelo Cantos
For checking whitespace, every second byte being null does not really matter, but I guess I'll just drop it. UTF-8 is the one most widely used anyway, and is much better than nothing. Thank you. Answer accepted (you were first, so I guess that'd be most fair, though the other answers are good too). Good news.
oystein
The null byte issue causes other problems, because many ASCII-based functions interpret a null byte as the end-of-string marker. And you won't avoid your particular problem: sometimes the high byte will happen to be 0x20, which coincides with the space character.
Marcelo Cantos
Okay, so UTF-8 it is then :)
oystein
+2  A: 

Yes, but see caveat below about the pitfalls of processing non-byte-oriented streams in this way.

For UTF-8, continuation bytes always start with the bits 10, making them greater than 0x7f, so there's no chance they could be mistaken for an ASCII space.

You can see this in the following table:

Range              Encoding  Binary value
-----------------  --------  --------------------------
U+000000-U+00007f  0xxxxxxx  0xxxxxxx

U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                   10xxxxxx

U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                   10yyyyxx
                   10xxxxxx

U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                   10zzyyyy
                   10yyyyxx
                   10xxxxxx

You can also see that the non-continuation bytes for code points outside the ASCII range also have the high bit set, so they can never be mistaken for a space either.
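The table can be checked mechanically. Here is a hand-rolled encoder following the bit layout above (illustrative only; it does not reject surrogates or out-of-range input), which can be used to confirm that every byte of a multi-byte sequence is 0x80 or greater:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode a single code point to UTF-8 per the table above.
// Illustrative sketch: no validation of surrogates or range.
std::vector<unsigned char> utf8_encode(std::uint32_t cp) {
    std::vector<unsigned char> out;
    if (cp < 0x80) {
        out.push_back(static_cast<unsigned char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<unsigned char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<unsigned char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<unsigned char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

Encoding the boundary code points of each multi-byte range (U+0080, U+07FF, U+FFFF, U+10FFFF) and checking every output byte against 0x80 confirms the claim.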

See wikipedia UTF-8 for more detail.

UTF-16 and UTF-32 shouldn't be processed byte-by-byte in the first place. You should always process the unit itself, either a 16-bit or 32-bit value. If you do that, you're covered as well. If you process these byte-by-byte, there is a danger you'll find a 0x20 byte that is not a space (e.g., the second byte of a 16-bit UTF-16 value).
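A unit-wise UTF-16 scan along the lines the answer recommends might look like this (sketch; the function name is my own). Compared against whole 16-bit units, neither surrogates nor BMP code points like U+200D can be mistaken for whitespace, even though their individual bytes include 0x20 or 0x0D:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scan UTF-16 data as whole 16-bit code units, never as raw bytes.
// An ASCII whitespace character is always a single unit with that value.
bool utf16_contains_ascii_whitespace(const std::uint16_t* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        std::uint16_t u = s[i];
        if (u == 0x20 || u == 0x09 || u == 0x0A ||
            u == 0x0B || u == 0x0C || u == 0x0D) {
            return true;
        }
    }
    return false;
}
```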

For UTF-16, since the extended characters in that encoding are formed from a surrogate pair whose individual values are in the range 0xd800 through 0xdfff, there's no danger that these surrogate pair components could be mistaken for spaces either.

See wikipedia UTF-16 for more detail.

Finally, UTF-32 (wikipedia link here) is big enough to represent all of the Unicode code points so no special encoding is required.

paxdiablo
But for UTF-16, the non-extended range would still consist of several bytes, could not one of these be 0D? Same for UTF-32?
oystein
@oystein, yes, that's why I said you shouldn't process them byte-by-byte - clarified.
paxdiablo
Sorry, I don't have a choice, but thanks for clarifying.
oystein
@oystein, no problems, the bottom line is that what you're proposing is safe for UTF-8 but not for the other two encodings. But I'm not sure I understand your reluctance, most C compilers would have a native 16-bit and 32-bit data type that you could use, with very little sacrificed in speed. However, you know more about your requirements and constraints than I do, so I won't try to second-guess you.
paxdiablo
@paxdiablo: There's no doubt that a surrogate pair can't be confused with whitespace but that's irrelevant when scanning byte by byte; 0x20 can be found as the low-order byte of a surrogate. In general all code points U+20xx and U+xx20 will be caught by a scan for a space likewise U+0Axx and U+xx0A will cause a line feed to be detected incorrectly. The byte-by-byte scan for whitespace is utterly useless for UTF-16 and UTF-32. BTW, GB18030 is a UTF, and the bytewise scan will work :-)
John Machin
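The false positives described in the comment above are easy to demonstrate. U+200D (zero-width joiner, not whitespace) serializes big-endian as the bytes 20 0D, which a naive byte scanner reads as both a space and a carriage return:

```cpp
#include <cassert>
#include <cstddef>

// Helper for the demonstration: does any byte in the buffer equal target?
bool bytes_contain(const unsigned char* p, std::size_t n,
                   unsigned char target) {
    for (std::size_t i = 0; i < n; ++i) {
        if (p[i] == target) return true;
    }
    return false;
}
```

Running this on the big-endian bytes of U+200D, `bytes_contain` reports hits for both 0x20 and 0x0D: two spurious whitespace detections from a single non-whitespace character.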
+4  A: 

In correctly encoded UTF-8, all ASCII characters will be encoded as one byte each, and the numeric value of each byte will be equal to the Unicode and ASCII code points. Furthermore, any non-ASCII character will be encoded using only bytes that have the eighth bit set. Therefore, a byte value of 0D will always represent a carriage return, never the second or third byte of a multibyte UTF-8 sequence.

However, sometimes the UTF-8 decoding rules are abused to store ASCII characters in other ways. For example, if you take the two-byte sequence C0 A0 and UTF-8-decode it, you get the one-byte value 20, which is a space. (Any time you find the byte C0 or C1, it's the first byte of a two-byte encoding of an ASCII character.) I've seen this done to encode strings that were originally assumed to be single words, but later requirements grew to allow the value to have spaces. In order to not break existing code (which used stuff like strtok and sscanf to recognize space-delimited fields), the value was encoded using this bastardized UTF-8 instead of real UTF-8.
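A sketch of what's going on (helper names are mine): a decoder that blindly applies the two-byte rule turns C0 A0 into 0x20, even though strict UTF-8 forbids this overlong form, and the lead bytes C0 and C1 can only ever introduce such overlong encodings of ASCII.

```cpp
#include <cassert>

// Decode a two-byte sequence 110xxxxx 10yyyyyy with NO overlong check,
// the way the "bastardized UTF-8" scheme described above relies on.
unsigned decode_two_byte(unsigned char b1, unsigned char b2) {
    return ((b1 & 0x1Fu) << 6) | (b2 & 0x3Fu);
}

// C0 and C1 can only encode values 0x00-0x7F, which strict UTF-8
// requires to be encoded as single bytes; so these leads are always
// overlong and a conforming decoder must reject them.
bool is_overlong_two_byte_lead(unsigned char b) {
    return b == 0xC0 || b == 0xC1;
}
```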

You probably don't need to worry about that, though. If the input to your program uses that format, then your code probably isn't meant to detect the specially encoded whitespace at that point anyway, so it's safe for you to ignore it.

Rob Kennedy
Yes, that will be the user's responsibility - not mine.
oystein
A: 

It is strongly suggested that you not work at the byte level when dealing with Unicode. The two major platforms (Java and .NET) support Unicode natively and also provide mechanisms for determining these kinds of things. For example, in Java you can use the Character class's isSpace()/isSpaceChar()/isWhitespace() methods for your use case.

Pangea
Bah, Java :) I'm afraid I'm in the dark oblivion of some pretty nasty low-level C++ code here, so I'm kind of on my own. If I had other options I'd probably grab them - fast.
oystein
then you should be using the i18n library http://icu-project.org/apiref/icu4c/
Pangea
@Pangea: If Java truly supported Unicode natively, you'd think its `char` could (always) hold one. But it can't, so it's all just a terrible kludge.
tchrist