I have to read a text file which is Unicode with UTF-8 encoding and have to write this data to another text file. The file has tab-separated data in lines.

My reading code is C++ code without Unicode support. It reads the file line by line into a string/char* and writes that string as-is to the destination file. I can't change the code, so code-change suggestions are not welcome.

What I want to know is: while reading line by line, can I encounter a NUL terminating character ('\0') within a line, given that the file is Unicode and one character can span multiple bytes?

My thinking was that it is quite possible for a NUL terminating character to appear within a line. Your thoughts?

+1  A: 

Very unlikely: every byte in a UTF-8 multi-byte sequence has its high bit set to 1, so none of them can be 0.
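
As a rough illustration of the high-bit property, here is a minimal sketch that checks a byte string. The function name `all_high_bits_set` is made up for this example and is not part of any library.

```cpp
#include <string>

// Hypothetical helper: returns true if every byte in `bytes` has its
// most significant bit (0x80) set. Any such byte is >= 0x80, so it
// can never be the zero byte '\0'.
bool all_high_bits_set(const std::string& bytes) {
    for (unsigned char b : bytes) {
        if ((b & 0x80) == 0) {
            return false;
        }
    }
    return true;
}
```

Calling it on the UTF-8 bytes of a few non-ASCII characters, for example `all_high_bits_set("\xC3\xA9\xE2\x82\xAC")` (the bytes for "é" and "€"), should return true, while any plain ASCII character such as "A" makes it return false.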

Maurice Perry
Maurice, is there any reference supporting this statement?
Aamir
See CsTamas's message
Maurice Perry
See http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
Jon Skeet
@aamir: Check the Unicode standard. It has a very detailed description of how code points are encoded in UTF-8.
Martin York
+10  A: 

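UTF-8 uses 1 byte for all ASCII characters, which have the same code values as in the standard ASCII encoding, and up to 4 bytes for other characters. The upper bits of each byte are reserved as control bits. For code points using more than 1 byte, the control bits are set in every byte: the first byte starts with 110, 1110 or 11110, and each following byte starts with 10.

Thus there will be no 0 byte in your UTF-8 file, unless the text itself contains the NUL character (U+0000), which is encoded as a single 0 byte just as in ASCII.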
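
To make the bit layout concrete, here is a minimal sketch of a UTF-8 encoder, assuming the input is a valid code point in the range U+0000 to U+10FFFF. The names `encode_utf8` and `contains_zero_byte` are invented for this example; the sketch does not reject surrogates or out-of-range values.

```cpp
#include <string>

// Hypothetical helper: encodes one code point as UTF-8.
// Assumes cp <= 0x10FFFF; no validation is performed.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (identical to ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: true if the byte string contains a 0 byte.
bool contains_zero_byte(const std::string& bytes) {
    return bytes.find('\0') != std::string::npos;
}
```

With this sketch, `contains_zero_byte(encode_utf8(cp))` is true only for `cp == 0`, because every byte of a multi-byte sequence has at least the 0x80 bit set and every single-byte encoding equals its nonzero ASCII value.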

See the Wikipedia article on UTF-8.

CsTamas
In fact, UTF-8 was specifically designed so that this would be true, because it is useful to have an encoding in which characters in the ASCII range are stored in one byte each, and which works in a sensible way when passed to `strcpy()` and its friends.
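
A small sketch of what "works in a sensible way" means in practice, assuming the text contains no U+0000 character. The wrapper name `copy_with_strcpy` is made up for this example; `std::strlen` and `std::strcpy` are the standard C library functions.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: copies a NUL-terminated UTF-8 string with
// strcpy() and returns the copy. Because no byte of a multi-byte
// sequence is 0, strlen() and strcpy() see the whole string.
std::string copy_with_strcpy(const char* src) {
    std::vector<char> buf(std::strlen(src) + 1);  // +1 for the terminator
    std::strcpy(buf.data(), src);
    return std::string(buf.data());
}
```

For the input `"h\xC3\xA9llo"` ("héllo" in UTF-8), the copy is byte-for-byte identical to the original and is 6 bytes long: `strlen()` counts bytes, not characters, but it does not stop early.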
RBerteig