Reading a UTF-8 Unicode file through non-unicode code.

+2 Q:

I have to read a text file that is UTF-8 encoded Unicode and write its data to another text file. The file contains lines of tab-separated data.

My reading code is C++ without Unicode support. It reads the file line by line into a string/char* and writes that string as-is to the destination file. I can't change the code, so code-change suggestions are not welcome.

What I want to know is: while reading line by line, can I encounter a null terminating character ('\0') within a line, since the file is Unicode and one character can span multiple bytes?

My thinking was that it is quite possible that a NULL terminating character could be encountered within a line. Your thoughts?

+1 A:

Very unlikely: every byte in a UTF-8 multi-byte sequence has its high bit set to 1, so a zero byte can only appear if the text contains the NUL character (U+0000) itself.

Maurice Perry 2009-07-30 06:03:34

Maurice, is there any reference supporting this statement?

Aamir 2009-07-30 06:07:50

See CsTamas's message

Maurice Perry 2009-07-30 06:16:19

See http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

Jon Skeet 2009-07-30 06:16:48

@aamir: Check the Unicode standard. It has a very detailed description of how code points are encoded in UTF-8.

Martin York 2009-07-30 07:07:29

+10 A:

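UTF-8 uses 1 byte for every ASCII character, with the same code value as in standard ASCII, and 2 to 4 bytes for every other character. The upper bits of each byte are reserved as control bits. For code points that need more than 1 byte, the control bits are set: the lead byte starts with the bits 11 and each continuation byte starts with 10, so every byte of a multi-byte sequence has its high bit set.

Thus there will be no 0 byte in your UTF-8 file unless the text actually contains the NUL character (U+0000) itself.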

Check Wikipedia for UTF-8
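
The bit layout described above can be checked with a small sketch. The helper names `encode_utf8` and `has_zero_byte` are hypothetical, introduced only for illustration, and the encoder assumes its input is a valid code point (it does not reject surrogates or values above U+10FFFF):

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8.
// Illustrative helper: assumes cp <= 0x10FFFF and does not reject surrogates.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    return out;
}

// True if any byte of the encoded sequence is zero.
bool has_zero_byte(const std::string& bytes) {
    return bytes.find('\0') != std::string::npos;
}
```

Calling `has_zero_byte(encode_utf8(cp))` for any `cp` from 1 upward should return false, because every branch except the single-byte one ORs a non-zero marker into each byte; only `cp == 0` produces a zero byte.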

CsTamas 2009-07-30 06:13:44

In fact, UTF-8 was specifically designed so that this would be true, because it is useful to have an encoding in which characters in the ASCII range are stored in one byte each, and which works in a sensible way when passed to `strcpy()` and its friends.
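
As a rough illustration of that point, the sketch below copies a UTF-8 string with `strlen()` and `strcpy()` alone. The function name `copy_with_strcpy` is hypothetical, and the input is assumed to be a NUL-terminated UTF-8 string that does not contain U+0000:

```cpp
#include <cstring>
#include <string>

// Copy a NUL-terminated UTF-8 string using only byte-oriented C functions.
// Illustrative helper: the functions see bytes, not characters, yet the copy
// is intact because no multi-byte sequence contains a zero byte.
std::string copy_with_strcpy(const char* utf8) {
    std::string buf(std::strlen(utf8) + 1, '\0');  // room for bytes plus terminator
    std::strcpy(&buf[0], utf8);                    // stops at the first zero byte
    buf.resize(std::strlen(buf.c_str()));          // drop the terminator from the string
    return buf;
}
```

Note that `strlen()` here reports the number of bytes, not the number of characters, which is all that a line-by-line copy needs.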

RBerteig 2009-07-30 06:22:32
