Hello everyone,

Suppose I have a byte stream (array) and want to write C# (.NET) code that checks whether it is a valid UTF-8 byte sequence. I want to write the code from scratch because I need to report the exact location of each invalid byte sequence, and possibly remove the invalid bytes, rather than just get a yes-or-no answer on whether the byte stream/array is valid.

Is there any sample code I could use as a reference? If there is no C# code, simple samples in C++ or Java would also be appreciated. Thanks!

By invalid UTF-8 byte sequences, I mean those described here:

http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

thanks in advance, George

+1  A: 

Good point. I didn't know that invalid UTF-8 sequences exist.

The Wikipedia article is a starting point, but I don't think you can have a complete test. Can you? I am interested.

A complete test means a function that answers yes or no for every possible sequence: a total function.

The question is what to do, or what to return, if the sequence is incomplete (a short sequence). As far as I know, some editors add a special character to complete it. Maybe you should treat such cases as invalid sequences, and then your test will be complete. I wonder if this is the only such case.
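To make the "short sequence" case concrete, here is a minimal C sketch that reports whether a buffer ends partway through a multibyte sequence. The function name `utf8_incomplete_tail` is hypothetical. The check is purely structural: it looks only at lead-byte and continuation-byte bit patterns and does not validate the rest of the buffer.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns how many bytes at the end of buf[0..len)
 * belong to a multibyte sequence that has started but not finished.
 * Returns 0 if the buffer does not end in the middle of a sequence.
 */
size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
    size_t back;

    /* The lead byte of an unfinished sequence is at most 3 bytes from the end. */
    for (back = 1; back <= 3 && back <= len; back++) {
        unsigned char b = buf[len - back];
        size_t total;                        /* full length the lead byte announces */

        if ((b & 0xC0) == 0x80)
            continue;                        /* continuation byte: keep looking back */
        if ((b & 0xE0) == 0xC0)
            total = 2;
        else if ((b & 0xF0) == 0xE0)
            total = 3;
        else if ((b & 0xF8) == 0xF0)
            total = 4;
        else
            return 0;                        /* ASCII or a byte that cannot lead */

        return back < total ? back : 0;      /* unfinished only if bytes are missing */
    }
    return 0;                                /* only continuation bytes in range */
}
```

A caller reading a stream in chunks could hold back that many bytes until the next chunk arrives, and treat them as an invalid sequence only if the stream ends there.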

Anyway, I will mark this question as a favourite to keep track of the answers. Surely somebody will enlighten us.

Luixv
Confused. What do you mean by test? :-) Actually, I am a developer, and I need simple samples that implement the same function to use as a reference. I would appreciate it if you could give me some reference samples. I have some invalid UTF-8 encoded samples at hand to test with.
George2
+3  A: 

What you need is DecoderFallback. When the Encoding class is trying to convert a sequence of bytes to the target encoding, you can specify the fallback behaviour for bytes it cannot decode.

Using UTF8Encoding and DecoderReplacementFallback you can achieve just what you're looking for.

DreamSonic
@DreamSonic, when I use the following code snippet to load an XML document, how do I add fallback functions so that I can check whether it is UTF-8 encoded, as in your suggested solution above? XmlDocument xDoc = new XmlDocument(); xDoc.Load("c:\\abc.xml");
George2
In this case you should use the XmlDocument.LoadXml(...) method instead of Load(...). Open the stream, read all its bytes and try to convert them using a given encoding with Encoding.GetString(...).
DreamSonic
Otherwise the framework will use its default behaviour: that is, open the file (assuming it's UTF8), read the `<?xml encoding` directive and maybe change the encoding, then convert the contents of the file to the given encoding ignoring all the errors.
DreamSonic
+1  A: 

This is what the original question asked for, even if it isn't quite what the original poster really needed. However, I've gone and written some C code to validate a byte stream as UTF-8, and made it freely available. Maybe someone else directed to this question via a Google search will find it useful.

It takes one byte at a time, so is suitable for stream processing, and classifies everything into either valid UTF-8 or one of these possible errors in the byte sequence:

/* Ways a UTF stream can screw up */
/* a multibyte sequence without as many continuation bytes as expected.  e.g. [ef 81] 48 */
#define MISSING_CONTINUATION 1 
/* A continuation byte when not expected */
#define UNEXPECTED_CONTINUATION 2 
/* A full multibyte sequence encoding something that should have been encoded shorter */
#define OVERLONG_FORM 3
/* A full multibyte sequence encoding something larger than 10FFFF */
#define OUT_OF_RANGE 4
/* A full multibyte sequence encoding something in the range U+D800..U+DFFF */
#define BAD_SCALAR_VALUE 5
/* bytes 0xFE or 0xFF */
#define INVALID 6

This validator has the nice property that if a and b are valid UTF-8 byte streams, and x is some other stream of bytes, then the concatenation a + x + b will be decoded as all of the characters encoded in a, then some combination of characters and errors, then all of the characters encoded in b. That is, an invalid sequence of bytes can't eat validly encoded characters that start after the bad bytes.
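As a rough illustration of the classification and resynchronisation described above (this is not the code this answer refers to), here is a minimal C sketch. The names `utf8_find_error`, `utf8_strip_invalid`, `UTF8_OK` and the `skip` out-parameter are hypothetical, and the sketch scans a whole buffer rather than taking one byte at a time. It assumes lead bytes 0xF8..0xFD start 5- and 6-byte forms, which then fail as OVERLONG_FORM or OUT_OF_RANGE.

```c
#include <stddef.h>
#include <string.h>

/* Error codes mirror the list above; UTF8_OK is added for this sketch. */
#define UTF8_OK 0
#define MISSING_CONTINUATION 1
#define UNEXPECTED_CONTINUATION 2
#define OVERLONG_FORM 3
#define OUT_OF_RANGE 4
#define BAD_SCALAR_VALUE 5
#define INVALID 6

/*
 * Hypothetical helper: returns the offset of the first invalid sequence in
 * buf[0..len), or -1 if the buffer is valid. On error, *err gets a code from
 * the list above and *skip the number of bytes the bad sequence occupies.
 */
long utf8_find_error(const unsigned char *buf, size_t len, int *err, size_t *skip)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = buf[i];
        size_t need, k;       /* continuation bytes expected / loop index */
        unsigned long cp;     /* code point being assembled */
        unsigned long min;    /* smallest code point allowed for this length */

        if (b < 0x80) { i++; continue; }            /* ASCII */
        *skip = 1;
        if (b < 0xC0) { *err = UNEXPECTED_CONTINUATION; return (long)i; }
        else if (b < 0xE0) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if (b < 0xF0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if (b < 0xF8) { need = 3; cp = b & 0x07; min = 0x10000; }
        else if (b < 0xFC) { need = 4; cp = b & 0x03; min = 0x200000; }
        else if (b < 0xFE) { need = 5; cp = b & 0x01; min = 0x4000000; }
        else { *err = INVALID; return (long)i; }    /* 0xFE or 0xFF */

        for (k = 1; k <= need; k++) {
            if (i + k >= len || (buf[i + k] & 0xC0) != 0x80) {
                /* Stop before the offending byte so it is examined again. */
                *err = MISSING_CONTINUATION;
                *skip = k;
                return (long)i;
            }
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        *skip = need + 1;
        if (cp < min) { *err = OVERLONG_FORM; return (long)i; }
        if (cp > 0x10FFFF) { *err = OUT_OF_RANGE; return (long)i; }
        if (cp >= 0xD800 && cp <= 0xDFFF) { *err = BAD_SCALAR_VALUE; return (long)i; }
        i += need + 1;
    }
    *err = UTF8_OK;
    *skip = 0;
    return -1;
}

/* Hypothetical helper: drops every invalid sequence in place, returns the new length. */
size_t utf8_strip_invalid(unsigned char *buf, size_t len)
{
    size_t in = 0, out = 0;

    while (in < len) {
        int err;
        size_t skip;
        long bad = utf8_find_error(buf + in, len - in, &err, &skip);
        size_t good = (bad < 0) ? len - in : (size_t)bad;

        memmove(buf + out, buf + in, good);         /* keep the valid run */
        out += good;
        in += good;
        if (bad >= 0)
            in += skip;                             /* step over the bad bytes only */
    }
    return out;
}
```

For the bytes 48 ef 81 48, `utf8_find_error` reports offset 1 with MISSING_CONTINUATION and a skip of 2, so the trailing 48 is still examined as a character of its own. A caller that wants to report every error location, as the question asks, can loop the same way `utf8_strip_invalid` does.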

Daniel Martin