ansaurus

Question

Answer 1

+1 A:

Nice point. I didn't know that invalid UTF-8 sequences exist.

The Wikipedia article is a starting point, but I don't think you can have a complete test. Can you? I am interested.

A complete test means a function that answers yes or no for every possible byte sequence: a total function.

The open question is what to do, or what to return, when a sequence is incomplete (a short sequence). As far as I know, some editors add a special character to pad it out. Maybe you should treat such cases as invalid sequences; then your test will be complete. I wonder if this is the only such case.
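To make the idea of a total yes/no test concrete, here is a minimal sketch in C. It is not taken from any library; the name utf8_is_valid is hypothetical, and it assumes the usual definition of well-formed UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF). A truncated sequence is simply reported as invalid, so every input gets an answer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the n bytes at s are well-formed
   UTF-8, 0 otherwise. A truncated (short) sequence counts as invalid,
   so the function answers yes or no for every possible input. */
int utf8_is_valid(const unsigned char *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL || n == 0);

    while (i < n) {
        unsigned char c = s[i];
        size_t len;         /* total length of this sequence in bytes */
        unsigned long cp;   /* code point decoded so far */
        unsigned long min;  /* smallest code point allowed for this length */

        if (c < 0x80) { i++; continue; }          /* ASCII */
        else if (c < 0xC0) return 0;              /* stray continuation byte */
        else if (c < 0xE0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if (c < 0xF0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if (c < 0xF8) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return 0;                            /* 0xF8..0xFF are never valid */

        if (n - i < len) return 0;                /* truncated sequence */

        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;  /* not a continuation */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min) return 0;                       /* overlong form */
        if (cp > 0x10FFFFUL) return 0;                /* beyond Unicode range */
        if (cp >= 0xD800UL && cp <= 0xDFFFUL) return 0;  /* surrogate */

        i += len;
    }
    return 1;
}
```

With this sketch, the two bytes EF 81 on their own give 0 (short sequence), while C3 A9 gives 1.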

Anyway, I will mark this question as a favourite in order to keep track of the answers. I'm sure somebody will enlighten us.

Luixv 2009-05-15 05:47:03

Confused. What do you mean by test? :-) Actually I am a developer, and I need simple samples that implement this function to use as a reference. I'd appreciate it if you could give me some reference samples. I have some invalid UTF-8 encoded samples at hand to test with.

George2 2009-05-15 05:53:21

Answer 2

+3 A:

What you need is DecoderFallback. When the Encoding class is trying to convert a sequence of bytes to the target encoding, you can specify fallback behaviour:

Either report the error and stop processing (DecoderExceptionFallback).
Or find the error and replace it (DecoderReplacementFallback).

Using UTF8Encoding and DecoderReplacementFallback you can achieve just what you're looking for.

DreamSonic 2009-05-15 06:37:19

@DreamSonic, when I use the following code snippet to load an XML document and want to check whether it is UTF-8 encoded using your suggested solution above, how do I add the fallback? XmlDocument xDoc = new XmlDocument(); xDoc.Load("c:\\abc.xml");

George2 2009-05-18 06:37:58

In this case you should use the XmlDocument.LoadXml(...) method instead of the Load(...). You should open the stream, read all its bytes and try to convert them to a given encoding using Encoding.GetString(...).

DreamSonic 2009-05-18 12:54:47

Otherwise the framework will use its default behaviour: that is, open the file (assuming it's UTF8), read the `<?xml encoding` directive and maybe change the encoding, then convert the contents of the file to the given encoding ignoring all the errors.

DreamSonic 2009-05-18 12:59:26

Answer 3

+1 A:

This is what the original question asked for, even if it isn't quite what the original poster really needed: I've written some C code to validate a byte stream as UTF-8 and made it freely available. Maybe someone else directed to this question via a Google search will find it useful.

It takes one byte at a time, so is suitable for stream processing, and classifies everything into either valid UTF-8 or one of these possible errors in the byte sequence:

/* Ways a UTF stream can screw up */
/* a multibyte sequence without as many continuation bytes as expected.  e.g. [ef 81] 48 */
#define MISSING_CONTINUATION 1 
/* A continuation byte when not expected */
#define UNEXPECTED_CONTINUATION 2 
/* A full multibyte sequence encoding something that should have been encoded shorter */
#define OVERLONG_FORM 3
/* A full multibyte sequence encoding something larger than 10FFFF */
#define OUT_OF_RANGE 4
/* A full multibyte sequence encoding something in the range U+D800..U+DFFF */
#define BAD_SCALAR_VALUE 5
/* bytes 0xFE or 0xFF */
#define INVALID 6

This validator has the nice property that if a and b are valid UTF-8 byte streams, and x is some other stream of bytes, then the concatenation a + x + b will be decoded as all of the characters encoded in a, then some combination of characters and errors, then all of the characters encoded in b. That is, an invalid sequence of bytes can't eat validly encoded characters that start after the bad bytes.
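For readers who want to see how a byte-at-a-time classifier with this property might look, here is a rough sketch in C. It is not the code referred to above; the names utf8_stream, utf8_feed, utf8_finish and utf8_scan are hypothetical. It assumes that bytes 0xF8..0xFD are treated as lead bytes of 5- and 6-byte sequences (which then fail as overlong or out of range), so that INVALID stays reserved for 0xFE and 0xFF as in the listing. The key step is that a non-continuation byte arriving mid-sequence reports the error and is then processed as a fresh start.

```c
#include <assert.h>
#include <stddef.h>

/* Error codes copied from the listing above. */
#define MISSING_CONTINUATION 1
#define UNEXPECTED_CONTINUATION 2
#define OVERLONG_FORM 3
#define OUT_OF_RANGE 4
#define BAD_SCALAR_VALUE 5
#define INVALID 6

/* Hypothetical decoder state; zero-initialise before the first byte. */
typedef struct {
    unsigned long cp;   /* code point accumulated so far */
    unsigned long min;  /* smallest code point legal for this length */
    int need;           /* continuation bytes still expected */
    long chars;         /* valid characters seen so far */
    long errors;        /* errors seen so far */
} utf8_stream;

/* Classify a completed sequence: 0 if valid, else an error code. */
static int utf8_classify(unsigned long cp, unsigned long min)
{
    if (cp < min) return OVERLONG_FORM;
    if (cp > 0x10FFFFUL) return OUT_OF_RANGE;
    if (cp >= 0xD800UL && cp <= 0xDFFFUL) return BAD_SCALAR_VALUE;
    return 0;
}

/* Feed one byte. Writes 0, 1 or 2 error codes to errs; returns the count. */
int utf8_feed(utf8_stream *st, unsigned char b, int errs[2])
{
    int n = 0, e;

    assert(st != NULL);

    if ((b & 0xC0) == 0x80) {               /* continuation byte */
        if (st->need == 0) {
            errs[n++] = UNEXPECTED_CONTINUATION;
        } else {
            st->cp = (st->cp << 6) | (b & 0x3F);
            if (--st->need == 0) {          /* sequence is complete */
                e = utf8_classify(st->cp, st->min);
                if (e) errs[n++] = e; else st->chars++;
            }
        }
        st->errors += n;
        return n;
    }

    /* Any other byte cuts a pending sequence short. Report that, then
       treat b as a fresh start so it is not swallowed by the error. */
    if (st->need > 0) { errs[n++] = MISSING_CONTINUATION; st->need = 0; }

    if (b < 0x80)      { st->chars++; }     /* ASCII */
    else if (b < 0xE0) { st->need = 1; st->cp = b & 0x1F; st->min = 0x80; }
    else if (b < 0xF0) { st->need = 2; st->cp = b & 0x0F; st->min = 0x800; }
    else if (b < 0xF8) { st->need = 3; st->cp = b & 0x07; st->min = 0x10000; }
    else if (b < 0xFC) { st->need = 4; st->cp = b & 0x03; st->min = 0x200000; }
    else if (b < 0xFE) { st->need = 5; st->cp = b & 0x01; st->min = 0x4000000; }
    else               { errs[n++] = INVALID; }   /* 0xFE or 0xFF */

    st->errors += n;
    return n;
}

/* Call at end of input: reports a sequence cut off by the end of the stream. */
int utf8_finish(utf8_stream *st)
{
    if (st->need == 0) return 0;
    st->need = 0;
    st->errors++;
    return MISSING_CONTINUATION;
}

/* Convenience wrapper: feed a whole buffer, then finish. */
void utf8_scan(utf8_stream *st, const unsigned char *s, size_t n)
{
    int errs[2];
    for (size_t i = 0; i < n; i++) utf8_feed(st, s[i], errs);
    utf8_finish(st);
}
```

Feeding the example from the listing, [ef 81] 48, through this sketch counts one MISSING_CONTINUATION error and still counts the trailing 0x48 ('H') as a valid character.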

Daniel Martin 2009-05-21 11:23:07


looking for samples to validate UTF-8