ansaurus

Question

Answer 1

+5 A:

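var buffer = new char[32768];

// throwOnInvalidBytes = true (second argument) makes decoding fail on malformed input.
using (var stream = new StreamReader(pathToFile,
    new UTF8Encoding(true, true)))
{
    while (true)
    {
        try
        {
            // Read returns 0 only at the end of the stream.
            if (stream.Read(buffer, 0, buffer.Length) == 0)
                return GoodUTF8File;
        }
        catch (ArgumentException) // DecoderFallbackException derives from ArgumentException
        {
            return BadUTF8File;
        }
    }
}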

Anton Tykhyy 2009-05-18 05:45:27

But if a multi-byte character spans two chunks, how do you handle that situation?

George2 2009-05-18 05:48:20

@George - the reader will deliver *decoded* chunks, which you just discard. If the entire stream decodes, it was valid. No question of encoded *bytes* spanning the chunks of *chars* you read.

Software Monkey 2009-05-18 05:51:30

@Software Monkey, I am confused about what you mean by "the reader will deliver". Could you show your code snippet please?

George2 2009-05-18 05:56:33

Just keep calling TextReader.Read(char[], int, int), reusing the same buffer. The reader makes sure that it copes with multi-byte characters.

Jon Skeet 2009-05-18 06:01:41
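
The point made in the comments above, that the reader copes with multi-byte characters split across chunks, can be illustrated with a small sketch. It uses System.Text.Decoder directly rather than StreamReader, on the assumption that the reader keeps comparable state internally; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

static class ChunkBoundaryDemo
{
    // Decodes "é" (bytes 0xC3 0xA9) when it arrives as two one-byte chunks.
    public static string DecodeSplitCharacter()
    {
        // A Decoder is stateful: it remembers incomplete byte sequences between calls.
        Decoder decoder = new UTF8Encoding(false, true).GetDecoder();
        char[] chars = new char[4];

        // The first chunk ends in the middle of the character, so no char is produced yet.
        int first = decoder.GetChars(new byte[] { 0xC3 }, 0, 1, chars, 0);

        // The second chunk completes the sequence and the decoder emits one char.
        int second = decoder.GetChars(new byte[] { 0xA9 }, 0, 1, chars, first);

        return new string(chars, 0, first + second);
    }
}
```

Because the chunks handed to the caller are already decoded chars, the caller never sees a partial byte sequence.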

Confused. TextReader does not have a constructor which supports reading files. BTW: could you show a simple sample code snippet please?

George2 2009-05-18 06:33:24

@Jon Skeet and @Anton Tykhyy, I find it very strange that loading the file with XmlDocument, and loading it with BinaryReader and then checking it against the UTF-8 encoding, give different results. Any ideas?

George2 2009-05-18 07:02:41

@George2 You can create a FileStream and pass that to TextReader. That is how you use streams effectively with the readers.

Spence 2009-05-18 08:44:56

No, Spence: TextReader is an abstract base class for StreamReader and StringReader.

ChrisW 2009-05-18 08:46:16

@Spence and @ChrisW, please see the EDIT1 section of my original post for the content of the XML file I am using. My confusion is that XmlDocument.Load treats it as an invalid UTF-8 document, but a UTF-8 TextReader treats it as validly encoded (no exceptions). Any ideas what is wrong?

George2 2009-05-18 08:55:10

I have found a solution for filtering out the invalid UTF-8 byte sequences, but ran into a new issue here: http://stackoverflow.com/questions/877338/where-is-leak-in-my-code I would appreciate it if you could take a look and share insights. :-)

George2 2009-05-18 11:54:01

Answer 2

+3 A:

@George2 I think they mean a solution like the following (which I haven't tested).

Handling the transition between buffers (i.e. caching extra bytes or partial chars between reads) is the responsibility of StreamReader and an internal detail of its implementation.

using System;
using System.IO;
using System.Text;

class Test 
{
    public static void Main() 
    {
        try 
        {
            // Create an instance of StreamReader to read from a file.
            // The using statement also closes the StreamReader.
            // Note: Encoding.UTF8 substitutes U+FFFD for invalid bytes instead of throwing.
            using (StreamReader sr = new StreamReader(
                "TestFile.txt",
                Encoding.UTF8
                ))
            {
                const int bufferSize = 1000; // could be anything
                char[] buffer = new char[bufferSize];
                // Read until Read returns 0, which signals the end of the file.
                // (A read may return fewer than bufferSize chars before the end.)
                while (sr.Read(buffer, 0, bufferSize) > 0) 
                {
                    // successfully decoded another chunk of data
                }
            }
        }
        catch (Exception e) 
        {
            // Let the user know what went wrong.
            Console.WriteLine("The file could not be read:");
            Console.WriteLine(e.Message);
        }
    }
}

ChrisW 2009-05-18 07:07:09

@ChrisW, a small bug: Read(buffer, bufferSize, 0) should be Read(buffer, 0, bufferSize). :-) Another issue is that your method and XmlDocument.Load give different results. Your method never throws an exception even if there are invalid UTF-8 byte sequences in the underlying file (e.g. TestFile.txt), but XmlDocument.Load does throw. Please refer to the EDIT1 section of my original post. Any ideas what is wrong?

George2 2009-05-18 07:34:51
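
One factor that may account for the missing exception: the shared Encoding.UTF8 instance replaces invalid bytes with U+FFFD instead of throwing, whereas a UTF8Encoding constructed with throwOnInvalidBytes = true raises DecoderFallbackException. A minimal sketch of the difference follows; the class name and the sample bytes are illustrative and are not taken from the file discussed in this thread.

```csharp
using System;
using System.Text;

static class FallbackDemo
{
    // 0xC3 starts a two-byte sequence, but 0x28 is not a valid continuation byte.
    static readonly byte[] Invalid = { 0xC3, 0x28 };

    // Encoding.UTF8 decodes leniently and inserts the replacement character U+FFFD.
    public static bool LenientReplaces()
    {
        string decoded = Encoding.UTF8.GetString(Invalid);
        return decoded.IndexOf('\uFFFD') >= 0;
    }

    // A strict UTF8Encoding throws on the same input.
    public static bool StrictThrows()
    {
        try
        {
            new UTF8Encoding(false, true).GetString(Invalid);
            return false;
        }
        catch (DecoderFallbackException)
        {
            return true;
        }
    }
}
```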

I don't know (I was only giving a code example to parrot the suggestions below). What exception are you catching? Do you know (independently) whether or not the UTF8 in the file is correct? If you're sure it's incorrect, and the code above isn't failing, try running the code with Visual Studio set to catch exceptions when they're thrown, instead of only when they're unhandled? Because maybe (though I wouldn't know why) the StreamReader implementation silently catches any Encoding exceptions.

ChrisW 2009-05-18 08:23:13

@ChrisW, my XML file is simple and small. Its content is shown at http://i42.tinypic.com/wioc9c.jpg When using XmlDocument.Load, the XML file is treated as invalid UTF-8, but with your method it is treated as validly encoded, with no exceptions. Any ideas?

George2 2009-05-18 08:51:25

If you want to read the file using XmlDocument.Load, I'd try removing the beginning-of-file marker: the first three bytes, 0xEF 0xBB 0xBF.

ChrisW 2009-05-18 08:58:10

@ChrisW, when using XmlDocument.Load, you can see that the invalid bytes are not the first three. I have uploaded my original file for you to debug: http://www.filefactory.com/file/ag00da3/n/a_xml You can see it is very strange! XmlDocument.Load reports invalid UTF-8 byte sequences, but your method reports the file as healthy. :-) Any ideas what is wrong?

George2 2009-05-18 09:03:36

catch (Exception e) is a really bad idea.

Anton Tykhyy 2009-05-18 09:43:06

@Anton, I agree; I just want to know what is wrong. Any ideas on how to change the default system behavior (which converts each invalid character to '?') and invoke my own replacement fallback instead? I want to replace all invalid characters with nothing.

George2 2009-05-18 10:15:19

@George2: in this case of catch(Exception), you'll also catch file-not-found, access denied, etc. which is likely not what you need. Re change default behaviour: create your own DecoderReplacementFallback object.

Anton Tykhyy 2009-05-18 10:33:58
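
The DecoderReplacementFallback suggestion can be sketched as follows. This is a minimal illustration, assuming the goal is to drop invalid bytes entirely; the class and method names are hypothetical.

```csharp
using System.Text;

static class StripInvalidDemo
{
    // Decodes UTF-8 and replaces every invalid byte sequence with an empty string.
    public static string DecodeDroppingInvalid(byte[] bytes)
    {
        // GetEncoding with explicit fallbacks returns an encoding that uses them
        // instead of the default replacement character.
        Encoding lenient = Encoding.GetEncoding(
            "utf-8",
            new EncoderExceptionFallback(),
            new DecoderReplacementFallback(""));
        return lenient.GetString(bytes);
    }
}
```

The same encoding object could be passed to a StreamReader constructor to filter a whole file while reading it.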

I totally agree, and I will apply your comments when I convert my prototype code into production-level code. :-) I have written some code myself; please refer to the EDIT2 section of my original post. I found that with new UTF8Encoding(true, true) an exception is thrown, but with new UTF8Encoding(false, true) no exception is thrown. I am confused, because it should be the 2nd parameter that controls whether an exception is thrown for invalid byte sequences. Why does the 1st parameter matter?

George2 2009-05-18 10:58:09
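
One plausible explanation, offered as an assumption rather than something confirmed in this thread: when the supplied encoding has no preamble and the file starts with a UTF-8 byte order mark, StreamReader's BOM detection may swap in the lenient Encoding.UTF8, which would silence the exception. The sketch below sidesteps the question by turning detection off through the constructor's detectEncodingFromByteOrderMarks parameter, so the strict encoding is the one actually used. The class and method names are hypothetical.

```csharp
using System.IO;
using System.Text;

static class StrictReaderDemo
{
    // Returns true if the whole file decodes as UTF-8 without errors.
    public static bool IsValidUtf8(string path)
    {
        var strict = new UTF8Encoding(false, true);

        // Third argument false: do not let BOM detection replace 'strict'.
        using (var reader = new StreamReader(path, strict, false))
        {
            char[] buffer = new char[4096];
            try
            {
                // Discard the decoded chars; only the absence of errors matters.
                while (reader.Read(buffer, 0, buffer.Length) > 0) { }
                return true;
            }
            catch (DecoderFallbackException)
            {
                return false;
            }
        }
    }
}
```

With detection disabled and no preamble on the encoding, a leading BOM is simply decoded as the character U+FEFF, which is valid UTF-8 and does not affect the result.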

I have found a solution for filtering out the invalid UTF-8 byte sequences, but ran into a new issue here: http://stackoverflow.com/questions/877338/where-is-leak-in-my-code I would appreciate it if you could take a look and share insights. :-)

George2 2009-05-18 11:54:10

decode a file stream using UTF-8
