Hello everyone,

I have a very large input file (about 120 MB) and I do not want to load it into memory all at once. I want to check whether the file is valid UTF-8. Any ideas for a quick check that does not read the whole file into memory as a byte[]? Simple sample code would be appreciated.

I am using VSTS 2008 + C#.

thanks in advance, George

I find it very strange: when I use XmlDocument to load an XML document that contains invalid byte sequences, an exception is thrown, but when I read all of the content into a byte array and then decode it as UTF-8, there is no exception. Any ideas?

Here is the content of my XML file,

http://i42.tinypic.com/wioc9c.jpg

You can download the file from,

http://www.filefactory.com/file/ag00da3/n/a_xml

EDIT1:

using System;
using System.IO;
using System.Text;
using System.Xml;

class Program
{
    public static byte[] RawReadingTest(string fileName)
    {
        byte[] buff = null;

        try
        {
            FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read);
            BinaryReader br = new BinaryReader(fs);
            long numBytes = new FileInfo(fileName).Length;
            buff = br.ReadBytes((int)numBytes);
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }

        return buff;
    }

    static void XMLTest()
    {
        try
        {
            XmlDocument xDoc = new XmlDocument();
            xDoc.Load("c:\\abc.xml");
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }

    static void Main()
    {
        try
        {
            XMLTest();
            Encoding ae = Encoding.GetEncoding("utf-8");
            string filename = "c:\\abc.xml";
            ae.GetString(RawReadingTest(filename));
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }

        return;
    }
}

EDIT2: I have found that new UTF8Encoding(true, true) throws an exception, but new UTF8Encoding(false, true) does not. I am confused, because it should be the 2nd parameter that controls whether an exception is thrown on invalid byte sequences. Why does the 1st parameter matter?

    public static void TestTextReader2()
    {
        try
        {
            // Create an instance of StreamReader to read from a file.
            // The using statement also closes the StreamReader.
            using (StreamReader sr = new StreamReader(
                "c:\\a.xml",
                new UTF8Encoding(true, true)
                ))
            {
                int bufferSize = 10 * 1024 * 1024; //could be anything
                char[] buffer = new char[bufferSize];
                // Read from the file until the end of the file is reached.
                int actualsize = sr.Read(buffer, 0, bufferSize);
                while (actualsize > 0)
                {
                    actualsize = sr.Read(buffer, 0, bufferSize);
                }
            }
        }
        catch (Exception e)
        {
            // Let the user know what went wrong.
            Console.WriteLine("The file could not be read:");
            Console.WriteLine(e.Message);
        }

    }
+5  A: 
var buffer = new char[32768] ;

using (var stream = new StreamReader (pathToFile, 
    new UTF8Encoding (true, true)))
{
    while (true)
    try
    {
        if (stream.Read (buffer, 0, buffer.Length) == 0)
            return GoodUTF8File ;
    }
    catch (ArgumentException)
    {
        return BadUTF8File ;
    }
}
Anton Tykhyy
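For readers who want something they can paste and run, here is a minimal self-contained sketch of the same streaming check. The class name Utf8Check and the method name IsValidUtf8 are made up for illustration and are not from the answer above. It also assumes BOM detection should be switched off, so that a file starting with a UTF-16 byte order mark is not silently decoded with a different encoding.

```csharp
using System.IO;
using System.Text;

static class Utf8Check
{
    // Hypothetical helper: returns true if the whole file decodes as strict UTF-8.
    public static bool IsValidUtf8(string path)
    {
        // Second argument (throwOnInvalidBytes: true) makes the decoder throw
        // instead of substituting U+FFFD for invalid sequences.
        var strictUtf8 = new UTF8Encoding(true, true);
        var buffer = new char[32768];

        // Third argument false: do not let the reader switch encodings
        // based on a byte order mark (an assumption of this sketch).
        using (var reader = new StreamReader(path, strictUtf8, false))
        {
            try
            {
                // The decoded chars are discarded; only success or failure matters.
                while (reader.Read(buffer, 0, buffer.Length) > 0) { }
                return true;
            }
            catch (DecoderFallbackException)
            {
                return false;
            }
        }
    }
}
```

DecoderFallbackException derives from ArgumentException, so catching ArgumentException as in the answer also works. Catching the narrower type just avoids swallowing unrelated argument errors. Only the 32768-char buffer and the reader's small internal byte buffer are held in memory, regardless of file size.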
But if a multi-byte character spans two chunks, how do you handle that situation?
George2
@George - the reader will deliver *decoded* chunks, which you just discard. If the entire stream decodes, it was valid. No question of encoded *bytes* spanning the chunks of *chars* you read.
Software Monkey
@Software Monkey, I am confused about what you mean by "the reader will deliver" -- could you show your code snippet please?
George2
Just keep calling TextReader.Read(char[], int, int), reusing the same buffer. The reader makes sure that it copes with multi-byte characters.
Jon Skeet
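A small sketch may make the point about chunk boundaries concrete. It is not from any commenter. The names ChunkBoundaryDemo and DecodeInSmallChunks are hypothetical, and the 128-byte internal buffer and 7-char read buffer are arbitrary small sizes chosen so that multi-byte sequences are forced to straddle buffer boundaries.

```csharp
using System.IO;
using System.Text;

static class ChunkBoundaryDemo
{
    // Decodes the bytes as strict UTF-8 through a StreamReader that has a
    // deliberately tiny internal byte buffer, reading into a tiny char buffer.
    public static string DecodeInSmallChunks(byte[] bytes)
    {
        var result = new StringBuilder();
        var chars = new char[7];

        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, new UTF8Encoding(false, true), false, 128))
        {
            int read;
            // Each call returns only whole decoded chars; partial byte sequences
            // are held back inside the reader until the rest arrives.
            while ((read = reader.Read(chars, 0, chars.Length)) > 0)
            {
                result.Append(chars, 0, read);
            }
        }

        return result.ToString();
    }
}
```

For example, encoding "a" followed by 500 euro signs (3 bytes each) puts a euro sign across the 128-byte boundary, yet DecodeInSmallChunks returns the original string unchanged and throws nothing.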
Confused. TextReader does not have a constructor that supports reading files. BTW: could you show a simple sample code snippet please?
George2
@Jon Skeet and @Anton Tykhyy, I find it very strange that loading the file with XmlDocument and loading it with BinaryReader and then decoding it as UTF-8 give different results. Any ideas?
George2
@George2 You can create a FileStream and pass that to TextReader. That is how you use streams effectively with the readers.
Spence
No, Spence: TextReader is an abstract base class for StreamReader and StringReader.
ChrisW
@Spence and @ChrisW, please see the EDIT1 section of my original post for the content of the XML file I am using. My confusion is that XmlDocument.Load treats it as an invalid UTF-8 document, but the UTF-8 TextReader treats it as validly encoded (no exceptions). Any ideas what is wrong?
George2
I have found a solution for filtering out the invalid UTF-8 byte sequences, but ran into a new issue here: http://stackoverflow.com/questions/877338/where-is-leak-in-my-code I would appreciate it if you could take a look and share insights. :-)
George2
+3  A: 

@George2 I think they mean a solution like the following (which I haven't tested).

Handling the transition between buffers (i.e. caching extra bytes/partial chars between reads) is the responsibility of StreamReader and an internal detail of its implementation.

using System;
using System.IO;
using System.Text;

class Test 
{
    public static void Main() 
    {
        try 
        {
            // Create an instance of StreamReader to read from a file.
            // The using statement also closes the StreamReader.
            using (StreamReader sr = new StreamReader(
                "TestFile.txt",
                Encoding.UTF8
                ))
            {
                const int bufferSize = 1000; //could be anything
                char[] buffer = new char[bufferSize];
                // Read from the file until the end of the file is reached.
                while (bufferSize == sr.Read(buffer, bufferSize, 0)) 
                {
                    //successfully decoded another buffer's-worth of data
                }
            }
        }
        catch (Exception e) 
        {
            // Let the user know what went wrong.
            Console.WriteLine("The file could not be read:");
            Console.WriteLine(e.Message);
        }
    }
}
ChrisW
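One likely reason the code above never throws on bad input is that Encoding.UTF8 does not raise on invalid bytes. It substitutes the replacement character U+FFFD instead. The sketch below contrasts that with a strict UTF8Encoding on an in-memory sample. The class and member names (FallbackDemo, Sample, DecodeLenient, StrictDecodeThrows) are illustrative only.

```csharp
using System.Text;

static class FallbackDemo
{
    // 'A', then a lone 0xFF byte (never valid in UTF-8), then 'B'.
    public static readonly byte[] Sample = { 0x41, 0xFF, 0x42 };

    // Encoding.UTF8 uses a replacement fallback: invalid bytes become U+FFFD.
    public static string DecodeLenient(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // A UTF8Encoding built with throwOnInvalidBytes: true throws instead.
    public static bool StrictDecodeThrows(byte[] bytes)
    {
        try
        {
            new UTF8Encoding(false, true).GetString(bytes);
            return false;
        }
        catch (DecoderFallbackException)
        {
            return true;
        }
    }
}
```

Here DecodeLenient(FallbackDemo.Sample) yields "A\uFFFDB" with no exception, while StrictDecodeThrows(FallbackDemo.Sample) returns true.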
@ChrisW, a small bug: Read(buffer, bufferSize, 0) should be Read(buffer, 0, bufferSize). :-) Another issue is that your method and XmlDocument.Load give different results. Your method never throws an exception even if there are invalid UTF-8 byte sequences in the underlying file (e.g. TestFile.txt), but XmlDocument.Load does throw. Please refer to the EDIT1 section of my original post. Any ideas what is wrong?
George2
I don't know (I was only giving a code example to parrot the suggestions below). What exception are you catching? Do you know (independently) whether or not the UTF8 in the file is correct? If you're sure it's incorrect, and the code above isn't failing, try running the code with Visual Studio set to catch exceptions when they're thrown, instead of only when they're unhandled? Because maybe (though I wouldn't know why) the StreamReader implementation silently catches any Encoding exceptions.
ChrisW
@ChrisW, my XML file is simple and small; the content is at http://i42.tinypic.com/wioc9c.jpg When using XmlDocument.Load, the XML file is treated as invalid UTF-8, but when using your method it is treated as validly encoded -- no exceptions. Any ideas?
George2
If you want to read the file using XmlDocument.Load, I'd try removing the beginning-of-file marker: the first three bytes, 0xEF 0xBB 0xBF.
ChrisW
@ChrisW, when using XmlDocument.Load, you can see that the invalid bytes are not the first 3. I have uploaded my original file for you to debug: http://www.filefactory.com/file/ag00da3/n/a_xml You can see it is so strange! XmlDocument.Load reports invalid UTF-8 byte sequences, but your method reports the file as a healthy one. :-) Any ideas what is wrong?
George2
catch (Exception e) is a really bad idea.
Anton Tykhyy
@Anton, I agree, and I just want to know what is wrong. Any ideas about how to change the default system behavior (converting invalid characters to '?') and invoke my own replacement fallback function instead (I want to replace all invalid characters with nothing)?
George2
@George2: in this case of catch(Exception), you'll also catch file-not-found, access denied, etc. which is likely not what you need. Re change default behaviour: create your own DecoderReplacementFallback object.
Anton Tykhyy
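As a rough sketch of that suggestion, a DecoderReplacementFallback with an empty replacement string makes the decoder drop invalid bytes instead of substituting a placeholder. The class and method names (DropInvalidBytes, CreateEncoding) are hypothetical, and the choice of an exception fallback on the encoder side is an assumption of this example rather than something stated in the thread.

```csharp
using System.Text;

static class DropInvalidBytes
{
    // Builds a UTF-8 encoding whose decoder replaces each invalid byte
    // sequence with the empty string, i.e. silently removes it.
    public static Encoding CreateEncoding()
    {
        return Encoding.GetEncoding(
            "utf-8",
            new EncoderExceptionFallback(),        // encoding side: assumption, not used for reading
            new DecoderReplacementFallback(""));   // decoding side: drop invalid bytes
    }
}
```

Decoding the bytes 0x41 0xFF 0x42 with DropInvalidBytes.CreateEncoding().GetString(...) gives "AB". The same Encoding object can be passed to a StreamReader constructor to filter a large file while streaming it.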
I totally agree, and I will take your comments on board when I convert my prototype code into production-level code. :-) I have written some code myself; please refer to the EDIT2 section of my original post. I have found that new UTF8Encoding(true, true) throws an exception, but new UTF8Encoding(false, true) does not. I am confused, because it should be the 2nd parameter that controls whether an exception is thrown on invalid byte sequences. Why does the 1st parameter matter?
George2
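Nobody in the thread answers this, so the following is only a plausible explanation, based on how StreamReader appears to behave. When the supplied encoding has no preamble (first argument false) and the file starts with a UTF-8 byte order mark, the reader's BOM detection seems to swap in the default Encoding.UTF8, which replaces invalid bytes rather than throwing. Passing false for detectEncodingFromByteOrderMarks should keep the strict encoding in place. The names BomDetectionDemo and ReadThrows are invented for this sketch.

```csharp
using System.IO;
using System.Text;

static class BomDetectionDemo
{
    // Reads the file to the end and reports whether decoding threw.
    // detectBom is forwarded to StreamReader's detectEncodingFromByteOrderMarks.
    public static bool ReadThrows(string path, Encoding encoding, bool detectBom)
    {
        var buffer = new char[4096];
        try
        {
            using (var reader = new StreamReader(path, encoding, detectBom))
            {
                while (reader.Read(buffer, 0, buffer.Length) > 0) { }
            }
            return false;
        }
        catch (DecoderFallbackException)
        {
            return true;
        }
    }
}
```

On a file that begins with 0xEF 0xBB 0xBF and contains an invalid byte later, the expectation under this explanation is that ReadThrows(path, new UTF8Encoding(false, true), true) reports no exception, while the same encoding with detectBom set to false, or new UTF8Encoding(true, true) with either setting, does throw.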