Hello everyone,
I have a large input file (about 120 MB) and I do not want to load it into memory all at once. I want to check whether the file is valid UTF-8. Is there a quick way to check this without reading the whole file content into memory as a byte[]? Simple sample code would be appreciated.
I am using VSTS 2008 + C#.
Thanks in advance, George
I find it very strange: when I use XmlDocument to load an XML document that contains invalid byte sequences, an exception is thrown, but when I read the whole content into a byte array and decode it as UTF-8, there is no exception. Any ideas?
Here is the content of my XML file:
http://i42.tinypic.com/wioc9c.jpg
You can download the file from:
http://www.filefactory.com/file/ag00da3/n/a_xml
EDIT1:
using System;
using System.IO;
using System.Text;
using System.Xml;

class Program
{
    // Reads the whole file into a byte array (what I want to avoid for big files).
    public static byte[] RawReadingTest(string fileName)
    {
        byte[] buff = null;
        try
        {
            // The using statements close the stream and reader even if reading fails.
            using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
            using (BinaryReader br = new BinaryReader(fs))
            {
                buff = br.ReadBytes((int)fs.Length);
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
        return buff;
    }

    // Loads the file as XML; this reports an exception for the invalid bytes.
    static void XMLTest()
    {
        try
        {
            XmlDocument xDoc = new XmlDocument();
            xDoc.Load("c:\\abc.xml");
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }

    static void Main()
    {
        try
        {
            XMLTest();
            Encoding ae = Encoding.GetEncoding("utf-8");
            string filename = "c:\\abc.xml";
            // Decodes the raw bytes as UTF-8; no exception is thrown here.
            ae.GetString(RawReadingTest(filename));
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
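To make the difference in EDIT1 easier to reproduce without the XML file, here is a minimal sketch. The class name StrictVersusLenient, the helper Throws, and the three-byte sample are made up for illustration. It relies on the encoding returned by Encoding.GetEncoding("utf-8") replacing invalid bytes with U+FFFD, while a UTF8Encoding constructed with throwOnInvalidBytes set to true raises DecoderFallbackException instead.

```csharp
using System;
using System.Text;

static class StrictVersusLenient
{
    // Returns true if decoding the bytes throws, false if they decode silently.
    public static bool Throws(Encoding encoding, byte[] data)
    {
        try
        {
            encoding.GetString(data);
            return false;
        }
        catch (DecoderFallbackException)
        {
            return true;
        }
    }

    public static void Demo()
    {
        // Hypothetical sample: the byte 0xFF can never appear in valid UTF-8.
        byte[] bad = { 0x61, 0xFF, 0x62 };

        // Default UTF-8 encoding: the bad byte is replaced, nothing is thrown.
        Console.WriteLine(Throws(Encoding.GetEncoding("utf-8"), bad)); // False

        // Second constructor argument (throwOnInvalidBytes) is true: decoding throws.
        Console.WriteLine(Throws(new UTF8Encoding(false, true), bad)); // True
    }
}
```

This would suggest that the byte-array path in Main stays silent because of the encoding instance it uses, not because the file is valid.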
EDIT2: I have found that with new UTF8Encoding(true, true) an exception is thrown, but with new UTF8Encoding(false, true) no exception is thrown. I am confused, because it should be the second parameter that controls whether an exception is thrown for invalid byte sequences. Why does the first parameter matter?
public static void TestTextReader2()
{
    try
    {
        // Create an instance of StreamReader to read from a file.
        // The using statement also closes the StreamReader.
        using (StreamReader sr = new StreamReader(
            "c:\\a.xml",
            new UTF8Encoding(true, true)))
        {
            int bufferSize = 10 * 1024 * 1024; // could be anything
            char[] buffer = new char[bufferSize];
            // Read from the file until the end of the file is reached.
            int actualsize = sr.Read(buffer, 0, bufferSize);
            while (actualsize > 0)
            {
                actualsize = sr.Read(buffer, 0, bufferSize);
            }
        }
    }
    catch (Exception e)
    {
        // Let the user know what went wrong.
        Console.WriteLine("The file could not be read:");
        Console.WriteLine(e.Message);
    }
}
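For comparison, a chunked check that works on the raw bytes could be sketched roughly as follows. This is only a sketch under assumptions: the class name Utf8Check, the method IsValidUtf8, and the 64 KB buffer size are invented for illustration. It uses a Decoder obtained from a throwing UTF8Encoding, so only one buffer is in memory at a time, and StreamReader's byte-order-mark handling is not involved.

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8Check
{
    // Returns true if the whole stream decodes as UTF-8, reading it in chunks.
    public static bool IsValidUtf8(Stream input)
    {
        // Second argument true: invalid bytes raise DecoderFallbackException.
        Decoder decoder = new UTF8Encoding(false, true).GetDecoder();
        byte[] bytes = new byte[64 * 1024]; // chunk size is arbitrary
        char[] chars = new char[Encoding.UTF8.GetMaxCharCount(bytes.Length)];
        try
        {
            int read;
            while ((read = input.Read(bytes, 0, bytes.Length)) > 0)
            {
                // flush = false: a sequence split across two chunks is carried over.
                decoder.GetChars(bytes, 0, read, chars, 0, false);
            }
            // flush = true: a sequence cut off at the end of the input is an error.
            decoder.GetChars(bytes, 0, 0, chars, 0, true);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

It could be called as Utf8Check.IsValidUtf8(File.OpenRead("c:\\a.xml")), ideally inside a using statement so the file is closed. The decoded characters are discarded; the decoder is used only for its validation side effect.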