I have a large XML file (approx. 10 MB) with the following simple structure:

<Errors>
   <Error>.......</Error>
   <Error>.......</Error>
   <Error>.......</Error>
   <Error>.......</Error>
   <Error>.......</Error>
</Errors>

I need to add a new <Error> node at the end, just before the </Errors> tag. What is the fastest way to achieve this in .NET?

A: 

The quickest method is likely to be reading the file with an XmlReader and simply copying each node you read to a new stream with an XmlWriter. When you reach the closing </Errors> tag, output your additional <Error> element and then continue the 'read and duplicate' cycle. This approach is inevitably harder than reading the entire document into the DOM (XmlDocument class), but for large XML files it is much quicker. Admittedly, using StreamReader/StreamWriter would be somewhat faster still, but pretty horrible to work with in code.
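A minimal sketch of that read-and-duplicate cycle, under some assumptions: the class name ErrorLogAppender and the method AppendError are hypothetical, the new element is assumed to be named Error with plain text content, and whitespace is re-indented instead of preserved. Anything that follows the root end tag (trailing comments, for instance) is not copied.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of the original answer.
public static class ErrorLogAppender
{
    // Streams the XML from 'input' to 'output' and adds one <Error> element
    // just before the end tag of the root element.
    public static void AppendError(TextReader input, TextWriter output, string message)
    {
        var readerSettings = new XmlReaderSettings { IgnoreWhitespace = true };
        var writerSettings = new XmlWriterSettings { Indent = true };

        using (XmlReader reader = XmlReader.Create(input, readerSettings))
        using (XmlWriter writer = XmlWriter.Create(output, writerSettings))
        {
            // Skip the XML declaration and position on the root element.
            reader.MoveToContent();
            bool rootIsEmpty = reader.IsEmptyElement;

            // Re-create the root start tag, including any attributes.
            writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
            writer.WriteAttributes(reader, false);

            if (!rootIsEmpty)
            {
                reader.Read(); // move to the first child of the root
                // WriteNode copies the current node with its subtree and
                // advances the reader, so an EndElement seen here can only
                // be the root's own end tag.
                while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
                {
                    writer.WriteNode(reader, false);
                }
            }

            // The additional element; the writer escapes the text content.
            writer.WriteElementString("Error", message);
            writer.WriteEndElement(); // closes the root element
        }
    }
}
```

With files, this could be called with a StreamReader over the existing log and a StreamWriter over a temporary file that replaces the original afterwards; only one node at a time is held in memory.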

Noldorin
A: 

How is your XML file represented in code? Do you use the System.Xml classes? In that case you could use XmlDocument.AppendChild.

Dario
+4  A: 

First, I would disqualify System.Xml.XmlDocument because it is a DOM which requires parsing and building the entire tree in memory before it can be appended to. This means your 10 MB of text will be more than 10 MB in memory. This means it is "memory intensive" and "time consuming".

Second, I would disqualify System.Xml.XmlReader because it requires parsing the entire file before you reach the point where you can append to it. You would have to copy the XmlReader into an XmlWriter since you can't modify it. This requires duplicating your XML in memory first before you can append to it.

A faster alternative to XmlDocument and XmlReader would be string manipulation (which has its own memory issues):

string xml = @"<Errors><error />...<error /></Errors>";
int idx = xml.LastIndexOf("</Errors>");

xml = xml.Substring(0, idx) + "<error>new error</error></Errors>";

Chop off the end tag, add in the new error, and add the end tag back.

I suppose you could go crazy with this and truncate your file by 9 characters and append to it. Wouldn't have to read in the file and would let the OS optimize page loading (only would have to load in the last block or something).

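// FileStream.Write takes bytes, so encode the new tail first.
byte[] tail = System.Text.Encoding.UTF8.GetBytes("<error>new error</error></Errors>");
using (System.IO.FileStream fs = System.IO.File.Open("log.xml", System.IO.FileMode.Open, System.IO.FileAccess.ReadWrite))
{
    // Assumes the file ends exactly with "</Errors>" (no trailing newline).
    fs.Seek(-"</Errors>".Length, System.IO.SeekOrigin.End);
    fs.Write(tail, 0, tail.Length);
}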

That will hit a problem if your file is empty or contains only "<Errors></Errors>", both of which can easily be handled by checking the length.
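The length check mentioned above can be combined with a check that the file really ends with the closing tag. The following is only a sketch under stated assumptions: TailAppender and TryAppend are hypothetical names, the file is assumed to use an ASCII-compatible encoding such as UTF-8, and the caller is assumed to pass an already escaped <Error> element. Trailing whitespace after the closing tag is tolerated and dropped.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class TailAppender
{
    // Overwrites the trailing "</Errors>" with 'errorXml' followed by a new
    // closing tag. Returns false and leaves the file untouched when the file
    // does not end with the expected tag.
    public static bool TryAppend(string path, string errorXml)
    {
        byte[] closing = Encoding.UTF8.GetBytes("</Errors>");

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            // Scan backwards over trailing spaces, tabs and line breaks.
            long end = fs.Length;
            while (end > 0)
            {
                fs.Seek(end - 1, SeekOrigin.Begin);
                int b = fs.ReadByte();
                if (b != ' ' && b != '\t' && b != '\r' && b != '\n') break;
                end--;
            }

            // Too short to contain the closing tag (covers the empty file).
            if (end < closing.Length) return false;

            // Read the bytes where the closing tag is expected to be.
            long tagStart = end - closing.Length;
            byte[] tail = new byte[closing.Length];
            fs.Seek(tagStart, SeekOrigin.Begin);
            int read = 0;
            while (read < tail.Length)
            {
                int n = fs.Read(tail, read, tail.Length - read);
                if (n == 0) return false;
                read += n;
            }
            for (int i = 0; i < closing.Length; i++)
            {
                if (tail[i] != closing[i]) return false;
            }

            // Overwrite the old closing tag with the new entry and a new tag.
            byte[] replacement = Encoding.UTF8.GetBytes(
                errorXml + Environment.NewLine + "</Errors>");
            fs.Seek(tagStart, SeekOrigin.Begin);
            fs.Write(replacement, 0, replacement.Length);
            fs.SetLength(fs.Position); // cut off anything left after the new tag
            return true;
        }
    }
}
```

A call such as TailAppender.TryAppend("log.xml", "<Error>new error</Error>") would then either append the entry or report that the file does not have the expected shape, for example when the root is written as <Errors/>.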

Colin Burnett
OpenText() opens a file for reading and returns a StreamReader.
Daniel Brückner
Indeed, thanks. Fixed?
Colin Burnett
+3  A: 

The fastest way would probably be a direct file access.

using (StreamWriter file = File.AppendText("my.log"))
{
    file.BaseStream.Seek(-"</Errors>".Length, SeekOrigin.End);
    file.Write("   <Error>New error message.</Error></Errors>");
}

But you lose all the nice XML features and may easily corrupt the file.

Daniel Brückner
That's what I would have suggested too.
Joey Robert
I'm attempting this but get an 'Unable seek backward to overwrite data that previously existed in a file opened in Append mode.' error on the .Seek line. Is the example correct?
Simon
No, the example is not correct, but all you need to do to get it working is replace 'File.AppendText(...)' with 'new StreamWriter(File.Open(filePath, FileMode.Open, FileAccess.Write))'
Hermann
A: 

I would use XmlDocument or XDocument to Load your file and then manipulate it accordingly.

I would then look at the possibility of caching this XmlDocument in memory so that you can access the file quickly.

What do you need the speed for? Do you have a performance bottleneck already or are you expecting one?

Robin Day
XmlDocument is a DOM model, which is slower than a SAX-like streaming model such as XmlReader. XmlDocument would require representing the entire 10 MB in memory as objects (so more than 10 MB total). XmlReader would be faster (I'm fairly certain XmlDocument is built on XmlReader) but you still have to parse the entire document. Neither, to me, qualifies as "fast" if all Ramesh is doing is appending to a log file (which appears to be the case).
Colin Burnett
I totally agree, but I would always avoid writing XML with text appends. My answer was to find out if he could load the document into memory and then write to that. That would be fast. Then another process that writes out the XmlDocument to the file occasionally. It all depends on the scenario.
Robin Day
A: 

Try this out:

        var doc = new XmlDocument();
        doc.LoadXml("<Errors><error>This is my first error</error></Errors>");

        XmlNode root = doc.DocumentElement;

        //Create a new node.
        XmlElement elem = doc.CreateElement("error");
        elem.InnerText = "This is my error";

        //Add the node to the document.
        if (root != null) root.AppendChild(elem);

        doc.Save(Console.Out);
        Console.ReadLine();
Jason Heine
This is definitely not the fastest way.
Hermann
A: 

Here's how to do it in C, .NET should be similar.

The game is to simply jump to the end of the file, skip back over the tag, append the new error line, and write a new tag.

#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(int argc, char** argv) {
        FILE *f;

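        // Open the file for reading and writing without truncating it
        f = fopen("log.xml", "r+");
        if (f == NULL) {
                perror("log.xml");
                return 1;
        }

        // Small buffer to determine the length of \n. Note that strlen()
        // always reports 1 here; if the file uses \r\n line endings,
        // the offset below has to account for the extra byte.
        char nlbuf[10];
        sprintf(nlbuf, "\n");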

        // How long is our end tag?
        long offset = strlen("</Errors>");

        // Add in an \n char.
        offset += strlen(nlbuf);

        // Seek to the END OF FILE, and then GO BACK the end tag and newline
        // so we use a NEGATIVE offset.
        fseek(f, offset * -1, SEEK_END);

        // Print out your new error line
        fprintf(f, "<Error>New error line</Error>\n");

        // Print out new ending tag.
        fprintf(f, "</Errors>\n");

        // Close and you're done
        fclose(f);
}
Will Hartung
+5  A: 

You need to use the XML inclusion technique.

Your error.xml (doesn't change, just a stub. Used by XML parsers to read):

<?xml version="1.0"?>
<!DOCTYPE logfile [
<!ENTITY logrows    
 SYSTEM "errorrows.txt">
]>
<Errors>
&logrows;
</Errors>

Your errorrows.txt file (changes, the xml parser doesn't understand it):

<Error>....</Error>
<Error>....</Error>
<Error>....</Error>

Then, to add an entry to errorrows.txt:

using (StreamWriter sw = File.AppendText("errorrows.txt"))
{
    XmlTextWriter xtw = new XmlTextWriter(sw);

    xtw.WriteStartElement("Error");
    // ... write error message here
    // Close() also writes the end tag of any element still open.
    xtw.Close();
}

Or you can even use .NET 3.5 XElement, and append the text to the StreamWriter:

using (StreamWriter sw = File.AppendText("errorrows.txt"))
{
    XElement element = new XElement("Error");
    // ... write error message here
    sw.WriteLine(element.ToString());
}

See also Microsoft's article Efficient Techniques for Modifying Large XML Files

taoufik
A: 

Using string-based techniques (like seeking to the end of the file and then moving backwards the length of the closing tag) is vulnerable to unexpected but perfectly legal variations in document structure.

The document could end with any amount of whitespace, to pick the likeliest problem you'll encounter. It could also end with any number of comments or processing instructions. And what happens if the top-level element isn't named Error?

And here's a situation that using string manipulation fails utterly to detect:

<Error xmlns="not_your_namespace">
   ...
</Error>

If you use an XmlReader to process the XML, while it may not be as fast as seeking to EOF, it will also allow you to handle all of these possible exception conditions.

Robert Rossney
The file he's presented looks like a log file and I presume he's hitting a point where it's increasingly slower to append to it, hence his question. Suffice to say that I think the log format is entirely under his control.
Colin Burnett
It can often be perfectly fine to make those assumptions. I've had to fix a lot of code where the developer guessed wrong, though. In most of those cases, the developer didn't even know he was guessing.
Robert Rossney