tags:

views:

2485

answers:

5

I have a process which picks up a series of "xml" files. The reason I put xml in quotes is that that the text in the file does not have a root element which makes in invalid xml. In my processing, I want to correct this and open up each file add a root node to the beginning and end of each file, and then close it up. Here is what I had in mind, but this involves opening the file, reading the entire file, tagging on the nodes, and then writing the entire file out. These files may be more than 20 MB in size.

        foreach (FileInfo file in files)
        {
            //open the file
            StreamReader sr = new StreamReader(file.FullName);

            // add the opening and closing tags
            string text = "<root>" + sr.ReadToEnd() + "<root>";
            sr.Close();

            // now open the same file for writing
            StreamWriter sw = new StreamWriter(file.FullName, false);
            sw.Write(text);
            sw.Close();
        }

Any recommendations?

+4  A: 

I can't see any real improvement on this...which is kind of a bummer. Since there's no way to "shift" a file you'll always have to move the bytes in the entire file to inject anything at the top.

You may find some performance benefit by using raw streams rather than the StreamReader which has to actually parse the stream as text.

Paul Alexander
+13  A: 

To avoid holding the whole file in memory, rename the original file, then open it with StreamReader. Then open the original filename with StreamWriter to create a new file.

Write the <root> prefix to the file, then copy data in large-ish chunks from the reader to the writer. When you've transferred all the data, write the closing </root> (note the forward slash if you want it to be XML). Then close both files and delete the renamed original.

char[] buffer = new char[10000];

string renamedFile = file.FullName + ".orig";
File.Move(file.FullName, renamedFile);

using (StreamReader sr = new StreamReader(renamedFile))
using (StreamWriter sw = new StreamWriter(file.FullName, false))
{
    sw.Write("<root>");

    int read;
    while ((read = sr.Read(buffer, 0, buffer.Length)) > 0)
        sw.Write(buffer, 0, read);

    sw.Write("</root>");
}

File.Delete(renamedFile);
Daniel Earwicker
This is a good improvment, but it's still worth noting that raw Stream objects will perform better than the StreamReader/Writer classes.
Paul Alexander
+3  A: 

If you do not want to do this is C#, it would be easy to handle at the commandline or in a batch file.

ECHO ^<root^> > outfile.xml
TYPE temp.xml >> outfile.xml
ECHO ^</root^> >> outfile.xml

This would assume that you have some existing process for getting the data files that this could be hooked into.

Jack Bolding
The code to manage the call to this batch file into the enclosing C# program would likely be significantly longer than the code to do the same thing directly within C#.
Daniel Earwicker
Always assuming that the batch processing cannot be added somewhere else in the process of getting the file, Process p = Process.Start("fixup.bat", "temp.xml"); p.WaitForExit(); Doesn't seem that bad to me. The flicker of the command window does get annoying.
Jack Bolding
You also need to tell the batch file which files to look at, which means you have to build the command line string, with the correct quoting, and you'd need to figure out the path to the batch file, and you need to delete the temporary file (one line, wherever you put it). I'm amazed you'd even consider this option! It would be trivial to write some neat little helper methods that would allow you to express this kind of stream concatenation elegantly within C#, and so have no process creation overhead at all.
Daniel Earwicker
C# is a great hammer, isn't it? You've assumed that the file magically gets there and there is no place in the process of obtaining the file that you cannot add in the batch file. BTW it is trivial to create and delete the temp file in the actual batch file. No need to clutter the C# code with it.
Jack Bolding
I'm assuming, given the question title, that this guy is writing a C# app. If your application is in C#, it makes sense to avoid a process creation, synchronisation and all the additional complexity that entails, to fire up a different language without a good reason. I'd argue that in this context, you are looking for an excuse to use batch files, and haven't found one. Batch files, generally speaking, are a severely underpowered programming environment.
Daniel Earwicker
No, not looking for an excuse to use batch file at all, just proposing an alternative to doing it strictly in C#. You have already covered the in C# answer quite well (I even upvoted it) but to believe that it will cover all situations seems a bit obtuse. It does seem possible that someone else may find this question and find that my solution is perhaps even useful...Now if I could only do it with LINQ, I bet I would get many upvotes;->
Jack Bolding
I'm sure it will be useful - especially to anyone thinking of writing a whole C# app just to do this one task, which I agree would be a total waste of time. Batch or shell tools are an ideal and ready-to-use alternative, if all you're doing is one-shot stream processing and nothing else. But if you're already working in C#, and doing this kind of thing a lot, just write a simple helper class that is based on List<Stream> and you will be able to make the C# version look as elegant (probably more so) than the batch file version. :)
Daniel Earwicker
Batch files. Yuck. How about a shell script? One line, and no temp files:vim -e -s -c $'1 i\n<root>\n.\n$ a\n</root>\n.\nwq' fileToXmlify.txt
Eric Smith
+2  A: 

20 MB is not terribly much, but when you read it as a string, it will use about 40 MB of memory. That's not terribly much either, but it's processing that you don't need to do. You can handle it as raw bytes to reduce the memory usage, and to avoid decoding and re-encoding the data:

byte[] start = Encoding.UTF8.GetBytes("<root>");
byte[] ending = Encoding.UTF8.GetBytes("</root>");

byte[] data = File.ReadAllBytes(file.FullName);

int bom = (data[0] == 0xEF) ? 3 : 0;

using (FileStream s = File.Create(file.FullName)) {
   if (bom > 0) {
      s.Write(data, 0, bom);
   }
   s.Write(start, 0, start.Length);
   s.Write(data, bom, data.Length - bom);
   s.Write(ending, 0, ending.Length);
}

If you need to recude the memory usage much more, use a second file as Earwicker suggested.

Edit:
Added code to handle BOM (byte order mark).

Guffa
"40 MB of memory ... That's not terribly much either ..." *cough*.
Eric Smith
A: 

Isn't there a way to open the file, insert a node at the very begining of the file then move the file pointer to the end of the file and write another node, without reading the file contents into a string?!

anon