Imagine that I have a C# application that edits text files. The technique employed for each file can be either:

1) Read the file at once into a string, make the changes, and write the string over the existing file:

string fileContents = File.ReadAllText(fileName);

// make changes to fileContents here...

using (StreamWriter writer = new StreamWriter(fileName))
{
    writer.Write(fileContents);
}

2) Read the file by line, writing the changes to a temp file, then deleting the source and renaming the temp file:

using (StreamReader reader = new StreamReader(fileName))
{
    string line;

    using (StreamWriter writer = new StreamWriter(fileName + ".tmp"))
    {
        while (!reader.EndOfStream)
        {
            line = reader.ReadLine();
            // make changes to line here
            writer.WriteLine(line);
        }
    }
}
File.Delete(fileName);
File.Move(fileName + ".tmp", fileName);

What are the performance considerations with these options?

It seems to me that whether reading by line or reading the entire file at once, the same quantity of data will be read, and disk time will dominate memory-allocation time. That said, once a file is in memory, the OS is free to page it back out, and when it does so the benefit of that large read has been lost. On the other hand, when working with a temporary file, once the handles are closed I need to delete the old file and rename the temp file, which incurs a cost.

Then there are questions around caching, and prefetching, and disk buffer sizes...

I am assuming that in some cases, slurping the file is better, and in others, operating by line is better. My question is, what are the conditions for these two cases?

+2  A: 

in some cases, slurping the file is better, and in others, operating by line is better.

Very nearly; except that reading line-by-line is actually a much more specific case. The actual choices we want to distinguish between are ReadAll and using a buffer. ReadLine makes assumptions - the biggest one being that the file actually has lines, and they are a reasonable length! If we can't make this assumption about the file, we want to choose a specific buffer size and read into that, regardless of whether we've reached the end of a line or not.
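
For illustration, a minimal sketch of that buffered approach might look like the following (the 64 KB buffer size is an arbitrary assumption, fileName is the same variable used in the question, and handling an edit that straddles a chunk boundary is deliberately left out):

char[] buffer = new char[64 * 1024]; // hypothetical size; tune for your workload

using (StreamReader reader = new StreamReader(fileName))
using (StreamWriter writer = new StreamWriter(fileName + ".tmp"))
{
    int charsRead;

    // Read fixed-size chunks until the stream is exhausted,
    // regardless of where (or whether) line breaks occur.
    while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        // transform buffer[0..charsRead) here; an edit spanning a
        // chunk boundary needs extra state this sketch omits
        writer.Write(buffer, 0, charsRead);
    }
}
File.Delete(fileName);
File.Move(fileName + ".tmp", fileName);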

So when deciding between reading it all at once and using a buffer: always go with the easiest-to-implement, most naive approach until you run into a specific situation that does not work for you. Once you have a concrete case, you can make an educated decision based on the information you actually have, rather than speculating about hypothetical situations.

Simplest - read it all at once.

Is performance becoming a problem? Does this application run against uncontrolled files, so their size is not predictable? Those are just a few examples of cases where you would want to chunk it.

Rex M
The observation that reading line-by-line is really a special case of buffered reads is astute. Thank you!
fatcat1111