Hello again,

I've got the lovely task of working out how to handle large files being loaded into our application's script editor (it's like VBA for our internal product, used for quick macros). Most files are about 300-400 KB, which load fine. But when they go beyond 100 MB, the process has a hard time, as you'd expect.

What happens is that the file is read and shoved into a RichTextBox which is then navigated - don't worry too much about this part.

The developer who wrote the initial code is simply using a StreamReader and doing [Reader].ReadToEnd() which could take quite a while to complete.

My task is to break this bit of code up, read it in chunks into a buffer, and show a progress bar with an option to cancel the load.

Some assumptions:

  • Most files will be 30-40 MB
  • The contents of the files are text (not binary); some are UNIX format, some are DOS.
  • Once the contents are retrieved, we work out which line terminator is used.
  • No one is concerned about the time it takes to render in the RichTextBox once it's loaded; it's just the initial load of the text.

Now for the questions:

  • Can I simply use StreamReader, check the Length property (to set the progress bar's maximum), and issue a Read for a set buffer size, iterating in a while loop inside a BackgroundWorker so it doesn't block the main UI thread? Then return the StringBuilder to the main thread once it's completed.
  • The contents will be going into a StringBuilder. Can I initialise the StringBuilder with the size of the stream if the length is available?

Are these (in your professional opinions) good ideas? I've had a few issues in the past with reading content from Streams because it would always miss the last few bytes or something, but I'll ask another question if that turns out to be the case. Roughly, what I'm picturing is the sketch below.
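To make that concrete, here is an untested sketch of the loop I have in mind (the progressBar and scriptEditor controls, the buffer size, and the file path are just placeholders, not our real code):

// Untested sketch of the chunked read I have in mind; names are placeholders.
// Requires System.ComponentModel, System.IO, System.Text, System.Windows.Forms.
private void LoadFile(string path)
{
    var worker = new BackgroundWorker
    {
        WorkerReportsProgress = true,
        WorkerSupportsCancellation = true
    };

    worker.DoWork += (s, e) =>
    {
        using (var reader = new StreamReader(path))
        {
            long length = reader.BaseStream.Length;
            // Pre-size the builder; the stream length is in bytes, not chars,
            // so this is only an estimate for multi-byte encodings.
            var sb = new StringBuilder((int)length);
            var buffer = new char[8192];
            long totalRead = 0;
            int read;

            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                if (worker.CancellationPending) { e.Cancel = true; return; }
                sb.Append(buffer, 0, read);
                totalRead += read;
                // Approximate progress: chars read vs. bytes in the stream.
                worker.ReportProgress((int)Math.Min(100, totalRead * 100 / length));
            }

            e.Result = sb;
        }
    };

    worker.ProgressChanged += (s, e) => progressBar.Value = e.ProgressPercentage;
    worker.RunWorkerCompleted += (s, e) =>
    {
        if (!e.Cancelled && e.Error == null)
            scriptEditor.Text = ((StringBuilder)e.Result).ToString();
    };

    worker.RunWorkerAsync();
}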

+1  A: 

Use a background worker and read only a limited number of lines; read more only when the user scrolls.

And try to never use ReadToEnd(). It's one of those functions that makes you wonder "why did they make it?" - a script-kiddie helper that's fine for small things, but as you've seen, it sucks for large files...
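As a rough illustration of the limited-lines idea (just a sketch, not code from the question; the method name and page size are made up), you can keep the reader open and hand back one page of lines at a time:

// Sketch: pull lines on demand instead of reading the whole file up front.
// Requires System.Collections.Generic and System.IO.
static List<string> ReadNextPage(StreamReader reader, int pageSize)
{
    var lines = new List<string>(pageSize);
    string line;
    while (lines.Count < pageSize && (line = reader.ReadLine()) != null)
        lines.Add(line);
    return lines; // an empty list means end of file
}

The caller keeps the StreamReader alive and calls ReadNextPage again whenever the user scrolls near the end of what has already been shown.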

EDIT
Those guys telling you to use StringBuilder need to read the MSDN more often:

Performance Considerations
The Concat and AppendFormat methods both concatenate new data to an existing String or StringBuilder object. A String object concatenation operation always creates a new object from the existing string and the new data. A StringBuilder object maintains a buffer to accommodate the concatenation of new data. New data is appended to the end of the buffer if room is available; otherwise, a new, larger buffer is allocated, data from the original buffer is copied to the new buffer, then the new data is appended to the new buffer. The performance of a concatenation operation for a String or StringBuilder object depends on how often a memory allocation occurs.
A String concatenation operation always allocates memory, whereas a StringBuilder concatenation operation only allocates memory if the StringBuilder object buffer is too small to accommodate the new data. Consequently, the String class is preferable for a concatenation operation if a fixed number of String objects are concatenated. In that case, the individual concatenation operations might even be combined into a single operation by the compiler. A StringBuilder object is preferable for a concatenation operation if an arbitrary number of strings are concatenated; for example, if a loop concatenates a random number of strings of user input.

That means a huge amount of memory allocation, which leads to heavy use of the swap file - the system uses sections of your HDD to act like RAM, but the HDD is very slow. The StringBuilder option looks fine when the system has a single user, but when you have two or more users reading large files at the same time, you have a problem.

Tufo
Far out, you guys are super quick! Unfortunately, because of the way the macros work, the entire stream needs to be loaded. As I mentioned, don't worry about the rich text part; it's the initial loading we're wanting to improve.
Nicole Lee
So you can work in parts: read the first X lines, apply the macro, read the second X lines, apply the macro, and so on... If you explain what the macro does, we can help you with more precision.
Tufo
A: 

You might be better off using memory-mapped file handling here. Memory-mapped file support will be built into .NET 4 (I think... I heard that through someone else talking about it), hence this wrapper which uses P/Invoke to do the same job.

Edit: See here on the MSDN for how it works, and here's the blog entry describing how it will be done in .NET 4 when it is released. The link I gave earlier is a wrapper around the P/Invoke calls to achieve this. You can map the entire file into memory and view it like a sliding window as you scroll through the file.
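For what it's worth, here's a minimal sketch of the .NET 4 API the blog entry describes (the path handling, window size, and encoding are placeholder assumptions; on .NET 3.5 you would go through the P/Invoke wrapper instead):

// Sketch: map the file, then decode only the window currently needed.
// Requires System, System.IO, System.IO.MemoryMappedFiles, System.Text (.NET 4).
static string ReadWindow(string path, long offset, int windowSize)
{
    long fileLength = new FileInfo(path).Length;
    int count = (int)Math.Min(windowSize, fileLength - offset);

    using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
    using (var view = mmf.CreateViewStream(offset, count, MemoryMappedFileAccess.Read))
    {
        var bytes = new byte[count];
        int read = view.Read(bytes, 0, bytes.Length);
        // Pick the encoding that matches your files; the question says plain text.
        return Encoding.Default.GetString(bytes, 0, read);
    }
}

The editor would call this with a new offset as the user scrolls, rather than ever holding the whole file as a string.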

Hope this helps, Best regards, Tom.

tommieb75
+1  A: 

Have a look at the following code snippet. You mentioned that most files will be 30-40 MB; the original article claims this reads a 180 MB text file in 1.4 seconds on an Intel Quad Core:

private int _bufferSize = 16384;

private string ReadFile(string filename)
{
    StringBuilder stringBuilder = new StringBuilder();

    using (FileStream fileStream = new FileStream(filename, FileMode.Open, FileAccess.Read))
    using (StreamReader streamReader = new StreamReader(fileStream))
    {
        char[] fileContents = new char[_bufferSize];
        int charsRead = streamReader.Read(fileContents, 0, _bufferSize);

        // Can't do much with an empty file
        if (charsRead == 0)
            throw new Exception("File is 0 bytes");

        while (charsRead > 0)
        {
            // Append only the characters actually read on this pass,
            // otherwise the final chunk drags in stale buffer contents.
            stringBuilder.Append(fileContents, 0, charsRead);
            charsRead = streamReader.Read(fileContents, 0, _bufferSize);
        }
    }

    return stringBuilder.ToString();
}

Original Article

James
@James: to quote from the original article, 'This example reads a 180mb text file in 1.4 seconds on an Intel Quad Core'... I think you should edit your answer to put that in... mileage will vary...
tommieb75
@tommie, ah good spot will do.
James
These kinds of tests are notoriously unreliable. You'll read data from the file system cache when you repeat the test. That's at least one order of magnitude faster than a real test that reads the data off the disk. A 180 MB file cannot possibly take less than 3 seconds. Reboot your machine and run the test once for the real number.
Hans Passant
+2  A: 

This should be enough to get you started.

using System;
using System.IO;
using System.Text;

class Program
{        
    static void Main(String[] args)
    {
        const int bufferSize = 1024;

        var sb = new StringBuilder();
        var buffer = new Char[bufferSize];
        var length = 0L;
        var totalRead = 0L;
        var count = bufferSize; 

        using (var sr = new StreamReader(@"C:\Temp\file.txt"))
        {
            // Length of the underlying stream: use it as the progress maximum.
            length = sr.BaseStream.Length;               
            while (count > 0)
            {                    
                count = sr.Read(buffer, 0, bufferSize);
                sb.Append(buffer, 0, count);
                // totalRead / length gives the fraction to report to the UI.
                totalRead += count;
            }                
        }

        Console.ReadKey();
    }
}
ChaosPandion
I would move the "var buffer = new char[1024]" out of the loop: it's not necessary to create a new buffer each time. Just put it before "while (count > 0)".
Tommy Carlier
Good point, gotta keep the GC happy.
ChaosPandion
+4  A: 

You say you have been asked to show a progress bar while a large file is loading. Is that because the users genuinely want to see the exact percentage of the file that has loaded, or just because they want visual feedback that something is happening?

If the latter is true, then the solution becomes much simpler. Just do reader.ReadToEnd() on a background thread, and display a marquee-type progress bar instead of a proper one.

I raise this point because in my experience this is often the case. When you are writing a data processing program, then users will definitely be interested in a % complete figure, but for simple-but-slow UI updates, they are more likely to just want to know that the computer hasn't crashed. :-)
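As a rough sketch of that simpler route (WinForms assumed; `progressBar` and `scriptEditor` are placeholder control names, not from the question):

// Sketch: ReadToEnd on a background thread with a marquee-style progress bar.
// Requires System.ComponentModel, System.IO, and System.Windows.Forms.
private void LoadFileSimple(string path)
{
    progressBar.Style = ProgressBarStyle.Marquee;   // just "something is happening"
    var worker = new BackgroundWorker();

    worker.DoWork += (s, e) =>
    {
        using (var reader = new StreamReader(path))
            e.Result = reader.ReadToEnd();
    };

    worker.RunWorkerCompleted += (s, e) =>
    {
        progressBar.Style = ProgressBarStyle.Blocks;
        if (e.Error == null)
            scriptEditor.Text = (string)e.Result;
    };

    worker.RunWorkerAsync();
}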

Christian Hayter
Sagely Advice...
ChaosPandion
But can the user cancel out of the ReadToEnd call?
Tim
@Tim, well spotted. In that case, we're back to the `StreamReader` loop. However, it will still be simpler because there's no need to read ahead to calculate the progress indicator.
Christian Hayter
A: 

I know I am a little late on this one, but an iterator might be perfect for this type of work:

public static IEnumerable<int> LoadFileWithProgress(string filename, StringBuilder stringData)
{
    const int charBufferSize = 4096;
    using (FileStream fs = File.OpenRead(filename))
    using (BinaryReader br = new BinaryReader(fs))
    {
        long length = fs.Length;
        // Rough chunk count, used only to scale the progress percentage.
        int numberOfChunks = Convert.ToInt32(length / charBufferSize) + 1;
        double iter = 100 / Convert.ToDouble(numberOfChunks);
        double currentIter = 0;
        yield return Convert.ToInt32(currentIter);
        while (true)
        {
            // ReadChars returns a shorter (or empty) array at end of file.
            char[] buffer = br.ReadChars(charBufferSize);
            if (buffer.Length == 0) break;
            stringData.Append(buffer);
            currentIter += iter;
            yield return Convert.ToInt32(currentIter);
        }
    }
}

You can call it using the following:

string filename = "C:\\myfile.txt";
StringBuilder sb = new StringBuilder();
foreach (int progress in LoadFileWithProgress(filename, sb))
{
    // Update your progress counter here!
}
string fileData = sb.ToString();

As the file is loaded, the iterator will return the progress number from 0 to 100, which you can use to update your progress bar. Once the loop has finished, the StringBuilder will contain the contents of the text file.

Also, because you want text, we can just use a BinaryReader to read in characters, which will ensure that your buffers line up correctly when reading any multi-byte characters (UTF-8, UTF-16, etc.).

This is all done without using background tasks, threads, or complex custom state machines.

Extremeswank