I'm trying to determine the fastest way to read in large text files with many rows, do some processing, and write them to a new file. In C#/.NET, StreamReader appears to be a quick way of doing this, but when I use it on this file (reading line by line), it runs at about 1/3 the speed of Python's I/O (which worries me, because I keep hearing that Python 2.6's I/O is relatively slow).

If there isn't a faster .NET solution for this, would it be possible to write something faster than StreamReader, or does it already use buffering and optimizations that I could never hope to beat?

+2  A: 

StreamReader is pretty good - how were you reading it in Python? It's possible that if you specify a simpler encoding (e.g. ASCII) then that may speed things up. How much CPU is the process taking?

You can increase the buffer size by using the appropriate StreamReader constructor, but I have no idea how much difference that's likely to make.
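A minimal sketch of both suggestions, assuming the file really is plain ASCII and that each line is written straight back out. The class name LineCopy, the method CopyLines, and the 64 KB buffer size are illustrative choices, not measured optima; time it against your own data before trusting any particular value.

```csharp
using System;
using System.IO;
using System.Text;

static class LineCopy
{
    // Copies inputPath to outputPath line by line and returns the line count.
    // ASCII and the 64 KB default are assumptions made for illustration only.
    public static long CopyLines(string inputPath, string outputPath, int bufferSize = 64 * 1024)
    {
        long count = 0;
        // false = do not sniff a byte-order mark; the encoding is given explicitly.
        using (var reader = new StreamReader(inputPath, Encoding.ASCII, false, bufferSize))
        using (var writer = new StreamWriter(outputPath, false, Encoding.ASCII, bufferSize))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Per-line processing would go here.
                // WriteLine appends Environment.NewLine, so line endings may be normalized.
                writer.WriteLine(line);
                count++;
            }
        }
        return count;
    }
}
```

Calling LineCopy.CopyLines("in.txt", "out.txt") with a few different bufferSize values is a quick way to see whether the buffer size matters at all for your file.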

Jon Skeet
I would expect that increasing the buffer size of his StreamWriter (presumably he's using one) would make a pretty good difference, though.
P Daddy
+3  A: 

Do you have a code sample of what you're doing, or the format of the file you are reading?

Another good question would be how much of the stream are you keeping in memory at a time?

N8
A: 

A general note:

  1. High performance streaming isn't complicated. You usually have to modify the logic that uses the streamed data; that's complicated.

Actually, that's it.

MSN
A: 

Sorry, I'm not a .NET guru, but in C/C++, if you have nice big buffers, you should be able to parse the input with an LL(1) parser not much slower than you can scan the bytes. I can give more detail if you want.

Mike Dunlavey
A: 

Try BufferedReader and BufferedWriter to speed up processing.

pro
I think they are Java classes. StreamReader for .Net is already buffered.
GvS
Yes, those are indeed Java classes, he's looking for a fix in C#. If it was Java I'd recommend the same thing.
Forrest Marvez
A: 

The default buffer sizes used by StreamReader/FileStream may not be optimal for the record lengths in your data, so you can try tweaking them. You can override the default buffer lengths in the constructors to both FileStream and the StreamReader which wraps it. You should probably make them the same size.
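As a rough sketch, the two buffer sizes can be set together like this. The names BufferedOpen and OpenReader are hypothetical, Encoding.ASCII assumes an ASCII file, and FileOptions.SequentialScan is an extra hint to the operating system that I'm adding on the assumption that the file is read front to back.

```csharp
using System.IO;
using System.Text;

static class BufferedOpen
{
    // Opens path for reading with the same buffer size on both layers.
    // Disposing the returned StreamReader also disposes the FileStream.
    public static StreamReader OpenReader(string path, int bufferSize)
    {
        var stream = new FileStream(
            path,
            FileMode.Open,
            FileAccess.Read,
            FileShare.Read,
            bufferSize,                  // FileStream's internal byte buffer
            FileOptions.SequentialScan); // hint only: sequential access expected

        // false = skip byte-order-mark detection; encoding is stated explicitly.
        return new StreamReader(stream, Encoding.ASCII, false, bufferSize);
    }
}
```

Typical use would be `using (var reader = BufferedOpen.OpenReader(path, 1 << 16)) { ... }`, trying a few sizes and measuring each one.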

DSO
+1  A: 

If your own code is examining one character at a time, you want to use a sentinel to mark the end of a buffer or the end of file, so that you have just one test in your inner loop. In your case that one test will be for end of line, so you'll want to temporarily stick a newline at the end of each buffer, for example.

The Wikipedia article on sentinels is not helpful at all; it doesn't describe this case. You can find a description in any of Robert Sedgewick's algorithms textbooks.
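The sentinel idea can be sketched roughly as follows. This is a toy that only counts newlines in a char buffer; SentinelScan and CountNewlines are made-up names, and the caller is assumed to leave one spare slot after the valid data. Note that the .NET runtime may still perform its own array bounds checks, so the gain could be smaller than in C.

```csharp
using System;

static class SentinelScan
{
    // Counts '\n' characters in buffer[0..length).
    // Requires buffer.Length > length so the sentinel has somewhere to go.
    public static int CountNewlines(char[] buffer, int length)
    {
        if (length < 0 || length >= buffer.Length)
            throw new ArgumentException("buffer needs one spare slot after the data");

        char saved = buffer[length];
        buffer[length] = '\n';               // sentinel: the scan is guaranteed to stop here

        int count = 0;
        int i = 0;
        while (true)
        {
            while (buffer[i] != '\n') i++;   // the only test in the inner loop
            if (i == length) break;          // stopped on the sentinel, not on real data
            count++;
            i++;                             // step past the real newline
        }

        buffer[length] = saved;              // restore whatever was in the spare slot
        return count;
    }
}
```

Without the sentinel, the inner loop would need a second test (`i < length`) on every character; with it, the end-of-buffer check runs only once per newline found.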

You might also want to look at re2c, which can generate very fast code for scanning text data. It generates C code but you may be able to adapt it, and you can certainly learn the techniques by reading their paper about re2c.

Norman Ramsey