views: 86
answers: 3
Let's say that you want to write an application that processes multiple text files, supplied as arguments on the command line (e.g., MyProcessor file1 file2 ...). This is a very common task for which Perl is often used, but what if you wanted to take advantage of .NET directly and use C#?

What is the simplest C# 4.0 application boilerplate code that allows you to do this? It should basically process each line from each file, doing something with that line by calling a function to process it, or perhaps there's a better way to do this sort of "group" line processing (e.g., LINQ or some other method).
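
For illustration, a minimal LINQ-style sketch of the kind of "group" line processing I have in mind (the ProcessLine method here is just a placeholder for whatever per-line work gets done):

using System.IO;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        // Flatten every file named on the command line into one lazy sequence of lines
        foreach (var line in args.SelectMany(file => File.ReadLines(file)))
        {
            ProcessLine(line);
        }
    }

    static void ProcessLine(string line)
    {
        // Placeholder: do something with the line
    }
}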

+2  A: 

Simple:



foreach (var f in args)
{
    // Reads the entire file into a single string
    var fileContent = File.ReadAllText(f);
    // Logic goes here
}

Jon Preece
Do you ever write this code? And what purpose does it serve?
saurabh
It solves the problem of reading multiple files whose paths have been passed in through arguments.
Jon Preece
Sadly, ReadAllText will load the entire file into memory. This is cost-prohibitive for large files. The key here is to efficiently process the file one line at a time, regardless of file size.
Michael Goldshteyn
Try something like: string[] fileContent = File.ReadAllLines(f);
Jon Preece
That will use a lot of memory with large files, as it copies the entire contents into memory even though the poster only requires a line at a time.
DamienG
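A sketch of a more memory-friendly variant: File.ReadLines is lazy and yields one line at a time, so only the current line is held in memory (args and ProcessLine are assumed to be the same as in the answer above).

foreach (var f in args)
{
    // ReadLines streams the file lazily instead of loading it all at once
    foreach (var line in File.ReadLines(f))
    {
        ProcessLine(line);
    }
}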
+9  A: 

You could process files in parallel by reading each line and passing it to a processing function:

using System.IO;
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        // Each file named on the command line is processed in parallel
        Parallel.ForEach(args, file =>
        {
            using (var stream = File.OpenRead(file))
            using (var reader = new StreamReader(stream))
            {
                // Read line by line so only the current line is held in memory
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    ProcessLine(line);
                }
            }
        });
    }

    static void ProcessLine(string line)
    {
        // TODO: process the line
    }
}

Now simply call: SomeApp.exe file1 file2 file3

Pros of this approach:

  • Files are processed in parallel => taking advantage of multiple CPU cores
  • Files are read line by line and only the current line is kept in memory, which reduces memory consumption and allows you to work with big files
Darin Dimitrov
A very interesting solution, I must say. I suppose making the processing non-parallel wouldn't be too hard, either?
Michael Goldshteyn
+1 for using the latest Parallel concepts
saurabh
@Michael: Just swap out the Parallel.ForEach with a standard foreach loop...
Reed Copsey
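A sketch of the sequential variant Reed describes, with the Parallel.ForEach swapped for a plain foreach (ProcessLine as in the answer above):

foreach (var file in args)
{
    using (var stream = File.OpenRead(file))
    using (var reader = new StreamReader(stream))
    {
        // Files are handled one after another instead of in parallel
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            ProcessLine(line);
        }
    }
}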
+1  A: 

After much experimenting, I found that changing this line in Darin Dimitrov's answer:

using (var stream = File.OpenRead(file))

to:

using (var stream = new FileStream(file, System.IO.FileMode.Open,
                                   System.IO.FileAccess.Read,
                                   System.IO.FileShare.ReadWrite,
                                   65536))

(so that the read buffer size grows from the 4 KB default to 64 KB) can shave as much as 10% off the file read time when reading line by line via a StreamReader, especially if the text file is large. Larger buffer sizes do not seem to improve performance further.

This improvement is present even when reading from a relatively fast SSD. The savings are even more substantial if an ordinary hard drive is used. Interestingly, you get this significant performance improvement even if the file is already cached by the (Windows 7 / 2008 R2) OS, which is somewhat counterintuitive.

Michael Goldshteyn
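
Putting the two answers together, a sketch of the body of Darin's Main with the larger 64 KB buffer wired in (the buffer size comes from the answer above; everything else is unchanged):

Parallel.ForEach(args, file =>
{
    // 64 KB read buffer instead of the 4 KB default
    using (var stream = new FileStream(file, FileMode.Open, FileAccess.Read,
                                       FileShare.ReadWrite, 65536))
    using (var reader = new StreamReader(stream))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            ProcessLine(line);
        }
    }
});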