I'm writing an app which needs to process a large text file (comma-separated with several different types of records - I do not have the power or inclination to change the data storage format). It reads in records (often all the records in the file sequentially, but not always), then the data for each record is passed off for some processing.

Right now this part of the application is single-threaded (read a record, process it, read the next record, and so on). I'm thinking it might be more efficient to read records into a queue in one thread and process them in another thread, either in small blocks or as they become available.

I have no idea how to start programming something like that, including the data structure that would be necessary or how to implement the multithreading properly. Can anyone give any pointers, or offer other suggestions about how I might improve performance here?

+2  A: 

Have a look at this article on CodeProject and at Filehelpers.com.

Hope this helps, Best regards, Tom.

tommieb75
I do not need the full functionality of a CSV parser - I know my data will not have any commas, quotation marks, or newlines in the fields. Full CSV is such a retarded format anyway.
jnylen
+1  A: 

Take a look at this tutorial; it contains all you need. These are the Microsoft tutorials, including code samples, for a case similar to the one you describe. Your producer fills the queue, while the consumer pops records off.

Creating, starting, and interacting between threads

Synchronizing two threads: a producer and a consumer
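
As a rough illustration of the producer/consumer idea (not the code from the linked tutorials), here is a minimal sketch that assumes .NET 4.5 or later, where BlockingCollection<T> and Task.Run are available. RecordPipeline, Run, and the capacity default are hypothetical names and values chosen for the example; the processing step is whatever delegate you pass in.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static class RecordPipeline
{
    // Reads lines on a producer task and hands them to a single consumer
    // (the calling thread) through a bounded queue.
    // Returns the number of records processed.
    public static int Run(TextReader reader, Action<string[]> process, int capacity = 1000)
    {
        // Bounded, so the reader cannot run arbitrarily far ahead of the processor.
        using (var queue = new BlockingCollection<string>(capacity))
        {
            Task producer = Task.Run(() =>
            {
                try
                {
                    string text;
                    while ((text = reader.ReadLine()) != null)
                        queue.Add(text);        // blocks while the queue is full
                }
                finally
                {
                    queue.CompleteAdding();     // lets the consumer loop end
                }
            });

            int count = 0;
            // GetConsumingEnumerable blocks while the queue is empty and ends
            // once CompleteAdding has been called and the queue has drained.
            foreach (string record in queue.GetConsumingEnumerable())
            {
                process(record.Split(','));
                count++;
            }

            producer.Wait();                    // rethrows any read error
            return count;
        }
    }
}
```

With a single consumer the records are processed in file order. If processing is the slow part, you could start several consumer tasks over the same queue, at the cost of losing that ordering. The sketch does not handle the case where the processing delegate throws while the producer is still running.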

Chris Kannon
+2  A: 

You might get a benefit if you can balance the time spent processing records against the time spent reading them, in which case you could use a producer/consumer setup: for example, a synchronized queue and a worker (or a few) dequeueing and processing. I might also be tempted to investigate Parallel Extensions; it is pretty easy to write an IEnumerable<T> version of your reading code, after which Parallel.ForEach (or one of the other Parallel methods) should do everything you want; for example:

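// Lazily yields one Person per line; nothing is read until the caller enumerates.
static IEnumerable<Person> ReadPeople(string path) {
    using(var reader = File.OpenText(path)) {
        string line;
        while((line = reader.ReadLine()) != null) {
            string[] parts = line.Split(',');
            // First field is passed as-is, second is parsed as an integer.
            yield return new Person(parts[0], int.Parse(parts[1]));
        }
    }
}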
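
To show how an iterator like this might be fed to Parallel.ForEach, here is a small sketch. ParallelRecords, ReadRecords, and SumSecondField are hypothetical names; ReadRecords stands in for ReadPeople and yields plain string arrays so the example does not depend on a Person type, and summing the second field is only a placeholder for the real processing.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

static class ParallelRecords
{
    // Lazily yields one split record per line.
    public static IEnumerable<string[]> ReadRecords(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
            yield return line.Split(',');
    }

    // Sums the integer in the second field of every record, in parallel.
    public static long SumSecondField(TextReader reader)
    {
        long total = 0;
        // Parallel.ForEach pulls items from the iterator one consumer at a time
        // and runs the loop body on multiple worker threads.
        Parallel.ForEach(ReadRecords(reader), parts =>
        {
            int value = int.Parse(parts[1]);
            Interlocked.Add(ref total, value);  // shared state needs synchronization
        });
        return total;
    }
}
```

Records are not guaranteed to be processed in file order here, so this only fits when each record can be handled independently.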
Marc Gravell
Your blocking queue seems to be what I am looking for, thanks. I will try that today.
jnylen
What about a lock-free queue: http://www.boyet.com/Articles/LockfreeQueue.html ? Would that be an improvement in my case? How can I use a profiler to determine how much time is spent waiting for other threads' locks to be released?
jnylen
A: 

You may also look at asynchronous I/O. In this style, you start a file operation from the main thread; it then continues running in the background, and when it completes, it invokes a callback that you specified. In the meantime, you can continue doing other things (such as processing the data). For example, you could start an asynchronous operation to read the next 1000 bytes, then process the 1000 bytes you already have, and then wait for the next kilobyte.

Unfortunately, programming asynchronous operations in C# is a bit painful. There is an MSDN sample, but it's not nice at all. This can be nicely solved in F# using asynchronous workflows. I wrote an article that explains the problem and shows how to do a similar thing using C# iterators.

A more promising solution for C# is the Wintellect PowerThreading library, which supports a similar trick using C# iterators. There is a good introductory article by Jeffrey Richter in the MSDN Concurrent Affairs column.
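
For comparison, here is a sketch of the "start the next read, process what you already have, then wait" pattern. It assumes a C# version with async/await (C# 5 or later) rather than any of the libraries mentioned above, and works line by line instead of in 1000-byte blocks. AsyncRecords and ProcessAsync are hypothetical names.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static class AsyncRecords
{
    // Overlaps reading with processing: the read of the next line is started
    // before the current line is processed.
    // Returns the number of records processed.
    public static async Task<int> ProcessAsync(TextReader reader, Action<string[]> process)
    {
        int count = 0;
        Task<string> pending = reader.ReadLineAsync();
        while (true)
        {
            string line = await pending;        // wait for the read started earlier
            if (line == null) break;            // end of input
            pending = reader.ReadLineAsync();   // start the next read now
            process(line.Split(','));           // and do the work while it runs
            count++;
        }
        return count;
    }
}
```

How much overlap you actually get depends on the reader. For a file, you would typically wrap a FileStream opened with FileOptions.Asynchronous in a StreamReader; an in-memory reader such as StringReader completes each read immediately.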

Tomas Petricek