I have a quad core machine and would like to write some code to parse a text file that takes advantage of all four cores. The text file basically contains one record per line.

Multithreading isn't my forte so I'm wondering if anyone could give me some patterns that I might be able to use to parse the file in an optimal manner.

My first thought is to read all the lines into some sort of queue and then spin up threads to pull the lines off the queue and process them, but that means the queue would have to exist in memory, and these are fairly large files, so I'm not keen on that idea.

My next thought is to have some sort of controller that reads in a line and assigns it to a thread to parse, but I'm not sure whether the controller will end up being a bottleneck if the threads can process the lines faster than it can read and assign them.

I know there's probably a simpler solution than both of these, but at the moment I'm just not seeing it.

+7  A: 

I'd go with your original idea. If you are concerned that the queue might get too large, implement a buffer zone for it (i.e. if it gets above 100 lines, stop reading the file, and if it drops below 20, start reading again; you'd need to do some testing to find the optimal barriers). Make it so that any of the threads can potentially be the "reader thread"; since a thread has to lock the queue to pull an item off anyway, it can also check whether the "low buffer region" has been hit and start reading again. While it's doing this, the other threads can work through the rest of the queue.
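
Here's a minimal, untested C# sketch of that buffered-queue idea. The 100/20 watermarks and the ProcessLine callback are placeholders you'd tune and replace, and it uses one dedicated reader thread rather than the any-thread-can-read variant, to keep it short:

    using System.Collections.Generic;
    using System.IO;
    using System.Threading;

    class BufferedLineQueue
    {
        const int HighWater = 100; // reader pauses when the queue reaches this
        const int LowWater = 20;   // reader is woken once workers drain to this
        readonly Queue<string> queue = new Queue<string>();
        readonly object sync = new object();
        bool done;

        public void Produce(string path)
        {
            using (StreamReader reader = new StreamReader(path))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    lock (sync)
                    {
                        while (queue.Count >= HighWater)
                            Monitor.Wait(sync); // buffer full: stop reading
                        queue.Enqueue(line);
                        Monitor.PulseAll(sync); // wake any idle workers
                    }
                }
                lock (sync) { done = true; Monitor.PulseAll(sync); }
            }
        }

        public void Consume()
        {
            while (true)
            {
                string line;
                lock (sync)
                {
                    while (queue.Count == 0 && !done)
                        Monitor.Wait(sync);
                    if (queue.Count == 0) return; // done and fully drained
                    line = queue.Dequeue();
                    if (queue.Count == LowWater)
                        Monitor.PulseAll(sync); // backlog is low: wake the reader
                }
                ProcessLine(line); // placeholder for the real per-record work
            }
        }

        static void ProcessLine(string line) { /* parse the record here */ }
    }

Spin up one thread running Produce and three running Consume, then Join all four.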

Or if you prefer, have one reader thread assign the lines to three other processor threads (via their own queues) and implement a work-stealing strategy. I've never done this so I don't know how hard it is.

Wolfbyte
A: 

@Mike: That really doesn't help at all. Each line is going to contain transaction information. The type of transaction determines what action needs to be taken. Some actions will be quick, like calculating tax on a transaction and saving it to the database. Other actions will be slower and require a number of steps, like updating the client transaction table (in a database) and then reconciling the transaction against a third-party general ledger.

I find it hard to believe that multithreading those tasks is going to leave me disk bound.

What I do believe is that getting stuck processing a "slow" transaction, like a reconciliation, shouldn't mean that another core can't go off and continue to process other transaction types.

So like I mentioned in the original post, any information on patterns that could help me multithread the parsing (and processing) of this file would be great.

lomaxx
A: 

@Wolfbyte: Those are some interesting ideas, particularly the buffering idea. I'll look into a work-stealing strategy if I run into problems with buffering, but from what I can tell, buffering should provide a nice solution.

For now, that's going to be my accepted answer :)

Thanks

lomaxx
+1  A: 

My experience is with Java, not C#, so apologies if these solutions don't apply.

The immediate solution I can think of off the top of my head would be to have an executor that runs 3 threads (using Executors.newFixedThreadPool, say). For each line/record read from the input file, fire off a job to the executor (using ExecutorService.submit). The executor will queue requests for you and allocate work among the 3 threads.

Probably better solutions exist, but hopefully that will do the job. :-)

ETA: Sounds a lot like Wolfbyte's second solution. :-)

ETA2: System.Threading.ThreadPool sounds like a very similar idea in .NET. I've never used it, but it may be worth your while!
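
If it works the way I'd expect, a rough, untested C# sketch might look like the following; ProcessLine is a placeholder for the real per-record work:

    using System.IO;
    using System.Threading;

    class ThreadPoolParser
    {
        static int pending;
        static readonly ManualResetEvent AllDone = new ManualResetEvent(false);

        static void Main(string[] args)
        {
            Interlocked.Increment(ref pending); // hold one token while reading
            using (StreamReader reader = new StreamReader(args[0]))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    Interlocked.Increment(ref pending);
                    ThreadPool.QueueUserWorkItem(Process, line); // pool queues and dispatches
                }
            }
            if (Interlocked.Decrement(ref pending) > 0) // release the reading token
                AllDone.WaitOne(); // block until the last work item signals
        }

        static void Process(object state)
        {
            ProcessLine((string)state);
            if (Interlocked.Decrement(ref pending) == 0)
                AllDone.Set(); // this was the last outstanding record
        }

        static void ProcessLine(string line) { /* parse the record here */ }
    }

Note that this queues one work item per line, so for a very large file you'd still want some throttling on the read side.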

Chris Jester-Young
A: 

It's kind of related, but Tim Bray had a similar task and turned it into kind of a concurrent programming language shoot-out/test case: http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder

Michael Neale
+3  A: 

This will eliminate the bottleneck of having a single thread do the reading:

open file
for each thread n = 0,1,2,3:
    seek to file offset n/4 * filesize
    if n > 0, scan forward to the start of the next complete line
    process lines until you pass offset (n+1)/4 * filesize
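
In C#, a minimal untested sketch of this could look like the following, assuming single-byte text with '\n' line endings (ProcessLine is a placeholder). Each thread reads past its end offset to finish the line it is in, and every thread but the first skips the partial line it lands on, so each line is handled exactly once:

    using System.IO;
    using System.Text;
    using System.Threading;

    class ChunkedParser
    {
        const int ThreadCount = 4;

        static void Main(string[] args)
        {
            string path = args[0];
            long fileSize = new FileInfo(path).Length;
            Thread[] threads = new Thread[ThreadCount];
            for (int i = 0; i < ThreadCount; i++)
            {
                int n = i; // capture the loop variable for the closure
                long start = n * fileSize / ThreadCount;
                long end = (n + 1) * fileSize / ThreadCount;
                threads[n] = new Thread(() => ParseChunk(path, start, end, n == 0));
                threads[n].Start();
            }
            foreach (Thread t in threads) t.Join();
        }

        static void ParseChunk(string path, long start, long end, bool isFirst)
        {
            using (FileStream fs = new FileStream(path, FileMode.Open,
                                                  FileAccess.Read, FileShare.Read))
            {
                fs.Seek(start, SeekOrigin.Begin);
                if (!isFirst)
                    SkipToNextLine(fs); // the previous thread owns the line we landed in
                StringBuilder line = new StringBuilder();
                int b;
                while ((b = fs.ReadByte()) != -1)
                {
                    if (b == '\n')
                    {
                        ProcessLine(line.ToString());
                        line.Length = 0;
                        if (fs.Position > end)
                            break; // finished the line that crosses our boundary
                    }
                    else if (b != '\r')
                    {
                        line.Append((char)b);
                    }
                }
                if (line.Length > 0)
                    ProcessLine(line.ToString()); // final line with no trailing newline
            }
        }

        static void SkipToNextLine(FileStream fs)
        {
            int b;
            while ((b = fs.ReadByte()) != -1 && b != '\n') { }
        }

        static void ProcessLine(string line) { /* parse the record here */ }
    }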
Mark Harrison
+7  A: 

Mark's answer is the simpler, more elegant solution. Why build a complex program with inter-thread communication if it's not necessary? Spawn 4 threads. Each thread calculates size-of-file/4 to determine its start point (and stop point). Each thread can then work entirely independently.

The only reason to add a special thread to handle reading is if you expect some lines to take a very long time to process and you expect that these lines are clustered in a single part of the file. Adding inter-thread communication when you don't need it is a very bad idea. You greatly increase the chance of introducing an unexpected bottleneck and/or synchronization bugs.

Derek Park
A: 

@Derek & Mark: I wish there was a way to accept 2 answers. I'm going to have to end up going with Wolfbyte's solution, because if I split the file into n sections there is the potential for a thread to come across a batch of "slow" transactions. However, if I were processing a file where each record was guaranteed to require an equal amount of processing, then I'd really like your solution of just splitting the file into chunks, assigning each chunk to a thread, and being done with it.

lomaxx
A: 

@lomaxx:

No worries. If clustered "slow" transactions are an issue, then the queuing solution is the way to go. Depending on how fast or slow the average transaction is, you might also want to look at assigning multiple lines at a time to each worker, as sketched below. This will cut down on synchronization overhead. Likewise, you might need to optimize your buffer size. Of course, both of these are optimizations that you should probably only do after profiling. (No point in worrying about synchronization if it's not a bottleneck.)
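
For example (a sketch only; the enqueueBatch callback is a hypothetical stand-in for however you hand work to your shared queue):

    using System;
    using System.Collections.Generic;
    using System.IO;

    class BatchingReader
    {
        const int BatchSize = 50; // tune after profiling

        // One lock acquisition per batch instead of per line.
        static void ReadInBatches(string path, Action<List<string>> enqueueBatch)
        {
            List<string> batch = new List<string>(BatchSize);
            using (StreamReader reader = new StreamReader(path))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    batch.Add(line);
                    if (batch.Count == BatchSize)
                    {
                        enqueueBatch(batch);
                        batch = new List<string>(BatchSize); // workers own the old list now
                    }
                }
            }
            if (batch.Count > 0) enqueueBatch(batch); // flush the final partial batch
        }
    }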

Derek Park
A: 

@lomaxx:

No problem, happy to help... let us know how it goes!

I had opened a uservoice suggestion to allow multiple accepted answers, but I haven't checked to see its status.

Mark Harrison
+1  A: 

Since the bottleneck will generally be in the processing rather than the reading when dealing with files, I'd go with the producer-consumer pattern. To avoid locking, I'd look at lock-free lists. Since you are using C#, you can take a look at Julian Bucknall's Lock-Free List code.
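
If you'd rather not roll your own, the ConcurrentQueue<T> type from the Parallel Extensions / later .NET releases fills the same role (an assumption here, since it may not be available to you). A sketch of the worker side:

    using System.Collections.Concurrent;
    using System.Threading;

    class LockFreePipeline
    {
        static readonly ConcurrentQueue<string> Lines = new ConcurrentQueue<string>();
        static volatile bool readingDone; // set by the reader after the last enqueue

        static void Worker()
        {
            string line;
            while (!readingDone || !Lines.IsEmpty)
            {
                if (Lines.TryDequeue(out line))
                    ProcessLine(line); // placeholder for the real per-record work
                else
                    Thread.Sleep(0); // back off briefly while the queue is empty
            }
        }

        static void ProcessLine(string line) { /* parse the record here */ }
    }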

graham.reeds
A: 

If the text that you are parsing is made up of repeated strings and tokens, break the file into chunks, and for each chunk you could have one thread pre-parse it into tokens consisting of keywords, "punctuation", ID strings, and values. String compares and lookups can be quite expensive, and passing this off to several worker threads can speed up the purely logical/semantic part of the code if it doesn't have to do the string lookups and comparisons.

The pre-parsed data chunks (where you have already done all the string comparisons and "tokenized" it) can then be passed to the part of the code that would actually look at the semantics and ordering of the tokenized data.
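
A rough sketch of what that pre-parse stage could look like; TokenType, Token, and the classification rules here are all hypothetical placeholders:

    using System.Collections.Generic;

    enum TokenType { Keyword, Punctuation, Identifier, Value }

    struct Token
    {
        public TokenType Type;
        public int Id; // index into a shared string table, not the string itself
    }

    class PreParser
    {
        readonly Dictionary<string, int> stringTable = new Dictionary<string, int>();

        // The expensive string compares and lookups happen once, here,
        // on a worker thread; the semantic stage only ever sees ints.
        public List<Token> Tokenize(IEnumerable<string> chunk)
        {
            List<Token> tokens = new List<Token>();
            foreach (string line in chunk)
                foreach (string word in line.Split(' ', '\t'))
                    tokens.Add(Classify(word));
            return tokens;
        }

        Token Classify(string word)
        {
            int id;
            lock (stringTable) // one shared table; per-thread tables would avoid this lock
            {
                if (!stringTable.TryGetValue(word, out id))
                {
                    id = stringTable.Count;
                    stringTable.Add(word, id);
                }
            }
            return new Token { Type = TokenType.Identifier, Id = id }; // real code would classify properly
        }
    }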

Also, you mention you are concerned with the size of your file occupying a large amount of memory. There are a couple of things you could do to cut back on your memory budget.

Split the file into chunks and parse it chunk by chunk. Read in only as many chunks as you are working on at a time, plus a few for read-ahead, so you do not stall on disk when you finish one chunk and move to the next.

Alternatively, large files can be memory-mapped and "demand" loaded. If you have more threads working on processing the file than CPUs (usually threads = 1.5-2x the number of CPUs is a good ratio for demand-paging apps), the threads that stall on IO for the memory-mapped file will be blocked automatically by the OS until their memory is ready, and the other threads will continue to process.
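
A minimal sketch with System.IO.MemoryMappedFiles (an assumption: that namespace arrived in .NET 4, so it may not be available to you; boundary-line handling is omitted for brevity, see the seek-and-scan sketch above):

    using System.IO;
    using System.IO.MemoryMappedFiles;
    using System.Threading;

    class MappedChunks
    {
        static void Main(string[] args)
        {
            string path = args[0];
            long size = new FileInfo(path).Length;
            int threadCount = 6; // ~1.5x a quad core, per the rule of thumb above
            using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
            {
                Thread[] pool = new Thread[threadCount];
                for (int i = 0; i < threadCount; i++)
                {
                    long start = i * size / threadCount;
                    long length = (i + 1) * size / threadCount - start;
                    MemoryMappedViewStream view = mmf.CreateViewStream(start, length);
                    pool[i] = new Thread(() =>
                    {
                        // Pages fault in on demand; a thread stalled on IO
                        // yields the CPU to the remaining threads.
                        using (StreamReader r = new StreamReader(view))
                        {
                            string line;
                            while ((line = r.ReadLine()) != null)
                                ProcessLine(line); // placeholder per-record work
                        }
                    });
                    pool[i].Start();
                }
                foreach (Thread t in pool) t.Join();
            }
        }

        static void ProcessLine(string line) { }
    }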

Adisak