I have a database table that stores HTML content as a binary serialized blob. I need to retrieve the rows one by one, search each one for certain keywords (and report the matches found), and also save the content to disk as HTML files. Can I parallelize this using Parallel.ForEach? Is this a good idea, or is there a better one?

Thanks in advance for help, Ashish

+1  A: 

The I/O to the database and disk will be so much slower than your processor that you likely will not see any noticeable benefit from parallelization.

Brent Arias
@Mystagogue, thanks for your reply. Are you generalizing (because of the slowness) that you should not parallelize anything that involves I/O, or are you talking about this question specifically? Could you please explain?
ydobonmai
Yes, in general, disk I/O and super-computing / parallel processing don't go well together. There are some scenarios where it might make sense. For example, if you have a reader-writer problem (many readers that all want the same data, possibly one writer) then it may make sense.
Brent Arias
+1  A: 

I would suspect that if you can pull a set of rows out of the database in one query, process each row in parallel looking for keywords, and then save the batch to disk in a single step, you'd see significant benefits. If you are selecting rows one by one and processing them in a linear fashion, you'll see minimal benefits from doing things in parallel.

I think you'll just have to try it both ways and measure the difference to see if it really works for you. Obviously, it will make no difference on a single-core machine, and an 8-core machine that only processes two files may not see any significant benefit either, unless the keyword search takes a long time per file, in which case doing them in parallel becomes beneficial again. :) I think your best bet is to try a couple of different spikes with the various techniques and figure out what is best for you and your situation.
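To make the "pull a batch, process it in parallel, measure" idea concrete, here is a minimal sketch. It assumes the rows have already been fetched and deserialized into memory; the HtmlRow class, the KeywordScan helper, and the output directory are hypothetical names for illustration, not anything from the question.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical row shape: an id plus the already-deserialized HTML text.
public sealed class HtmlRow
{
    public HtmlRow(int id, string html) { Id = id; Html = html; }
    public int Id { get; }
    public string Html { get; }
}

public static class KeywordScan
{
    // Returns the keywords that occur in the HTML (case-insensitive).
    public static List<string> FindKeywords(string html, IEnumerable<string> keywords)
    {
        return keywords
            .Where(k => html.IndexOf(k, StringComparison.OrdinalIgnoreCase) >= 0)
            .ToList();
    }

    // Processes one batch in parallel: searches each row, writes it to disk,
    // and collects the matches per row id in a thread-safe dictionary.
    public static ConcurrentDictionary<int, List<string>> ProcessBatch(
        IEnumerable<HtmlRow> batch, IReadOnlyList<string> keywords, string outputDir)
    {
        Directory.CreateDirectory(outputDir);
        var matches = new ConcurrentDictionary<int, List<string>>();

        Parallel.ForEach(batch, row =>
        {
            var found = FindKeywords(row.Html, keywords);
            if (found.Count > 0)
                matches[row.Id] = found;

            // Each row goes to its own file, so the writes do not contend.
            File.WriteAllText(Path.Combine(outputDir, row.Id + ".html"), row.Html);
        });

        return matches;
    }

    // Times an action so the parallel and sequential variants can be compared.
    public static TimeSpan Time(Action action)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed;
    }
}
```

Wrapping a call to ProcessBatch in KeywordScan.Time, and doing the same for a plain foreach version of the loop, gives the two numbers to compare. Whether the parallel version wins depends on your data and disks, so treat any speedup as something to measure rather than assume.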

Dave White
Parallel.ForEach() worked a treat: almost 4 times faster than a normal foreach.
ydobonmai
A: 

I would do a Producer Consumer approach (http://en.wikipedia.org/wiki/Producer-consumer_problem):

One thread queries your database (if possible through some sort of cursor so that you can do it one by one), and places each row in a buffer.

Another thread (or maybe more than one, if the search is really processing-heavy) takes one row at a time from the buffer (with your HTML blob) and runs the search on it.

This way the querying and the processing can run simultaneously.
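A minimal sketch of that setup, assuming .NET's BlockingCollection<T> as the buffer. The Pipeline class and its parameters are hypothetical: readRows stands in for whatever cursor or reader yields rows from the database, and processRow stands in for the search-and-save work.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class Pipeline
{
    // Runs one producer and `consumerCount` consumers over a bounded buffer.
    public static void Run<T>(IEnumerable<T> readRows, Action<T> processRow,
                              int consumerCount = 2, int bufferSize = 100)
    {
        using (var buffer = new BlockingCollection<T>(bufferSize))
        {
            // Producer: enumerates the source and places each row in the buffer.
            var producer = Task.Run(() =>
            {
                try
                {
                    foreach (var row in readRows)
                        buffer.Add(row); // blocks while the buffer is full
                }
                finally
                {
                    // Signals the consumers that no more rows are coming.
                    buffer.CompleteAdding();
                }
            });

            // Consumers: each takes rows from the buffer until it is
            // both empty and marked as complete.
            var consumers = Enumerable.Range(0, consumerCount)
                .Select(_ => Task.Run(() =>
                {
                    foreach (var row in buffer.GetConsumingEnumerable())
                        processRow(row);
                }))
                .ToArray();

            Task.WaitAll(consumers.Concat(new[] { producer }).ToArray());
        }
    }
}
```

The bounded capacity keeps the producer from reading the whole table into memory when the consumers are slower. Note that this sketch does no error handling: if every consumer fails while the buffer is full, the producer stays blocked, so a real version would want a cancellation token.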

I don't believe you will get much of a performance gain, though, for the simple reason that your querying very likely takes far longer than the processing. The problem is that the querying part has disk reads as its bottleneck. In the end, your disk performance is very likely to be what limits your overall performance.

To check whether this is the case, you could run the producer/consumer setup with more than one producer (i.e. more than one thread querying the database).

I hope it helps.

Eduardo

Edu