I have some large files (from several gigabytes to hundreds of gigabytes) that I'm searching and trying to find every occurrence of a given string.

I've been looking into making this operate in parallel and have some questions.

  1. How should I be doing this? I can't copy the entire file into memory since it's too big. Will multiple FILE* pointers work?

  2. How many threads can I put on the file before the disk bandwidth becomes a limiting factor, rather than the CPU? How can I work around this?

Currently, my plan is to use 4 threads, give each a FILE* positioned at 0%, 25%, 50%, and 75% of the way through the file, have each save its results to a file or to memory, and then collect the results as a final step. With this approach, depending on bandwidth, I could easily add more threads and possibly get a bigger speedup.
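The plan above could be sketched like this with pthreads. All the names (`count_occurrences_parallel`, `search_chunk`) are illustrative, not from any particular library; each thread opens its own FILE*, seeks to its chunk, and reads a few extra overlap bytes so a match straddling a chunk boundary is counted exactly once, by the earlier chunk:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct job {
    const char *path;    /* file to search */
    const char *needle;  /* string to find */
    long start, len;     /* this thread's chunk */
    long count;          /* matches found (output) */
};

static void *search_chunk(void *arg)
{
    struct job *j = arg;
    size_t nlen = strlen(j->needle);
    /* Read the chunk plus nlen-1 overlap bytes so matches that straddle
     * a chunk boundary are still seen by the chunk they start in. */
    long to_read = j->len + (long)nlen - 1;
    char *buf = malloc(to_read);
    FILE *fp = fopen(j->path, "rb");   /* each thread owns its own FILE* */
    if (!fp || !buf) { free(buf); if (fp) fclose(fp); return NULL; }
    /* Note: real code for multi-GB files should use fseeko/off_t here,
     * since long may be 32-bit. */
    fseek(fp, j->start, SEEK_SET);
    long got = (long)fread(buf, 1, to_read, fp);
    fclose(fp);
    /* Count only matches that *start* inside this chunk. */
    for (long i = 0; i + (long)nlen <= got && i < j->len; i++)
        if (memcmp(buf + i, j->needle, nlen) == 0)
            j->count++;
    free(buf);
    return NULL;
}

long count_occurrences_parallel(const char *path, const char *needle,
                                int nthreads)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;
    fseek(fp, 0, SEEK_END);
    long fsize = ftell(fp);
    fclose(fp);

    pthread_t tid[nthreads];
    struct job jobs[nthreads];
    long chunk = fsize / nthreads;
    for (int i = 0; i < nthreads; i++) {
        long len = (i == nthreads - 1) ? fsize - i * chunk : chunk;
        jobs[i] = (struct job){ path, needle, i * chunk, len, 0 };
        pthread_create(&tid[i], NULL, search_chunk, &jobs[i]);
    }
    long total = 0;
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        total += jobs[i].count;
    }
    return total;
}
```

This only counts occurrences; saving positions per thread and merging at the end, as described, is a straightforward extension.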

What do you think?

EDIT: When I said memory bandwidth, I actually meant disk I/O. Sorry about that.

+1  A: 

I doubt memory bandwidth will be as big a problem as disk I/O limitations. With most hardware, you're going to be very restricted in how each thread can read from the disk.

If you want to maximize throughput, you may need to do something like have one thread whose job is to handle disk I/O (most hardware can only stream one chunk from disk at a time, so that will be the limiting factor). It can then push chunks of memory off to individual threads in some kind of thread pool to process.
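A minimal sketch of that one-reader / many-workers pattern, assuming pthreads and a small bounded queue (all names here are made up for illustration; matches that straddle a chunk boundary are ignored to keep the sketch short):

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (1 << 16)   /* 64 KiB per read */
#define NWORK 4           /* worker threads */
#define QSIZE 8           /* bounded queue of pending chunks */

struct chunk { char *data; size_t len; };

static struct chunk queue[QSIZE];
static int qhead, qtail, qcount, done;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static const char *needle;
static long total;        /* protected by mu */

static void *worker(void *arg)
{
    (void)arg;
    size_t nlen = strlen(needle);
    for (;;) {
        pthread_mutex_lock(&mu);
        while (qcount == 0 && !done)
            pthread_cond_wait(&not_empty, &mu);
        if (qcount == 0 && done) { pthread_mutex_unlock(&mu); return NULL; }
        struct chunk c = queue[qhead];
        qhead = (qhead + 1) % QSIZE;
        qcount--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&mu);

        long n = 0;       /* scan outside the lock */
        for (size_t i = 0; i + nlen <= c.len; i++)
            if (memcmp(c.data + i, needle, nlen) == 0) n++;
        free(c.data);

        pthread_mutex_lock(&mu);
        total += n;
        pthread_mutex_unlock(&mu);
    }
}

long scan_file(const char *path, const char *what)
{
    needle = what; total = 0; qhead = qtail = qcount = done = 0;
    pthread_t tid[NWORK];
    for (int i = 0; i < NWORK; i++)
        pthread_create(&tid[i], NULL, worker, NULL);

    FILE *fp = fopen(path, "rb");   /* the one and only disk reader */
    if (!fp) return -1;
    for (;;) {
        char *buf = malloc(CHUNK);
        size_t got = fread(buf, 1, CHUNK, fp);
        if (got == 0) { free(buf); break; }
        pthread_mutex_lock(&mu);
        while (qcount == QSIZE)     /* back-pressure on the reader */
            pthread_cond_wait(&not_full, &mu);
        queue[qtail] = (struct chunk){ buf, got };
        qtail = (qtail + 1) % QSIZE;
        qcount++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&mu);
    }
    fclose(fp);

    pthread_mutex_lock(&mu);
    done = 1;
    pthread_cond_broadcast(&not_empty);
    pthread_mutex_unlock(&mu);
    for (int i = 0; i < NWORK; i++) pthread_join(tid[i], NULL);
    return total;
}
```

The bounded queue is what keeps the reader sequential: the disk is always read front to back by one thread, and the workers only ever touch memory, so the seek-thrashing problem with multiple FILE* pointers never comes up.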

My guess is that your processing will be fast - probably much faster than the disk I/O - but if it's slow, having multiple processing threads could speed up your entire operation.

Multiple FILE* pointers will work - but may actually be slower than having just one, since they'll end up time-slicing reads of the file, and you'll be seeking around the disk more.

Reed Copsey
+5  A: 

With this new revised version of the question, the answer is "almost immediately". Hard disks aren't very good at reading from two places on the disk at the same time. :) If you had multiple hard drives and split your file across them, you could probably take advantage of some threading. To be fair, though, I would say that the disk speed is already the limiting factor. I highly doubt that your disk can read data faster than the processor can handle it.

Nick Lewis
Couldn't RAID configurations speed up multiple-sector reading dramatically?
xtofl
Yes, as an appropriate RAID configuration would essentially do the same as splitting the file across multiple disks would. I would also think that solid-state disks might provide a substantial improvement, as they can read from multiple sectors at once, since they don't need to spin. That's why I asked for more information about his ability to upgrade his hardware. I think the I/O would still probably be the bottleneck, though. Processors are FAST FAST FAST.
Nick Lewis
A: 

If you are using an SSD drive, you may overcome this problem by searching through the file in parallel with multiple file pointers.

Janaka