I have some large files (from several gigabytes to hundreds of gigabytes) that I need to search for every occurrence of a given string.
I've been looking into making this operate in parallel and have some questions.
How should I be doing this? I can't read the entire file into memory since it's too big. Will multiple FILE* pointers work?
How many threads can I put on the file before disk bandwidth, rather than the CPU, becomes the limiting factor? How can I work around this?
Currently, my plan is to use 4 threads, start each one with its own FILE* at 0%, 25%, 50%, and 75% of the way through the file, have each thread save its results to a file or to memory, and then merge the results as a final step. With this approach, depending on bandwidth, I could easily add more threads and possibly get a bigger speedup.
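To make the idea concrete, here's a rough, untested sketch of what I have in mind (assumes POSIX threads, so it needs `-lpthread`; the `count_parallel` and `scan` names are just placeholders I made up). One detail I had to handle: a match can straddle a chunk boundary, so each thread keeps scanning past its nominal end until the *start* of its sliding window reaches the boundary, and counts a match only if it starts inside its own chunk. That way boundary-spanning matches are counted exactly once:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 4

typedef struct {
    const char *path;
    const char *needle;
    long start, end;   /* this thread owns match positions in [start, end) */
    long count;
} job_t;

/* Scan one chunk with a private FILE*, using a sliding window of
   needle length; a match is credited to the chunk its first byte
   falls in, so boundary-spanning matches are never double-counted. */
static void *scan(void *arg) {
    job_t *j = arg;
    size_t nlen = strlen(j->needle);
    FILE *f = fopen(j->path, "rb");   /* each thread gets its own FILE* */
    if (!f || nlen == 0) { if (f) fclose(f); return NULL; }
    fseek(f, j->start, SEEK_SET);

    char *win = malloc(nlen);
    size_t filled = 0;
    long pos = j->start;   /* offset of the window's first byte */
    int c;
    while ((c = fgetc(f)) != EOF) {
        if (filled < nlen) {
            win[filled++] = (char)c;
        } else {
            memmove(win, win + 1, nlen - 1);
            win[nlen - 1] = (char)c;
            pos++;
        }
        if (filled == nlen) {
            if (pos >= j->end) break;   /* match start belongs to next chunk */
            if (memcmp(win, j->needle, nlen) == 0)
                j->count++;
        }
    }
    free(win);
    fclose(f);
    return NULL;
}

/* Split the file into NTHREADS chunks, scan them in parallel,
   and merge the per-thread counts at the end. Returns -1 on error. */
long count_parallel(const char *path, const char *needle) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fclose(f);

    pthread_t tid[NTHREADS];
    job_t job[NTHREADS];
    long per = size / NTHREADS;
    for (int i = 0; i < NTHREADS; i++) {
        job[i] = (job_t){ path, needle, i * per,
                          (i == NTHREADS - 1) ? size : (i + 1) * per, 0 };
        pthread_create(&tid[i], NULL, scan, &job[i]);
    }
    long total = 0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += job[i].count;
    }
    return total;
}
```

The byte-at-a-time `fgetc` is just to keep the sketch short; real code would read larger blocks per thread, but the chunk-ownership rule for boundary matches stays the same.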
What do you think?
EDIT: When I said memory bandwidth, I actually meant disk I/O. Sorry about that.