views: 89
answers: 1

I just started using OpenMP with C++. My serial code looks something like this:

#include <iostream>
#include <string>
#include <sstream>
#include <vector>
#include <fstream>
#include <stdlib.h>

int main(int argc, char* argv[]) {
    std::string line;
    std::ifstream inputfile(argv[1]);

    if(inputfile.is_open()) {
        while(getline(inputfile, line)) {
            // Line gets processed and written into an output file
        }
    }
}

Because each line is processed pretty much independently and the input file is on the order of gigabytes, I am attempting to use OpenMP to parallelize this. I'm guessing that I first need to get the number of lines in the input file and then parallelize the code along these lines. Can someone please help me out here?

#include <iostream>
#include <string>
#include <sstream>
#include <vector>
#include <fstream>
#include <stdlib.h>

#ifdef _OPENMP
#include <omp.h>
#endif

int main(int argc, char* argv[]) {
    std::string line;
    std::ifstream inputfile(argv[1]);

    if(inputfile.is_open()) {
        //Calculate number of lines in file?
        //Set an output filename and open an ofstream
        #pragma omp parallel num_threads(8)
        {
            #pragma omp for schedule(dynamic, 1000)
            for(int i = 0; i < lines_in_file; i++) {
                // What do I do here? I cannot just read line i directly, because that requires random access
            }
        }
    }
}
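
One workaround I can think of is to skip counting lines altogether: read every line into a std::vector<std::string> up front, then parallelize over the vector. Roughly like this sketch (the real per-line work is replaced by a character count, and it assumes that holding all the lines in memory is acceptable, which may not be true for a multi-gigabyte file):

#include <iostream>
#include <string>
#include <vector>
#include <fstream>

#ifdef _OPENMP
#include <omp.h>
#endif

int main(int argc, char* argv[]) {
    if(argc < 2) return 1;

    std::ifstream inputfile(argv[1]);
    std::vector<std::string> lines;
    std::string line;

    // Read sequentially in a single thread; only the processing is parallel
    while(std::getline(inputfile, line))
        lines.push_back(line);

    long long total = 0; // stand-in result, e.g. total characters seen

    // Each iteration touches a different element, so no locking is needed
    #pragma omp parallel for schedule(dynamic, 1000) reduction(+:total)
    for(long i = 0; i < (long)lines.size(); i++) {
        // Placeholder for the real per-line work
        total += (long long)lines[i].size();
    }

    std::cout << "Processed " << lines.size() << " lines, "
              << total << " characters" << std::endl;
    return 0;
}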

EDIT:

Important Things

  1. Each line is independently processed
  2. The order of the results doesn't matter
+1  A: 

Not a direct OpenMP answer, but what you are probably looking for is a Map/Reduce approach. Take a look at Hadoop; it's written in Java, but there is at least some C++ API.

In general, you want to process this amount of data on different machines, not in multiple threads in the same process (virtual address space limitations, lack of physical memory, swapping, etc.). Also, the kernel will have to read the file from disk sequentially anyway, which is what you want; otherwise the hard drive would just have to do extra seeks for each of your threads.
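
If you do stay inside a single process anyway, one way to respect that is to keep a single sequential reader and hand fixed-size batches of buffered lines to OpenMP worker threads, so the disk never sees competing readers and memory stays bounded. A rough sketch (the batch size is an arbitrary pick, and the per-line work is again just a character count):

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>
#include <fstream>

#ifdef _OPENMP
#include <omp.h>
#endif

int main(int argc, char* argv[]) {
    if(argc < 2) return 1;

    std::ifstream inputfile(argv[1]);
    std::string line;
    const std::size_t batch_size = 100000; // arbitrary; tune for available memory
    std::vector<std::string> batch;
    batch.reserve(batch_size);

    long long total = 0; // stand-in for the real per-line result
    bool more = true;

    while(more) {
        // One sequential reader fills the batch, so there are no extra seeks
        batch.clear();
        while(batch.size() < batch_size) {
            if(!std::getline(inputfile, line)) { more = false; break; }
            batch.push_back(line);
        }

        // The in-memory batch is processed in parallel; memory stays bounded
        #pragma omp parallel for schedule(dynamic, 1000) reduction(+:total)
        for(long i = 0; i < (long)batch.size(); i++) {
            total += (long long)batch[i].size(); // placeholder work
        }
    }

    std::cout << total << std::endl;
    return 0;
}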

Nikolai N Fetissov
@Nikolai: Thanks for the explanation. What you said makes perfect sense now.
Legend