views: 114

answers: 3

I am tackling the challenge of using the capabilities of both an 8-core machine and a high-end GPU (Tesla 10).

I have one big input file, one thread per core, plus one thread that handles the GPU. To be efficient, the GPU thread needs a large batch of lines from the input, while a CPU thread needs only one line to proceed (storing multiple lines in a temporary buffer was slower). The file doesn't need to be read sequentially. I am using Boost.

My strategy is to have a mutex on the input stream, with each thread locking and unlocking it. This is not optimal, because the GPU thread should have higher priority when acquiring the mutex, being the fastest and most demanding one.

I can come up with different solutions, but before rushing into an implementation I would like some guidelines.

What approach do you use / recommend?

A: 

I would use a buffer. Have a single reader thread fill that buffer from disk. Each worker thread locks the buffer, copies data into its own local buffer, then releases the mutex before processing the data.
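
A minimal sketch of that scheme, assuming std::thread and the standard mutex/condition variable (Boost's equivalents are near drop-in replacements); LineQueue, reader, and worker are illustrative names, input.txt is a placeholder path, and processLine stands in for the per-line work:

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <fstream>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

void processLine(const std::string &) { /* your per-line work goes here */ }

struct LineQueue {
    std::mutex m;
    std::condition_variable cv;
    std::deque<std::string> lines;
    bool done = false;
};

// The single thread that fills the shared buffer from disk.
void reader(LineQueue &q, const std::string &path)
{
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        { std::lock_guard<std::mutex> lock(q.m); q.lines.push_back(std::move(line)); }
        q.cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(q.m); q.done = true; }
    q.cv.notify_all();
}

// Each consumer holds the lock only long enough to move up to `batch`
// lines into its own buffer, then processes them outside the lock.
void worker(LineQueue &q, std::size_t batch)
{
    std::vector<std::string> local;
    for (;;) {
        {
            std::unique_lock<std::mutex> lock(q.m);
            q.cv.wait(lock, [&] { return !q.lines.empty() || q.done; });
            while (!q.lines.empty() && local.size() < batch) {
                local.push_back(std::move(q.lines.front()));
                q.lines.pop_front();
            }
        }
        if (local.empty()) return;              // drained and reader finished
        for (const auto &l : local) processLine(l);
        local.clear();
    }
}

int main()
{
    LineQueue q;
    std::thread rd(reader, std::ref(q), std::string("input.txt"));
    std::vector<std::thread> cpu;
    for (int i = 0; i < 8; ++i) cpu.emplace_back(worker, std::ref(q), 1); // 1 line each
    std::thread gpu(worker, std::ref(q), 4096); // the GPU thread takes big batches
    rd.join();
    for (auto &t : cpu) t.join();
    gpu.join();
    return 0;
}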

Menox
+1  A: 

Some ideas:

1) Since the bottleneck is not the I/O, the file should be kept almost entirely in RAM for easier access.

2) The implementation should not allow threads to block. It's better to have a slightly suboptimal solution if that reduces blocking.

Assuming we have a big data file, threads can employ shot-in-the-dark tactics. Once a thread acquires the lock, it simply increments fpos by as many bytes as it wants and unlocks; it has thereby granted itself the privilege to process that part of the memory. For example, the thread could process all the lines whose beginnings lie in that fragment (see the sketch at the end of this answer).

Outcomes:

1) It's almost impossible for a thread to block. Lock hold times are very short (in the range of several instructions, plus the time to flush caches).

2) Flexibility: a thread can take as much data as it wants to.

Of course, there should be some mechanism to adapt to the line lengths in the data file, to avoid the worst-case scenario.
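
For illustration, here is a sketch of the claiming step under stated assumptions: the file is memory-mapped read-only into buf, an atomic fetch-and-add plays the role of the lock/increment/unlock sequence, and claimChunk, processClaim, and processLine are hypothetical names:

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <string_view>

std::atomic<std::size_t> fpos{0};               // the shared file position

// The entire critical section: grab `chunk` bytes and advance the cursor.
bool claimChunk(std::size_t fileSize, std::size_t chunk,
                std::size_t &begin, std::size_t &end)
{
    begin = fpos.fetch_add(chunk);
    if (begin >= fileSize) return false;        // file exhausted
    end = std::min(begin + chunk, fileSize);
    return true;
}

void processLine(std::string_view line);        // your per-line work

// A thread owns exactly the lines that begin inside [begin, end):
// it skips a partial first line and finishes its last line past `end`.
void processClaim(const char *buf, std::size_t fileSize,
                  std::size_t begin, std::size_t end)
{
    std::size_t p = begin;
    if (p != 0)
        while (p < end && buf[p - 1] != '\n') p++;   // skip the partial first line
    while (p < end) {
        std::size_t q = p;
        while (q < fileSize && buf[q] != '\n') q++;  // may legitimately run past `end`
        processLine(std::string_view(buf + p, q - p));
        p = q + 1;
    }
}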

buratinas
You are suggesting mapping the entire file into memory and then having each thread take a segment of that memory depending on its needs. This is interesting because it's like @Menox's buffer idea, but you don't lock while processing the data. The only dark part is when a line is not entirely inside one single segment; I guess a thread can read over into the next segment to complete its last line.
fabrizioM
Yes. Actually this is not a big problem, because threads can safely read from anywhere in the file as it's read-only. The only issue is ownership. One mechanism I can think of is that a thread owns a line if it finds the preceding newline character in its own segment. The first line of the file would, of course, be an exception then.
buratinas
+2  A: 

You may not need to lock at all if "1 line per thread" is not a strict requirement and a thread can sometimes take two or three lines. Then you can split the file deterministically, based on a formula. Suppose you want to read the file in blocks of 1024 KB in total (this could be gigabytes too): you split each block across the cores, with prioritization. So:

  • #define BLOCK_SIZE (1024 * 1024)
  • #define REGULAR_THREAD_BLOCK_SIZE (BLOCK_SIZE / (2 * NUM_CORES)) // 64 KB for 8 cores
  • #define GPU_THREAD_BLOCK_SIZE (BLOCK_SIZE / 2)
  • Each core gets a 64 KB chunk:
    • Core 1: offset 0, size = REGULAR_THREAD_BLOCK_SIZE
    • Core 2: offset 65536, size = REGULAR_THREAD_BLOCK_SIZE
    • Core 3: offset 131072, size = REGULAR_THREAD_BLOCK_SIZE
    • Core n: offset ((n - 1) * REGULAR_THREAD_BLOCK_SIZE), size = REGULAR_THREAD_BLOCK_SIZE
  • The GPU gets 512 KB: offset = (NUM_CORES * REGULAR_THREAD_BLOCK_SIZE), size = GPU_THREAD_BLOCK_SIZE

So ideally the slices don't overlap. There are cases where they can, though: since you're reading a text file, a line might fall across the boundary into the next core's block. To avoid double-processing, every thread except the first skips its partial first line, and every thread completes its last line past its boundary, assuming the next thread will skip it anyway. Here is a sketch of the per-thread loop:

#include <stddef.h>
#include <stdlib.h>

char *allocLineBuffer(void);   // assumed malloc-based helper for a big-enough line buffer
void  processLine(char *line); // do line processing here

// buf:        the whole mapped block
// bufSize:    total size of buf, so the carryover cannot run off the end
// startOffset, blockSize: this thread's slice of the block
// coreNum:    0 for the first slice, which must not skip its first line
void threadProcess(const char *buf, size_t bufSize,
                   size_t startOffset, size_t blockSize, int coreNum)
{
    size_t offset = startOffset;
    size_t endOffset = startOffset + blockSize;

    if (coreNum > 0) {
        // skip the partial first line; the previous thread completes it
        while (offset < endOffset && buf[offset] != '\n') offset++;
        if (offset < endOffset) offset++;       // step past the '\n' itself
    }
    if (offset >= endOffset) return;            // nothing left to process

    char *currentLine = allocLineBuffer();      // opening door to security exploits :)
    size_t strPos = 0;

    // process every line that starts inside this slice
    while (offset < endOffset) {
        if (buf[offset] == '\n') {
            currentLine[strPos] = '\0';
            processLine(currentLine);           // do line processing here
            strPos = 0;                         // fresh start
        } else {
            currentLine[strPos] = buf[offset];
            strPos++;
        }
        offset++;
    }

    // complete the carryover line past endOffset; the next thread skips it
    while (offset < bufSize && buf[offset] != '\n') {
        currentLine[strPos++] = buf[offset++];
    }
    if (strPos > 0 || offset < bufSize) {       // avoid a spurious empty line at EOF
        currentLine[strPos] = '\0';
        processLine(currentLine);               // process the carryover line
    }
    free(currentLine);
}

As you can see, this parallelizes the processing of the read block, not the reads themselves. How do you parallelize the reads? The best, most awesome way would be memory mapping the whole block into memory. That gains the best I/O performance, as it's the lowest level.
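
A minimal sketch of that mapping, assuming a POSIX system and mmap (error handling omitted; input.txt and the hand-off to threadProcess are placeholders; with Boost, boost::iostreams::mapped_file_source gives a similar read-only view portably):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open("input.txt", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // Read-only shared mapping: any thread may safely read any offset.
    const char *buf = static_cast<const char *>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0));

    // ... spawn the CPU/GPU threads, giving each (buf, st.st_size) plus
    // its own startOffset/blockSize slice for threadProcess() ...

    munmap(const_cast<char *>(buf), st.st_size);
    close(fd);
    return 0;
}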

ssg
+1 Yes, I think this is going toward the best solution. My only concern is what happens when the GPU finishes before the CPU threads; it could then 'steal' the remaining segments from the other threads. Nice macros.
fabrizioM
In this case you can do two things: 1) change the size of the GPU's slice (instead of half, make it a quarter, etc.); 2) change the total block size you map at once. Both would change the number of lines the GPU processes in one shot.
ssg
+1 Well-written answer, and I agree with this approach; locking would be completely unnecessary in this instance.
SpaceghostAli