views: 114

answers: 3

I am tackling the challenge of using the capabilities of both an 8-core machine and a high-end GPU (Tesla 10).

I have one big input file, one thread per core, plus one thread that handles the GPU. To be efficient, the GPU thread needs a large batch of lines from the input, while a CPU thread needs only one line to proceed (storing multiple lines in a temporary buffer was slower). The file doesn't need to be read sequentially. I am using Boost.

My strategy is to have a mutex on the input stream, with each thread locking and unlocking it. This is not optimal, because the GPU thread should have higher priority when acquiring the mutex, being the fastest and most demanding one.

I can come up with different solutions, but before rushing into an implementation I would like some guidelines.

What approach do you use / recommend?

A: 

I would use a buffer. Have a single reader thread fill that buffer from disk. Each worker thread locks the buffer, copies data into its own local buffer, then releases the mutex before processing the data.
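
A minimal sketch of that scheme, assuming std::thread and the standard mutex/condition variable (Boost's equivalents are near drop-in replacements); LineQueue, reader, and worker are illustrative names, input.txt is a placeholder path, and processLine stands in for the per-line work:

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <fstream>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

void processLine(const std::string &) { /* your per-line work goes here */ }

struct LineQueue {
    std::mutex m;
    std::condition_variable cv;
    std::deque<std::string> lines;
    bool done = false;
};

// The single thread that fills the shared buffer from disk.
void reader(LineQueue &q, const std::string &path)
{
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        { std::lock_guard<std::mutex> lock(q.m); q.lines.push_back(std::move(line)); }
        q.cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(q.m); q.done = true; }
    q.cv.notify_all();
}

// Each consumer holds the lock only long enough to move up to `batch`
// lines into its own buffer, then processes them outside the lock.
void worker(LineQueue &q, std::size_t batch)
{
    std::vector<std::string> local;
    for (;;) {
        {
            std::unique_lock<std::mutex> lock(q.m);
            q.cv.wait(lock, [&] { return !q.lines.empty() || q.done; });
            while (!q.lines.empty() && local.size() < batch) {
                local.push_back(std::move(q.lines.front()));
                q.lines.pop_front();
            }
        }
        if (local.empty()) return;              // drained and reader finished
        for (const auto &l : local) processLine(l);
        local.clear();
    }
}

int main()
{
    LineQueue q;
    std::thread rd(reader, std::ref(q), std::string("input.txt"));
    std::vector<std::thread> cpu;
    for (int i = 0; i < 8; ++i) cpu.emplace_back(worker, std::ref(q), 1); // 1 line each
    std::thread gpu(worker, std::ref(q), 4096); // the GPU thread takes big batches
    rd.join();
    for (auto &t : cpu) t.join();
    gpu.join();
    return 0;
}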

Menox
+1  A: 

Some ideas:

1) Since the bottleneck is not the I/O, the file should be kept almost entirely in RAM for easier access.

2) The implementation should not allow threads to block. It's better to have a slightly suboptimal solution if that reduces blocking.

Assuming we have a big data file, threads can employ shot-in-the-dark tactics. Once a thread acquires the lock, it simply increments fpos by as many bytes as it wants and unlocks; it has thereby granted itself the privilege to process that part of the memory. For example, the thread could process all the lines whose beginnings lie in that fragment (see the sketch at the end of this answer).

Outcomes:

1) It's almost impossible for a thread to block. Lock hold times are very short (in the range of several instructions, plus the time to flush caches).

2) Flexibility: a thread can take as much data as it wants to.

Of course, there should be some mechanism to adapt to the line lengths in the data file, to avoid the worst-case scenario.
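
For illustration, here is a sketch of the claiming step under stated assumptions: the file is memory-mapped read-only into buf, an atomic fetch-and-add plays the role of the lock/increment/unlock sequence, and claimChunk, processClaim, and processLine are hypothetical names:

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <string_view>

std::atomic<std::size_t> fpos{0};               // the shared file position

// The entire critical section: grab `chunk` bytes and advance the cursor.
bool claimChunk(std::size_t fileSize, std::size_t chunk,
                std::size_t &begin, std::size_t &end)
{
    begin = fpos.fetch_add(chunk);
    if (begin >= fileSize) return false;        // file exhausted
    end = std::min(begin + chunk, fileSize);
    return true;
}

void processLine(std::string_view line);        // your per-line work

// A thread owns exactly the lines that begin inside [begin, end):
// it skips a partial first line and finishes its last line past `end`.
void processClaim(const char *buf, std::size_t fileSize,
                  std::size_t begin, std::size_t end)
{
    std::size_t p = begin;
    if (p != 0)
        while (p < end && buf[p - 1] != '\n') p++;   // skip the partial first line
    while (p < end) {
        std::size_t q = p;
        while (q < fileSize && buf[q] != '\n') q++;  // may legitimately run past `end`
        processLine(std::string_view(buf + p, q - p));
        p = q + 1;
    }
}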

buratinas
You are suggesting mapping the entire file into memory and then having each thread take a segment of that memory depending on its needs. This is interesting because it's like @Menox's buffer idea, but you don't lock while processing the data. The only dark part is when a line is not entirely inside one single segment; I guess a thread can read over into the next segment to complete its last line.
fabrizioM
Yes. Actually this is not a big problem, because threads can safely read from anywhere in the file as it's read-only. The only issue is ownership. One mechanism I can think of is that a thread owns a line if it finds the preceding newline character in its own segment. The first line of the file would, of course, be an exception then.
buratinas
+2  A: 

You may not need to lock at all if "1 line per thread" is not a strict requirement and a thread can sometimes take two or three lines. Then you can split the file deterministically, based on a formula. Suppose you want to read the file in blocks of 1024 KB in total (this could be gigabytes too): you split each block across the cores, with prioritization. So:

  • #define BLOCK_SIZE (1024 * 1024)
  • #define REGULAR_THREAD_BLOCK_SIZE (BLOCK_SIZE / (2 * NUM_CORES)) // 64 KB for 8 cores
  • #define GPU_THREAD_BLOCK_SIZE (BLOCK_SIZE / 2)
  • Each core gets a 64 KB chunk:
    • Core 1: offset 0, size = REGULAR_THREAD_BLOCK_SIZE
    • Core 2: offset 65536, size = REGULAR_THREAD_BLOCK_SIZE
    • Core 3: offset 131072, size = REGULAR_THREAD_BLOCK_SIZE
    • Core n: offset ((n - 1) * REGULAR_THREAD_BLOCK_SIZE), size = REGULAR_THREAD_BLOCK_SIZE
  • The GPU gets 512 KB: offset = (NUM_CORES * REGULAR_THREAD_BLOCK_SIZE), size = GPU_THREAD_BLOCK_SIZE

So ideally the slices don't overlap. There are cases where they can, though: since you're reading a text file, a line might fall across the boundary into the next core's block. To avoid double-processing, every thread except the first skips its partial first line, and every thread completes its last line past its boundary, assuming the next thread will skip it anyway. Here is a sketch of the per-thread loop:

#include <stddef.h>
#include <stdlib.h>

char *allocLineBuffer(void);   // assumed malloc-based helper for a big-enough line buffer
void  processLine(char *line); // do line processing here

// buf:        the whole mapped block
// bufSize:    total size of buf, so the carryover cannot run off the end
// startOffset, blockSize: this thread's slice of the block
// coreNum:    0 for the first slice, which must not skip its first line
void threadProcess(const char *buf, size_t bufSize,
                   size_t startOffset, size_t blockSize, int coreNum)
{
    size_t offset = startOffset;
    size_t endOffset = startOffset + blockSize;

    if (coreNum > 0) {
        // skip the partial first line; the previous thread completes it
        while (offset < endOffset && buf[offset] != '\n') offset++;
        if (offset < endOffset) offset++;       // step past the '\n' itself
    }
    if (offset >= endOffset) return;            // nothing left to process

    char *currentLine = allocLineBuffer();      // opening door to security exploits :)
    size_t strPos = 0;

    // process every line that starts inside this slice
    while (offset < endOffset) {
        if (buf[offset] == '\n') {
            currentLine[strPos] = '\0';
            processLine(currentLine);           // do line processing here
            strPos = 0;                         // fresh start
        } else {
            currentLine[strPos] = buf[offset];
            strPos++;
        }
        offset++;
    }

    // complete the carryover line past endOffset; the next thread skips it
    while (offset < bufSize && buf[offset] != '\n') {
        currentLine[strPos++] = buf[offset++];
    }
    if (strPos > 0 || offset < bufSize) {       // avoid a spurious empty line at EOF
        currentLine[strPos] = '\0';
        processLine(currentLine);               // process the carryover line
    }
    free(currentLine);
}

As you can see, this parallelizes the processing of the read block, not the reads themselves. How do you parallelize the reads? The best, most awesome way would be memory mapping the whole block into memory. That gains the best I/O performance, as it's the lowest level.
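
A minimal sketch of that mapping, assuming a POSIX system and mmap (error handling omitted; input.txt and the hand-off to threadProcess are placeholders; with Boost, boost::iostreams::mapped_file_source gives a similar read-only view portably):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open("input.txt", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // Read-only shared mapping: any thread may safely read any offset.
    const char *buf = static_cast<const char *>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0));

    // ... spawn the CPU/GPU threads, giving each (buf, st.st_size) plus
    // its own startOffset/blockSize slice for threadProcess() ...

    munmap(const_cast<char *>(buf), st.st_size);
    close(fd);
    return 0;
}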

ssg
+1 Yes, I think this is going toward the best solution. My only concern is what happens when the GPU finishes before the CPU threads; it could then 'steal' the remaining segments from the other threads. Nice macros.
fabrizioM
In this case you can do two things: 1) change the size of the GPU's slice (instead of half, make it a quarter, etc.); 2) change the total block size you map at once. Both would change the number of lines the GPU processes in one shot.
ssg
+1 Well-written answer, and I agree with this approach; locking would be completely unnecessary in this instance.
SpaceghostAli