You may not need to lock at all if "one line per thread" is not a strict requirement and a thread may occasionally process two or three lines. In that case you can split the file into equal parts by formula. Suppose you read the file one 1024 KB block at a time (the block could just as well be gigabytes): you divide each block among the cores, weighted by how much work each one should take. So:
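#define NUM_CORES 8 // assumed here; it is the value that makes each CPU chunk 64 KB
#define BLOCK_SIZE (1024 * 1024) // 1 MB
#define REGULAR_THREAD_BLOCK_SIZE (BLOCK_SIZE/(2 * NUM_CORES)) // 64 KB with 8 cores
#define GPU_THREAD_BLOCK_SIZE (BLOCK_SIZE/2) // 512 KB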
- Each core gets a 64 KB chunk
- Core 1: offset 0, size = REGULAR_THREAD_BLOCK_SIZE
- Core 2: offset 65536, size = REGULAR_THREAD_BLOCK_SIZE
- Core 3: offset 131072, size = REGULAR_THREAD_BLOCK_SIZE
- Core n: offset ((n - 1) * REGULAR_THREAD_BLOCK_SIZE), size = REGULAR_THREAD_BLOCK_SIZE
- GPU gets 512 KB: offset (NUM_CORES * REGULAR_THREAD_BLOCK_SIZE), size = GPU_THREAD_BLOCK_SIZE
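The layout above can be computed rather than written out by hand. The following is a minimal sketch, not a definitive implementation: the `Chunk` type and the `layoutChunks` function are hypothetical names introduced for illustration, and the GPU slice is given whatever remains of the block, which equals half the block whenever the division is exact.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE (1024 * 1024)

/* Hypothetical helper type: one worker's slice of the block. */
typedef struct {
    size_t offset;
    size_t size;
} Chunk;

/* Fills chunks[0 .. numCores-1] with the CPU slices and chunks[numCores]
 * with the GPU slice. The caller provides room for numCores + 1 entries. */
void layoutChunks(Chunk *chunks, size_t numCores)
{
    assert(chunks != NULL && numCores > 0);
    size_t regular = BLOCK_SIZE / (2 * numCores);

    for (size_t i = 0; i < numCores; i++) {
        chunks[i].offset = i * regular;   /* zero-based: "Core 1" is i == 0 */
        chunks[i].size = regular;
    }

    /* The GPU takes the rest of the block, so no bytes are left uncovered
     * even if the first half does not divide evenly among the cores. */
    chunks[numCores].offset = numCores * regular;
    chunks[numCores].size = BLOCK_SIZE - numCores * regular;
}
```

With 8 cores, `layoutChunks(chunks, 8)` reproduces the offsets listed above: 0, 65536, 131072, and so on, with the GPU slice starting at 524288.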
So ideally the chunks don't overlap. In practice they can, though: since you're reading a text file, a line may start in one core's chunk and end in the next core's chunk. To handle this, every thread except the first skips its partial first line, and every thread finishes its last line even when it runs past the end of its chunk, on the assumption that the next thread skips it anyway. Here is the idea in code:
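// Needs <stddef.h>. processLine() and allocLineBuffer() are supplied by the caller.
// LINE_MAX_LEN is a hypothetical constant: the capacity of the buffer that
// allocLineBuffer() returns. bufLen (total bytes in buf) and coreNum are passed explicitly.
void threadProcess(const char *buf, size_t bufLen, size_t startOffset, size_t blockSize, int coreNum)
{
    size_t offset = startOffset;
    size_t endOffset = startOffset + blockSize;
    if (endOffset > bufLen) endOffset = bufLen; // never read past the data
    if (coreNum > 0) {
        // skip the partial first line; the previous thread completes it
        while (offset < endOffset && buf[offset] != '\n') offset++;
        if (offset >= endOffset) return; // no line starts in this block
        offset++; // step past the newline itself
    }
    char *currentLine = allocLineBuffer(); // must hold LINE_MAX_LEN bytes
    size_t strPos = 0;
    while (offset < endOffset) {
        if (buf[offset] == '\n') {
            currentLine[strPos] = 0;
            processLine(currentLine); // do line processing here
            strPos = 0; // fresh start
        } else if (strPos < LINE_MAX_LEN - 1) { // over-long lines are truncated
            currentLine[strPos++] = buf[offset];
        }
        offset++;
    }
    // finish the line that runs past the block; strPos is NOT reset,
    // so the part already collected inside the block is kept
    while (offset < bufLen && buf[offset] != '\n') {
        if (strPos < LINE_MAX_LEN - 1) currentLine[strPos++] = buf[offset];
        offset++;
    }
    if (offset < bufLen || strPos > 0) { // skip the empty tail at end of data
        currentLine[strPos] = 0;
        processLine(currentLine); // process the carryover line
    }
}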
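A quick way to convince yourself that the boundary rule neither drops nor duplicates a line is to count the lines each chunk owns and check that the counts add up to the real total for every chunk size. The sketch below does this without threads, since each chunk is handled independently anyway. `countOwnedLines` and `countAllLines` are hypothetical names for illustration, and the rule assumed is the one described above: a chunk owns every line that starts after its first newline and at or before its end offset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts the lines owned by the chunk [start, start + size) of buf. */
size_t countOwnedLines(const char *buf, size_t len, size_t start, size_t size)
{
    assert(buf != NULL && start <= len);
    size_t end = start + size;
    if (end > len) end = len;

    size_t pos = start;
    if (start > 0) {
        /* Skip the partial first line; the previous chunk finishes it. */
        while (pos < end && buf[pos] != '\n') pos++;
        if (pos >= end) return 0;   /* no line starts inside this chunk */
        pos++;                      /* step over the newline */
    }

    size_t lines = 0;
    /* pos is always a line start here; a line is owned if it starts at
     * or before `end`, even when it finishes in a later chunk. */
    while (pos <= end && pos < len) {
        const char *nl = memchr(buf + pos, '\n', len - pos);
        lines++;
        if (nl == NULL) break;      /* unterminated last line */
        pos = (size_t)(nl - buf) + 1;
    }
    return lines;
}

/* Splits buf into chunks of chunkSize bytes and sums the per-chunk counts. */
size_t countAllLines(const char *buf, size_t len, size_t chunkSize)
{
    assert(chunkSize > 0);
    size_t total = 0;
    for (size_t start = 0; start < len; start += chunkSize)
        total += countOwnedLines(buf, len, start, chunkSize);
    return total;
}
```

For a five-line text, `countAllLines` should return 5 whether the chunk size is 1 byte or larger than the whole buffer. If the carryover or skip logic were off by one, some chunk size would produce a different total.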
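As you can see, this parallelizes the processing of the block that was read, not the reads themselves. How do you parallelize the reads? Probably the best way is to memory-map the whole block. The OS then pages the file in on demand and every thread reads its own range directly, with no explicit read calls and no extra copy into a user buffer. That usually gives very good I/O performance, although it is worth measuring on your own workload.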
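On a POSIX system, memory mapping could look roughly like the sketch below. It assumes `mmap` is available (Windows would need `CreateFileMapping` and `MapViewOfFile` instead), and `mapFile` and `unmapFile` are hypothetical helper names, not part of any library.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Maps the whole file read-only. Returns NULL on failure or if the file
 * is empty; on success *lenOut receives the file size in bytes. */
const char *mapFile(const char *path, size_t *lenOut)
{
    assert(path != NULL && lenOut != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) return NULL;

    *lenOut = (size_t)st.st_size;
    return (const char *)p;
}

/* Releases a mapping obtained from mapFile(). */
void unmapFile(const char *data, size_t len)
{
    munmap((void *)data, len);
}
```

Each worker thread can then be handed the returned pointer together with its own offset and size. Since the mapping is read-only, no locking is needed between the threads.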