We need to read a 10 GB text file (e.g. a FIX engine log), count the different message types, and run some statistics on it. We're on 32-bit Linux with 4 Intel CPUs, coding in Perl, but the language doesn't really matter.
I found some interesting tips in Tim Bray's WideFinder project. However, we've found that memory mapping is inherently limited by the 32-bit address space.
We tried using multiple processes, which is faster if we process the file in parallel with 4 processes on 4 CPUs. Adding multithreading on top slows it down, maybe because of context-switching overhead. We tried changing the thread pool size, but that is still slower than the simple multi-process version.
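For reference, the multi-process approach we use looks roughly like the sketch below: split the file into byte ranges, fork one worker per CPU, have each worker align itself to the next line boundary and count matching lines, then sum the per-worker counts in the parent over pipes. The `35=8` pattern (a FIX execution report tag) is just an illustrative assumption; substitute whatever message type you're counting.

```perl
#!/usr/bin/perl
# Sketch: scan a large log in parallel, one forked worker per CPU.
# Each worker handles one byte range of the file; a line that straddles
# two ranges is counted by the worker whose range it starts in.
use strict;
use warnings;

sub parallel_count {
    my ($path, $pattern, $nworkers) = @_;
    my $size = -s $path or die "empty or missing file: $path";
    my $chunk = int($size / $nworkers) + 1;
    my @readers;

    for my $i (0 .. $nworkers - 1) {
        pipe(my $r, my $w) or die "pipe: $!";
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {                       # child: scan one byte range
            close $r;
            open my $fh, '<', $path or die "open $path: $!";
            my ($start, $end) = ($i * $chunk, ($i + 1) * $chunk);
            seek $fh, $start, 0;
            <$fh> if $start > 0;               # skip the partial first line;
                                               # the previous worker counts it
            my $count = 0;
            while (<$fh>) {
                $count++ if index($_, $pattern) >= 0;
                last if tell($fh) >= $end;     # stop once past our range
            }
            print {$w} "$count\n";
            exit 0;
        }
        close $w;                              # parent keeps the read end
        push @readers, $r;
    }

    my $total = 0;
    $total += <$_> for @readers;               # collect per-worker counts
    wait() for 1 .. $nworkers;
    return $total;
}
```

On a box like ours, `parallel_count($logfile, '35=8', 4)` would use one worker per CPU; counting in the children and summing in the parent keeps the inter-process traffic to a few bytes per worker.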
The memory-mapping part is not very stable: on a 2 GB file it sometimes takes 80 seconds and sometimes 7, perhaps due to page faults or something related to virtual memory usage. In any case, mmap cannot scale beyond 4 GB on a 32-bit architecture.
We tried Perl's IPC::Mmap and Sys::Mmap. We also looked into MapReduce, but the problem is really I/O-bound; the processing itself is fast enough.
So we decided to try optimizing the basic I/O by tuning the buffer size, buffering type, etc.
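One variant of that tuning is to bypass stdio entirely and read in large blocks with `sysread`. A minimal sketch, where the 1 MB default block size is an assumption to benchmark against your disks (try powers of two from 64 KB up), and the overlap logic handles patterns that straddle a block boundary:

```perl
#!/usr/bin/perl
# Sketch: count occurrences of a substring using raw sysread with a
# large, tunable block size instead of line-by-line buffered reads.
use strict;
use warnings;

sub count_matches {
    my ($path, $pattern, $blksize) = @_;
    $blksize ||= 1 << 20;                     # default: 1 MB reads
    open my $fh, '<:raw', $path or die "open $path: $!";
    my ($count, $tail) = (0, '');
    my $keep = length($pattern) - 1;          # carry this many bytes so a
                                              # match split across two
                                              # reads is still found
    while (sysread($fh, my $buf, $blksize)) {
        my $data = $tail . $buf;
        my $pos = 0;
        while ((my $i = index($data, $pattern, $pos)) >= 0) {
            $count++;
            $pos = $i + 1;
        }
        $tail = $keep > 0 ? substr($data, -$keep) : '';
    }
    close $fh;
    return $count;
}
```

Because the carried tail is one byte shorter than the pattern, no match can be counted twice; timing `count_matches($logfile, '35=8', $blksize)` across block sizes is a simple way to find the sweet spot for a given disk and filesystem.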
Can anyone who knows of an existing project where this problem was solved efficiently, in any language or on any platform, point me to a useful link or suggest a direction?