We need to read a 10 GB text file, e.g. a FIX engine log, count the different types of messages, and run some statistics on it. We are on 32-bit Linux with 4 Intel CPUs, coding in Perl, but the language doesn't really matter.

I found some interesting tips in Tim Bray's WideFinder project. However, memory mapping is inherently limited by the 32-bit architecture.

We tried using multiple processes, which seems to work faster: we process the file in parallel using 4 processes on 4 CPUs. Adding multithreading slows it down, maybe because of the cost of context switching. We tried changing the size of the thread pool, but that is still slower than the simple multi-process version.
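
Roughly the shape of the multi-process split we tried, as a simplified sketch (the file name, the 35=8 tally, and printing from each worker are placeholders, not our real code):

    use strict;
    use warnings;

    my $file     = 'fix-engine.log';            # placeholder path
    my $nworkers = 4;                           # one per CPU
    my $size     = -s $file or die "cannot stat $file: $!";
    my $chunk    = int($size / $nworkers) + 1;

    my @pids;
    for my $i (0 .. $nworkers - 1) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                        # child scans one byte range
            my ($start, $end) = ($i * $chunk, ($i + 1) * $chunk);
            open my $fh, '<', $file or die "open: $!";
            if ($start > 0) {                   # resync to the next full line
                seek $fh, $start - 1, 0 or die "seek: $!";
                <$fh>;
            }
            my $hits = 0;
            while (tell($fh) < $end) {
                my $line = <$fh>;
                last unless defined $line;
                $hits++ if $line =~ /35=8/;     # e.g. count execution reports
            }
            print "worker $i: $hits\n";         # real code would merge results
            exit 0;
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;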

The memory mapping part is not very stable: sometimes it takes 80 seconds and sometimes 7 seconds on a 2 GB file, maybe because of page faults or something related to virtual memory usage. In any case, mmap cannot scale beyond 4 GB on a 32-bit architecture.

We tried Perl's IPC::Mmap and Sys::Mmap. We also looked into MapReduce, but the problem is really I/O bound; the processing itself is sufficiently fast.

So we decided to try optimizing the basic I/O by tuning the buffer size, buffering type, etc.
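
For example, one of the variants we are timing reads the file in large fixed-size blocks with sysread and splits lines ourselves; the block size is the knob being tuned, and the path and the tag-35 tally are just placeholders:

    use strict;
    use warnings;

    use constant BLOCK_SIZE => 4 * 1024 * 1024;             # 4 MB, the value under test

    open my $fh, '<', 'fix-engine.log' or die "open: $!";   # placeholder path
    binmode $fh;

    my ($buf, $tail, %count) = ('', '');
    while (sysread($fh, $buf, BLOCK_SIZE)) {
        $buf  = $tail . $buf;
        $buf  =~ s/([^\n]*)\z//;                            # strip the trailing partial line
        $tail = $1;                                         # and keep it for the next block
        $count{$1}++ while $buf =~ /\x0135=(\w+)\x01/g;     # tally FIX MsgType (tag 35)
    }
    $count{$1}++ while $tail =~ /\x0135=(\w+)\x01/g;        # whatever is left at EOF
    printf "%-4s %d\n", $_, $count{$_} for sort keys %count;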

Is anyone aware of an existing project where this problem was solved efficiently, in any language or on any platform? Can you point to a useful link or suggest a direction?

A: 

I seem to recall a project in which we were reading big files. Our implementation was multithreaded: basically, n worker threads started at incrementing offsets of the file (0, chunk_size, 2 * chunk_size, ..., (n-1) * chunk_size) and each read smaller chunks of information. I can't exactly recall our reasoning for this, as someone else designed the whole thing; the workers weren't the only part of it, but that's roughly how we did it.
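
If it helps, the layout was roughly like the sketch below (Perl ithreads just to illustrate; the chunk math, the file name, and the per-line match are my guesses, not the original code):

    use strict;
    use warnings;
    use threads;

    my $file     = 'big.log';                   # placeholder
    my $nworkers = 4;
    my $size     = -s $file or die "cannot stat $file";
    my $chunk    = int($size / $nworkers) + 1;

    sub worker {
        my ($start, $end) = @_;
        open my $fh, '<', $file or die "open: $!";
        if ($start > 0) {                       # resync to the start of a line
            seek $fh, $start - 1, 0 or die "seek: $!";
            <$fh>;
        }
        my $hits = 0;
        while (tell($fh) < $end) {
            my $line = <$fh>;
            last unless defined $line;
            $hits++ if $line =~ /REJECT/;       # placeholder per-line work
        }
        return $hits;
    }

    my @workers;
    for my $i (0 .. $nworkers - 1) {
        my $thr = threads->create(\&worker, $i * $chunk, ($i + 1) * $chunk);
        push @workers, $thr;
    }
    my $total = 0;
    $total += $_->join() for @workers;
    print "total: $total\n";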

Hope it helps

Maciek
+2  A: 

Have you thought of streaming the file and filtering the interesting results out to a secondary file? (Repeat until you have a file of manageable size.)
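
For instance, something like this (the filter pattern and file names are placeholders):

    use strict;
    use warnings;

    # Pass 1: stream the big log once and keep only the interesting lines
    # in a much smaller secondary file; repeat with tighter filters if needed.
    open my $in,  '<', 'huge.log'        or die "open input: $!";
    open my $out, '>', 'interesting.log' or die "open output: $!";
    while (my $line = <$in>) {
        print {$out} $line if $line =~ /35=8/;      # placeholder filter
    }
    close $out or die "close: $!";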

Paul Nathan
+3  A: 

Perhaps you've already read this forum thread, but if not:

http://www.perlmonks.org/?node%5Fid=512221

It describes using Perl to do it line-by-line, and the users seem to think Perl is quite capable of it.
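
That is, the plain line-by-line pattern, roughly like this (the path and the tag-35 counting are assumptions about your log format):

    use strict;
    use warnings;

    my %count;
    open my $fh, '<', 'fix-engine.log' or die "open: $!";
    while (my $line = <$fh>) {
        # tally FIX message types by tag 35; adjust the pattern to your format
        $count{$1}++ if $line =~ /\x0135=(\w+)\x01/;
    }
    print "$_\t$count{$_}\n" for sort keys %count;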

Oh, is it possible to process the file from a RAID array? If you have several mirrored disks, the read speed can be improved. Competition for disk resources may be why your multi-threaded attempt doesn't work.

Best of luck.

Jon
+3  A: 

I wish I knew more about the content of your file, but knowing nothing other than that it is text, this sounds like an excellent MapReduce kind of problem.

PS: the fastest read of any file is a linear read. cat file > /dev/null should show the maximum speed at which the file can be read.

Autocracy
Indeed; my colleague working on a similar problem was using the timing from cat to track down other issues in file read speeds. NFS was an awful time suck. :(
brian d foy
+1  A: 

Basically you need to divide and conquer: if you have a network of computers, copy the 10 GB file to as many client PCs as possible and have each client PC read from a different offset of the file. For an added bonus, get each PC to use multithreading in addition to the distributed reading.

Darknight
"the problem is really IO bound" <--- good luck copying the file to a machine faster than the disks can read it.
derobert
+1  A: 

Parse the file once, reading line by line. Put the results in a table in a decent database. Run as many queries as you wish. Feed the beast regularly with new incoming data.

Realize that manipulating a 10 GB file, transferring it across the (even local) network, and exploring complicated solutions all take time.
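
A rough sketch of that pipeline with DBI and SQLite (the table layout, file name, and parsing are assumptions; any decent database will do):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=fixlog.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS messages (msg_type TEXT, sending_time TEXT)');

    my $ins = $dbh->prepare('INSERT INTO messages (msg_type, sending_time) VALUES (?, ?)');
    open my $fh, '<', 'fix-engine.log' or die "open: $!";   # placeholder path
    $dbh->begin_work;                                       # load in one big transaction
    while (my $line = <$fh>) {
        my ($type) = $line =~ /\x0135=(\w+)\x01/;           # placeholder parsing
        my ($time) = $line =~ /\x0152=([\d:.-]+)\x01/;      # tag 52 = SendingTime
        $ins->execute($type, $time) if defined $type;
    }
    $dbh->commit;

    # then query as often as you like
    my ($n) = $dbh->selectrow_array(
        q{SELECT COUNT(*) FROM messages WHERE msg_type = '8'});
    print "execution reports: $n\n";
    $dbh->disconnect;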

Sinan Ünür
Feeding a database and running queries can take an order of magnitude more time than doing all the processing in Perl. (That's from my experience, even if you use bulk loading and MySQL, which is one of the fastest approaches you can use.)
Hynek -Pichi- Vychodil
Once you have the data in a *decent* database, you can run as many queries as you want (even ones you did not know you would want to run) with little extra cost.
Sinan Ünür
+1  A: 

I have a co-worker who sped up his FIX reading by going to 64-bit Linux. If it's something worthwhile, drop a little cash to get some fancier hardware.

brian d foy
+4  A: 

This all depends on what kind of preprocessing you can do, and when. On some of the systems we have, we gzip such large text files, reducing them to 1/5 to 1/7 of their original size. Part of what makes this possible is that we don't need to process these files until hours after they're created, and at creation time we don't really have any other load on the machines.

Processing them is done more or less in the fashion of zcat thosefiles | ourprocessing (well, it's actually done over Unix sockets with a custom-made zcat). It trades CPU time for disk I/O time, and for our system that has been well worth it. There are of course a lot of variables that can make this a very poor design for a particular system.
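
In Perl, the plain-zcat version of that boils down to reading from a pipe open, something like this (the file name and the per-line work are placeholders; as said, our real setup uses sockets and a custom zcat):

    use strict;
    use warnings;

    # Read the compressed log through zcat instead of the raw file:
    # trades CPU time (decompression) for disk I/O.
    open my $fh, '-|', 'zcat', 'huge.log.gz' or die "cannot start zcat: $!";
    my $lines = 0;
    while (my $line = <$fh>) {
        $lines++;                               # placeholder processing
    }
    close $fh or die "zcat failed: $? $!";
    print "$lines lines\n";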

nos
+6  A: 

Most of the time you will be I/O bound, not CPU bound, so just read this file through normal Perl I/O and process it in a single thread. Unless you can prove that the I/O delivers data faster than a single CPU can process it, don't waste your time with anything more. Anyway, you should ask: why on Earth is this all in one huge file? Why on Earth don't they split it in a reasonable way when they generate it? That would be far more worthwhile work. Then you could put the pieces on separate I/O channels and use more CPUs (if you don't use some sort of RAID 0 or NAS or ...).

Measure, don't assume. Don't forget to flush the caches before each test. Remember that sequential I/O is an order of magnitude faster than random I/O.

Hynek -Pichi- Vychodil
+1  A: 

Hmmm, but what's wrong with the read() call in C? It usually has a 2 GB limit per call, so just call it five times in sequence. That should be fairly fast.

+1  A: 

If you are I/O bound and your file is on a single disk, then there isn't much to do. A straightforward single-threaded linear scan across the whole file is the fastest way to get the data off the disk. Using large buffer sizes might help a bit.

If you can convince the writer of the file to stripe it across multiple disks / machines, then you could think about multithreading the reader (one thread per read head, each thread reading the data from a single stripe).

Keith Randall
A: 

The problem doesn't state whether sequence actually matters or not. So divide the file into equal parts, say 1 GB each; since you are using multiple CPUs, multiple threads won't be a problem, so read each part with a separate thread. If you have more than 10 GB of RAM, all of the contents can be held in memory while the threads read them.

pokrate