views:

439

answers:

5

I have the following problem situation. A bunch of data is split across 10k small files (approx. 8-16 KiB each). Depending on user input, I have to load these as fast as possible and process them. More precisely, each data packet is split across 100-100k files, and there are approximately 1k data packets. Most of them are the smaller ones, though.

Right now, I'm using a thread pool: on each file access, the next free thread opens the file, reads it, and returns the data prepared for display. As the number of files is going to grow in the future, I'm not happy with this approach, especially since it's likely to end up at around 100k or more files (deploying this would surely be fun ;) ).

So, the idea is to combine all of these tiny files for one data packet into one large file, and read from that. I can guarantee that it's going to be read-only, but I don't know up front how many threads will be accessing one file concurrently (I do know the maximum number). This would give me around 1000 good-sized files, and I can easily add new data packets.

The question is: How can I allow 1..N threads to read efficiently from a single file in this scenario? I can use asynchronous I/O on Windows, but it's supposed to become synchronous for reads smaller than 64k. Memory mapping the file is not an option, as the expected size is > 1.6 GiB, and I still need to be able to run on x86 (unless I can efficiently map some tiny part, read it, and unmap it again -- my experience with memory mapping was that it brings quite some overhead compared to a single read).

I thought about opening each of the data packets N times and giving each thread a handle in round-robin fashion, but the problem is that this can end up with (number of data files) × (maximum number of threads) open handles (which can easily become 8-16k), and I would have to synchronize on each access to a data packet, or use some lock-free magic, to get the next free file handle.

This does not seem to be an original problem (I guess any database engine has a similar one: you can have M tables (data packets) with N rows (files, in my case), and you want to allow as many threads as possible to read rows concurrently). So what's the recommended practice here? BTW, it should run on Windows and Linux, so portable approaches are welcome (or at least approaches which work on both platforms, even if they use different underlying APIs -- as long as they can be wrapped, I'm happy).

[EDIT] This is not about speed, it's about hiding latency. That is, I read maybe 100 of those tiny files per second, so I'm at 1 MiB/s at most. My main concern is the seek times (as my access pattern is not predictable), and I want to hide them by firing off the read while displaying the old data to the user. The question is how to allow several threads to issue I/O requests over several files, with possibly more than one thread accessing a single file.

It's really no problem if one of the calls takes 70 ms or so to finish, but I can't afford if the read call blocks.

+2  A: 

I don't think that multi-threading will help you very much with the disk reads. Assuming the file is on one disk platter, you have only one set of read heads to access it with, so you are serialised right there.

In this situation I think I would have one disk-read process that reads the file sequentially into buffers (this would hopefully maximise read performance, as the read heads would not need to move about too much, assuming a fairly unfragmented data file) and a number of processing threads that read the buffers, marking them as free when they have completed the processing.

However you choose to proceed, can I suggest that you make sure your code is structured in such a way that the number of threads of each type is easily configurable, ideally from the executable's command line. In situations like this you will want to experiment with different thread configurations to find the optimal numbers for your specific situation.

anon
Yeah, but I don't want to block each time I access a file. The thing is, I can continue displaying the current data while I'm waiting for the disk seek time, and I want to reduce stalls as much as possible.
Anteru
Maybe more than one read thread, just in case the volume is RAID1.
Steve Jessop
A: 

The fastest way I can imagine to read a big chunk of data is to create a disk partition (either primary or logical, but no LVM), and read the partition device (e.g. /dev/sda5) directly and sequentially, without a filesystem, using only a single thread per disk. It is important to access the raw disk sequentially, to avoid disk seeks, which are much slower than sequential reads.

pts
A: 

The problem that's going to hurt you is head contention; it doesn't matter how many threads you have running, the head can only be in one position at a time. Do you have the option of distributing the file across multiple disks?

Charlie Martin
Possibly, but again, it's not about the read speed, it's just about hiding the latency efficiently.
Anteru
A: 

The mmap approach can come in handy. You don't need to do an mmap/unmap cycle for each read; instead, have one thread handle all the mappings and hand out pointers (actually an offset and a length). The actual reading will be scheduled by the OS when a thread accesses the virtual memory mapped to the file.

Just keep in mind that too many threads won't improve the reading speed. Database engines usually have a quite limited number of I/O threads that serve all the I/O needs of the application threads.

Steve Schnepp
mmap is fine, but the total size of the files is too large to mmap them on x86. I could have a file handle open, and mmap the part I need to read every time (under the assumption that two threads can concurrently `mmap()` on the same file), but I'm wondering whether this is any faster than reopening the file and issuing another `read`.
Anteru
You don't need to mmap the whole file at once. Just mmap the parts you need, but do one mmap to read 100 records at once, so you can amortize the mmap/unmap cost over 100 reads. BTW, an open/close cycle is also quite slow, and read usually involves a buffer copy operation. If your data is mostly read only once, as it seems, you can even think about the O_STREAM and/or O_DIRECT open flags.
Steve Schnepp
+1  A: 

Linux doesn't have usable asynchronous IO at all (yes, there's aio_*, but it only works on O_DIRECT and has all kinds of weird limitations) so if you want something portable you're just going to have to use normal read calls. mmap would work, but the cost of altering the mapping might be a bit high if you're only going to read a small amount each time.

Now, I don't know about Windows, but on Linux there's a pread() function that lets you read from a file descriptor at a given offset, without affecting the file descriptor's seek pointer. With this, you could have any number of threads reading from the same file without needing to lock the file descriptor or anything silly like that.

bdonlan