tags:

views: 289
answers: 6

I need to read a single file using multiple threads under Linux. There are only read operations, no writes. A read doesn't need the whole file each time; it needs one or more portions of the file, and I store the offset of each portion beforehand. The file is too large to fit in main memory.

So, for example, many users want to read such a file, and I use a thread or a process to read the file to answer each user request. What will happen under Linux? Will all the read operations be queued, with the OS completing them one by one? Is it possible to improve the performance of such operations?

I'm trying to implement a simple inverted index for information retrieval. I keep the dictionary in memory and the posting lists in files. Each file contains a segment of the index. In the dictionary, I store an offset pointing to the position of each word's posting list. When 100 users want to search something within one second, they submit different queries, so each read touches a different part of the file.

+3  A: 

Try to implement it in the simplest possible way to start with - let the OS deal with making it efficient by caching etc. See what the performance is like - it may well not turn out to be the bottleneck at all. OSes are generally good at this sort of thing :)

Assuming you are able to open the file multiple times for shared reading, I'd expect it to work fine, without all the read operations being queued.
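A minimal sketch of that simplest approach (Python used here for brevity; the file name, portion size, and contents are made up for the demo). Each request opens the file read-only and reads only its own portion; the OS handles the caching:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Demo file standing in for one index segment (path and sizes are made up).
path = os.path.join(tempfile.mkdtemp(), "postings.bin")
with open(path, "wb") as f:
    f.write(b"".join(bytes([i]) * 100 for i in range(10)))  # 10 portions of 100 bytes

def read_portion(offset, length):
    # Each request opens its own read-only handle; the kernel serves
    # the reads concurrently and shares cached pages between them.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda i: read_portion(i * 100, 100), range(10)))
```

Because every handle is independent and read-only, no application-level locking is needed.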

Jon Skeet
Happy 100k!!! :-)
Vinko Vrsalovic
+1  A: 

Operating systems are usually quite good at optimizing access to files (Linux is known for aggressive caching). But I think that reducing the number of reads is paramount for efficiency. Can you really not get away with a single shared data structure representing a piece of the file? That way a single thread reads, and every other thread benefits from the read. Since it is read-only, there shouldn't be any contention on the data structure except while it is being populated. This is of course not feasible if each thread will read a different part of the file each time.

Given that you can neither benefit (much) from caching nor share the read portion of the file, there isn't much to do but read the file and improve your disk subsystem: get fast disks with lots of throughput (RAID 10). If that is not enough, make two or more copies of the file on different physical drives to increase the throughput even more.

Vinko Vrsalovic
I'm trying to implement a simple inverted index for information retrieval. I keep the dictionary in memory and the posting lists in files. Each file contains a segment of the index. In the dictionary, I store an offset pointing to the position of each word's posting list. When 100 users want to search something, they submit different queries, so each read touches a different part of the file. You mentioned caching, but if the dictionary uses a large portion of memory, how will Linux cache the file, given that the file is large?
Stephen Hsu
+2  A: 

The threads can all safely read the file independently, yes. Ultimately the read operations will be queued at the OS level, so the driver serialises read requests to the disk. Depending on the access strategy (i.e. read buffer sizes), the reads should be interleaved. Unless you try to read the entire file in one request (which you shouldn't, since you said it is too big to fit in memory), the read requests will be serviced in approximately the order the threads issue them. (I say approximately, as the disk driver can reorder read requests in its queue to optimize disk access.) So what you describe should work fine. And the OS will fairly aggressively cache (and preload) as much as it can.
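To illustrate such independent, interleaved reads, here is a small Python sketch (file name and sizes invented) using `pread()`, which reads at an absolute offset without moving the shared file position:

```python
import os
import tempfile
import threading

path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 16)  # 4 KiB sample file

fd = os.open(path, os.O_RDONLY)      # one descriptor shared by every thread
results = {}

def worker(offset):
    # os.pread() reads at an absolute offset without touching the shared
    # file position, so the threads need no locking around their reads.
    results[offset] = os.pread(fd, 256, offset)

threads = [threading.Thread(target=worker, args=(o,)) for o in range(0, 4096, 256)]
for t in threads:
    t.start()
for t in threads:
    t.join()
os.close(fd)
```

Each thread issues its request independently; the kernel queues and may reorder them on the way to the disk, exactly as described above.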

As for improving the performance, there are many possibilities depending on the data and the algorithm used. Is it really necessary for each thread to read the entire file to service each request? Why read the same data over and over? Can't you centralise some of the information so threads can share the data read? It sounds like an expensive solution. And if you repeatedly read a file that's larger than RAM, recently cached blocks that have a good chance of being re-read may get pushed out of the cache. Perhaps an index of the file could save you some read time, and you could cache access based on that index? Also consider using mmap() to map the file into memory; the OS will then page blocks in and out as the threads read from different chunks. So it is worth rethinking how the data is accessed: just what you need, and when. If you post more info here, people may be able to offer more specific suggestions.

Remember, the most efficient operation is the one you don't perform!

gavinb
I said something wrong when I asked. The reads don't need the whole file every time; each read needs one or more portions of the file, and I store the offset of each portion beforehand.
Stephen Hsu
In that case, you could have a single class responsible for reading, and have the threads request chunks. It can maintain an index of offsets, and cache blocks recently read. But first, it would be worth doing some profiling to see just where you need to optimize. Even a simple `strace` can give you an idea of the read patterns.
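A hypothetical sketch of such a reader class in Python (the name `PostingReader`, the `term -> (offset, length)` index layout, and the cache size are all invented for illustration):

```python
from functools import lru_cache

class PostingReader:
    """Sketch: serves posting-list portions by term, keeping an
    offset index in memory and caching recently read portions."""

    def __init__(self, path, index):
        self.path = path
        self.index = index  # term -> (offset, length), built beforehand
        # Wrap the raw read in an LRU cache so repeated queries for the
        # same portion never hit the disk a second time.
        self._read = lru_cache(maxsize=1024)(self._read_uncached)

    def _read_uncached(self, offset, length):
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def postings(self, term):
        offset, length = self.index[term]
        return self._read(offset, length)
```

As the comment says, profiling first is the right call; a cache like this only pays off if the query distribution actually repeats terms.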
gavinb
+2  A: 

How big is your file that it won't all fit in memory?

It would be most efficient to punt to the o/s, and use mmap() to map the file into (virtual) memory, and then let the threads all access the file via memory. If you're on a 32-bit machine, that limits your file size to something under 4 GB (but probably well over 2 GB); if you're on a 64-bit machine, you aren't really limited except by disk space.

Note that the file need not all be in physical memory with mmap(); however, it will all be there logically.
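For example, a minimal mmap() sketch in Python (file path and contents are placeholders standing in for the large index file):

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "big.bin")
with open(path, "wb") as f:
    f.write(b"abcdefgh" * 512)  # 4 KiB stand-in for the large file

with open(path, "rb") as f:
    # Map the whole file read-only; pages are faulted in on demand, so only
    # the regions the threads actually touch occupy physical memory.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        chunk = mm[8:16]  # slicing reads straight from the mapping
```

Every thread can slice the same mapping at its own stored offset; the paging is handled entirely by the kernel.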

Jonathan Leffler
One file can be 2 GB, but what if there are 10 such files and, for example, only 2 GB of free physical memory remain?
Stephen Hsu
On a 32-bit machine, you're snookered. On a 64-bit machine, the o/s takes care of the paging of the files - the bits that are in use by a thread will be in memory and the bits that aren't will be on disk.
Jonathan Leffler
A: 

If the file is too large to fit into system memory and you have lots of threads that need to read the whole file, there is a good chance that your application is going to be limited by the disk I/O ... no matter how you read the file, and no matter how smart the OS is.

If this is unacceptable, then you'll need to come up with an alternative architecture for your application. For example, you might convert the file into a different form that allows the threads to fetch the information they need without reading the entire file. Or you might turn the application into separate processes running on separate machines, each with their own copy of the file. A third possibility would be to add a thread whose sole purpose is to read and buffer the file, and have the existing threads read from the buffers. (By having the worker threads all work on the same region of the file, you save the OS from having to read the same parts of the file from disk multiple times. If the application is really disk-bound, this may speed it up.)

However, all of this is conjecture. It is hard to give decent advice without more information on the application and the file it processes.

EDIT: based on your follow-up comments, it seems that the threads don't need to access all of the file after all. My first suggestion is moot (you are already doing that!), and my third suggestion won't help. I suggest that you do as @Jon Skeet says and implement the system in a simple way, then look for ways to make it faster/better if there are performance issues. For example:

  • Consider implementing an in-memory cache of recent queries and their results.
  • Consider using multiple machines, and partitioning the index file by keyword range so that each part will fit into memory on one machine.
  • If you support complex queries, consider splitting them into simple ones and sending the simple queries to different machines (e.g. based on keyword partitioning) then combining the result-sets.
  • Consider ways to avoid computing huge result-sets when the user only wants to look at the first few results.
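The first bullet can be as simple as memoizing the query function; a Python sketch, where `search()` stands in for whatever actually resolves a query against the posting-list files:

```python
from functools import lru_cache

lookups = []

def search(query):
    # Stand-in for the real lookup that hits the posting-list files.
    lookups.append(query)
    return "results for " + query

@lru_cache(maxsize=10_000)
def cached_search(query):
    return search(query)

cached_search("linux")
cached_search("linux")  # second identical query is served from the cache
```

With 100 queries per second, even a modest hit rate on popular terms removes a large share of the disk reads.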

I borrowed an interesting textbook on indexing from a colleague a couple of years ago. I think it was Managing Gigabytes by Witten et al.

Stephen C
Hi, I updated the question and explained my application.
Stephen Hsu
A: 

Points to be noted

  • There is a single disk driver with a single drive
  • There are multiple threads making multiple (random) accesses

In this case, since everything below your multiple threads is serial (from the driver layer down), the best you can do is:

  • Increase the priority of your process (if possible), so that other processes do not eat up CPU time
  • Use fair scheduling between the threads
  • Enable or disable caching based on the randomness of access
    • For example, disable the cache if the reads are completely random and you see a cache miss most of the time
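On Linux, that cache hint can be given with posix_fadvise(); a small Python sketch (file name and offsets are placeholders):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "index.bin")
with open(path, "wb") as f:
    f.write(b"\0" * 4096)

fd = os.open(path, os.O_RDONLY)
# Hint that the access pattern is random, so the kernel skips sequential
# readahead; POSIX_FADV_DONTNEED would instead drop already-cached pages.
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
data = os.pread(fd, 16, 1024)
os.close(fd)
```

The advice applies to the whole file here (offset 0, length 0); it can also be scoped to a byte range.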
Alphaneo