views:

30

answers:

1

Hello all,

I’m working on tuning performance on a high-performance, high-capacity data engine which ultimately services an end-user web experience. Specifically, the piece delegated to me revolves around characterizing multi-threaded file IO and memory mapping of the data to local cache. In writing test applications to isolate the timing tall-poles, several questions have been exposed. The code has been minimized to perform only a system file open (open(O_RDONLY)) call. I’m hoping that the result of this query helps us understand the fundamental low-level system processes so that a complete predictive (or at least relational) timing model can be understood. Suggestions are always welcome. We’ve seemed to hit a timing barrier, and would like to understand the behavior and determine whether that barrier can be broken.

The test program: 1) Is written in C, compiled using the gnu C compiler as noted below; 2) Is minimally written to isolate the discovered issues to a single system file “open()”; 3) Is configurable to simultaneously launch a requested number of pthreads; 4) loads a list of 1000 text files of ~8K size; 5) creates the threads (simply) with no attribute modifications; 6) each thread performs multiple, sequential file open() calls on the next available file from the pre-determined list of files until the file list is exhausted in such a way that a single thread should open all 1000 files, 2 threads should theoretically open 500 files (not proven as of yet), etc.);

We’ve run tests multiple times, parametrically varying the thread count, file sizes, and whether the files are located on a local or remote server. Several questions have come up.

Observed results(opening remote files): 1) File open times are higher the first time through (as expected, due to file caching); 2) Running the test app with one thread to load all the remote files takes X seconds; 3) It appears that running the app with a thread count between 1 and # of available CPUs on the machine results in times that are proportional to the number of CPUs (nX seconds). 4) Running the app using a thread count > #CPUs results in run times that seem to level out at the approx same value as the time is takes to run with #CPUs threads (is this coincidental, or a systematic limit, or what?). 5) Running multiple, concurrent processes (for example, 25 concurrent instances of the same test app) results in the times being approximately linear with number of processes for a selected thread count. 6) Running app on different servers shows similar results

Observed results (opening files residing locally): 1) Orders of magnitude faster times (as to be expected); 2) With increasing the thread count, a LOW timing inflection point occurs at around 4-5 active threads, then increases again until the number of threads equals the CPU count, then levels off again; 3) Running multiple, concurrent processes (same test) results in the times being approximately linear with number of processes for a constant thread count (same result as #5 above).

Also, we noticed that Local opens take about .01 ms and sequential network opens are 100x slower at 1ms. Opening network files, we get a linear throughput increase up to 8x with 8 threads, but 9+ threads do nothing. The network open calls seem to block after more than 8 simultaneous requests. What we expected was an initial delay equal to the network roundtrip, and then approximately the same throughput as local. Perhaps there is extra mutex locking done on the local and remote systems that takes 100x longer. Perhaps there is some internal queue of remote calls that only holds 8.

Expected results and questions to be answered either by test or by answers from forums like this one: 1) Running multiple threads would result in the same work done in shorter time; 2) Is there an optimal number of threads; 3) Is there a relationship between the number of threads and CPUs available? 4) Is there some other systematic reason that an 8-10 file limit is observed? 5) How does the system call to “open()” work in a multi-threading process? 1. Each thread gets its context-switched time-slice; 2. Does the open() call block and wait until the file is open/loaded into file cache? Or does the call allow context switching to occur while the operation is in progress? 3. When the open() completes, does the scheduler reprioritize that thread to execute sooner, or does the thread have to wait until its turn in round-robin way; 4. Would having the mounted volume on which the 1000 files reside set as read-only or read/write make a difference? 5. When open() is called with a full path, is each element in the path stat()ed? Would it make more sense to open() a common directory in the list of files tree, and then open() the files under that common directory by relative path?

Development test setup: Red Hat Enterprise Linux Server release 5.4 (Tikanga)

8-CPUS, each with characteristics as shown below:

processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU X5460 @ 3.16GHz stepping : 6 cpu MHz : 1992.000 cache size : 6144 KB physical id : 0 siblings : 4 core id : 1 cpu cores : 4 apicid : 1 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm bogomips : 6317.47 clflush size : 64 cache_alignment : 64 address sizes : 38 bits physical, 48 bits virtual power management:

GNU C compiler, version: gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)

+1  A: 

Not sure if this is one of your issues, but it may be of use.

The one thing that struck me, while optimizing thousands of random reads on a single SATA disk, was that performing non-blocking I/O isn't so easy to do in linux in a clean way, without extra threads.

It is (currently) impossible to issue a non-blocking read() on a block device; i.e. it will block for the 5 ms seek time the disk needs (and 5 ms is an eternity, at 3 GHz). Specifying O_NONBLOCK to open() only served some purpose for backward compatibility, with CD burners or something (this was a rather vague issue). Normally, open() doesn't block or cache anything, it's mostly just to get a handle on a file to do some data I/O later.

For my purposes, mmap() seemed to get me as close to the kernel handling of the disk as possible. Using madvise() and mincore() I was able to fully exploit the NCQ capabilities of the disk, which was simply proved by varying the queue depth of outstanding requests, which turned out to be inversely proportional to the total time taken to issue 10k reads.

Thanks to 64 bit memory addressing, using mmap() to map an entire disk to memory is no problem at all. (on 32 bit platforms, you would need to map the parts of the disk you need using mmap64())

mvds