I have a C++ application running on Windows that wakes up every 15 mins to open & read files present in a directory. The directory changes on every run.

  • open is performed by *ifstream.open(file_name, std::ios::binary)*
  • read is performed by the *streambuf* obtained via *ios::rdbuf()* (see the sketch after this list)
  • Total number of files every 15 mins is around 50,000
  • The files are opened & read in batches of 20
  • The size of each file is around 50 Kbytes
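A minimal sketch of that open/read pattern, assuming the whole file is drained into a string (the function name is illustrative):

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Open in binary mode and drain the file's streambuf via rdbuf()
// into a string -- one bulk copy per file, as described above.
std::string read_via_rdbuf(const std::string& file_name)
{
    std::ifstream file;
    file.open(file_name, std::ios::binary);

    std::ostringstream contents;
    contents << file.rdbuf();   // single bulk read through the streambuf
    return contents.str();
}
```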

For each run, this operation (open & read) takes around 18-23 mins on a dual-core machine with a disk spindle speed of 6000 RPM. I have captured the memory page faults/sec, and they are in the range of 8,000-10,000.

Is there a way to reduce the page faults and optimize file open & read operation?

Gowtham

+2  A: 

Don't use STL if you can avoid it. It handles very difficult internationalization and translation/transformation issues, which makes it slow.

Most often the fastest way to read a file is to memory-map it (on Windows too; CreateFileMapping is the starting point). If at all possible, use a single file with a total size of 50,000 × 50K and directly index that file when writing/reading. You should also consider using a DB (even SQLite) if the data is at all structured. This amount of data is so small that it should stay in memory at all times. You could also try using a RAM disk to avoid going to disk at all (this will tax your error recovery in case of hardware/power failure).
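A minimal sketch of the memory-mapping approach, assuming a single read-only file (the file name is illustrative; error handling is reduced to early returns):

```cpp
#include <windows.h>

int main()
{
    // Open the file for sequential, read-only access.
    HANDLE file = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING,
                              FILE_FLAG_SEQUENTIAL_SCAN, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    // Create a read-only mapping covering the whole file.
    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY,
                                        0, 0, nullptr);
    if (!mapping) { CloseHandle(file); return 1; }

    // Map the entire file into the address space.
    const char* data = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
    if (data) {
        LARGE_INTEGER size = {};
        GetFileSizeEx(file, &size);
        // ... process size.QuadPart bytes starting at data ...
        UnmapViewOfFile(data);
    }

    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}
```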

Pasi Savolainen
"It handles very difficult internationalization and translation/transformation issues which makes it slow." This is entirely dependent on the implementation. If you're performing the read operations at the streambuf level then there should be no i18n formatting issues and it is entirely reasonable (even preferable) for the implementation not to perform and encoding transformations, passing through the bytes as stored on disc.
Charles Bailey
On a dual core machine with a 6000RPM disk, I'd be slightly surprised if he has 2.5GB of RAM to spare for a RAMdisk. But it would certainly speed things up if he did.
Steve Jessop
A: 

According to the MS PSDK documentation, file caching may be used. And, IMHO, since you mentioned Windows, the native CreateFile, ReadFile and CloseHandle with appropriate flags may give better performance than STL.

On the other hand, according to your post it seems you only read, so caching may not increase performance significantly. But since the CPU is fast and disk I/O is usually slow, you may still use this kind of intermediate-buffer concept together with multithreading, i.e. running parallel read threads, as sketched below.
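A hedged sketch of that native path, reading a whole file in one ReadFile call (the helper name is illustrative; FILE_FLAG_SEQUENTIAL_SCAN hints the cache manager to read ahead):

```cpp
#include <windows.h>
#include <vector>

// Read one file with the native Win32 API instead of iostreams.
std::vector<char> read_whole_file(const char* path)
{
    std::vector<char> buffer;
    HANDLE h = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, nullptr);
    if (h == INVALID_HANDLE_VALUE) return buffer;

    LARGE_INTEGER size = {};
    if (GetFileSizeEx(h, &size) && size.QuadPart > 0) {
        buffer.resize(static_cast<size_t>(size.QuadPart));
        DWORD bytes_read = 0;
        if (ReadFile(h, buffer.data(), static_cast<DWORD>(buffer.size()),
                     &bytes_read, nullptr))
            buffer.resize(bytes_read);   // keep only what was read
        else
            buffer.clear();
    }
    CloseHandle(h);
    return buffer;
}
```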

antreality
A: 
  1. Maybe you can use something like memoisation: if a file did not change (you can save its last update time), you can reuse its contents from the previous run, i.e. keep them in memory instead (see the sketch after this list).

  2. I think you don't need FS caching, i.e. it may be better to open files in O_DIRECT mode (that's Linux, but Windows has the similar FILE_FLAG_NO_BUFFERING) and read every file in one I/O: create a buffer of the file's size in memory and read into it in a single call. This should reduce CPU and memory usage considerably.

  3. Multithreading, suggested above, will also help, but not much. I suspect the bottleneck is the disk, which can perform only a limited number of I/O operations per second (100 is a reasonable estimate). That's why you need to reduce the number of I/O operations, e.g. by using (1) or (2) described above.
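A sketch of item (1), assuming modern C++17 std::filesystem for the timestamp check (all names here are illustrative, not from the original post):

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <map>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Cache entry: the file's last-write time plus its bytes.
struct CachedFile {
    fs::file_time_type mtime;
    std::vector<char>  bytes;
};

std::map<std::string, CachedFile> cache;  // keyed by path

// Re-read a file only if its last-write time changed since the last run.
const std::vector<char>& read_cached(const fs::path& path)
{
    const auto mtime = fs::last_write_time(path);
    CachedFile& entry = cache[path.string()];
    if (entry.bytes.empty() || entry.mtime != mtime) {
        std::ifstream in(path, std::ios::binary);
        entry.bytes.assign(std::istreambuf_iterator<char>(in),
                           std::istreambuf_iterator<char>());
        entry.mtime = mtime;
    }
    return entry.bytes;
}
```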

Drakosha
A: 

First; thanks for all the answers. It was very helpful and provided us with many avenues to explore.

We removed STL and used C (fopen & fread). This gave us a slight improvement: the open & read operation for the above-mentioned data now takes 16-17 mins.
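For reference, a minimal sketch of that C-style read path (the helper name is illustrative):

```cpp
#include <cstdio>
#include <vector>

// Plain C stdio instead of iostreams: one fopen, one fread, one fclose.
std::vector<char> read_file_c(const char* path)
{
    std::vector<char> buffer;
    FILE* f = std::fopen(path, "rb");
    if (!f) return buffer;

    // Determine the file size by seeking to the end.
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::rewind(f);

    if (size > 0) {
        buffer.resize(static_cast<size_t>(size));
        size_t got = std::fread(buffer.data(), 1, buffer.size(), f);
        buffer.resize(got);   // keep only what was actually read
    }
    std::fclose(f);
    return buffer;
}
```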

We really nailed the problem by compressing these files. This reduced the size of each file from 50K to 8K, and the time taken by the open & read operation dropped to 4-5 mins.
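The post doesn't say which compressor was used; as one plausible choice, here is a sketch using zlib's compress2/compressBound API:

```cpp
#include <zlib.h>
#include <vector>

// Compress a buffer with zlib (assumed compressor -- the original post
// does not name one). compressBound gives the worst-case output size.
std::vector<unsigned char> deflate_buffer(const std::vector<unsigned char>& in)
{
    uLongf out_len = compressBound(static_cast<uLong>(in.size()));
    std::vector<unsigned char> out(out_len);
    if (compress2(out.data(), &out_len, in.data(),
                  static_cast<uLong>(in.size()), Z_BEST_SPEED) != Z_OK)
        out.clear();          // compression failed
    else
        out.resize(out_len);  // shrink to the actual compressed size
    return out;
}
```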

Thank you.

A: 

Can you please explain how you compressed/uncompressed the files? Is the compress/uncompress time less than the actual read time of the files?