Hi all,

I have a program which reads data from 2 text files and then saves the result to another file. Since there is a lot of data to be read and written, which causes a performance hit, I want to parallelize the reading and writing operations.

My initial thought is, using 2 threads as an example: one thread reads/writes from the beginning of the file, and another thread reads/writes from the middle. Since my files are formatted as lines rather than bytes (each line may contain a different number of bytes), seeking by byte offset does not work for me. The only solution I can think of is to use getline() to skip over the preceding lines first, which is probably not efficient; see the sketch below.
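To make that concrete, the skip-ahead idea I have in mind would look something like this (a minimal sketch; skipLines is just a name I made up for illustration):

    #include <cstddef>
    #include <fstream>
    #include <string>

    // Skip the first n lines of a stream by reading and discarding them.
    // Returns false if the file has fewer than n lines.
    bool skipLines(std::ifstream& in, std::size_t n)
    {
        std::string dummy;
        for (std::size_t i = 0; i < n; ++i)
            if (!std::getline(in, dummy))
                return false;
        return true;
    }

The second thread would have to pay this skipping cost before it can do any useful work, which is what worries me.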

Is there any good way to seek to a specified line in a file? Or do you have any other ideas on how to parallelize file reading and writing?

Environment: Win32, C++, NTFS, Single Hard Disk

Thanks.

-Dbger

+7  A: 

Generally speaking, you do NOT want to parallelize disk I/O. Hard disks do not like random I/O, because they have to continuously seek around to get to the data. Assuming you're not using RAID, and you're using hard drives as opposed to some solid-state memory, you will see severe performance degradation if you parallelize I/O (and even with those technologies, you can still see some performance degradation when doing lots of random I/O).

To answer your second question: there really isn't a good way to seek to a certain line in a file; you can only seek explicitly to a byte offset, using something like fseek() or istream::seekg().
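For example, with a standard stream you can jump straight to a byte offset (a minimal sketch; "data.txt" and the offset are placeholders, and note that an arbitrary offset may land in the middle of a line):

    #include <fstream>
    #include <string>

    int main()
    {
        std::ifstream in("data.txt", std::ios::binary); // placeholder file name
        in.seekg(12345, std::ios::beg);  // seeks to byte 12345, not line 12345
        std::string line;
        std::getline(in, line);          // may start mid-line at an arbitrary offset
        return 0;
    }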

Mike
So in file reading/writing, disk seeks cost most of the time, especially in a multi-threaded environment. Is that right?
lz_prgmr
Yes, disk seek time will generally be the bottleneck in a multithreaded I/O environment. You should try to serialize your I/O where possible.
Mike
Thanks Mike. Just to confirm: does this apply only when reading a single file, or also when reading multiple files (thread 1 reads file1, thread 2 reads file2)?
lz_prgmr
What I said applies to ANY disk I/O on a single disk, regardless of whether there are separate files. Of course, caching by the OS or disk will have some effect on actual results.
Mike
+1  A: 

This isn't really an answer to your question but rather a re-design (which we all hate but can't help doing). As already mentioned, trying to speed up I/O on a hard disk with multiple threads probably won't help.

However, it might be possible to use another approach depending on data sensitivity, throughput needs, data size, etc. It would not be difficult to create a structure in memory that maintains a picture of the data and allows easy/fast updates of the lines of text anywhere in the data. You could then use a dedicated thread that simply monitors that structure and whose job it is to write the data to disk. Writing data sequentially to disk can be extremely fast; it can be much faster than seeking randomly to different sections and writing it in pieces.
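A minimal sketch of that idea, assuming C++11 threads are available (the queue contents and "output.txt" are placeholders):

    #include <condition_variable>
    #include <fstream>
    #include <mutex>
    #include <queue>
    #include <string>

    std::queue<std::string> pending;   // lines waiting to be written
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Worker threads call this whenever a line of output is ready.
    void submitLine(std::string line)
    {
        {
            std::lock_guard<std::mutex> lock(m);
            pending.push(std::move(line));
        }
        cv.notify_one();
    }

    // Called once after all workers have finished.
    void finish()
    {
        {
            std::lock_guard<std::mutex> lock(m);
            done = true;
        }
        cv.notify_one();
    }

    // Dedicated writer: the only code that touches the disk, so all
    // writes happen sequentially.
    void writerThread()
    {
        std::ofstream out("output.txt"); // placeholder output file
        std::unique_lock<std::mutex> lock(m);
        while (!done || !pending.empty()) {
            cv.wait(lock, [] { return done || !pending.empty(); });
            while (!pending.empty()) {
                std::string line = std::move(pending.front());
                pending.pop();
                lock.unlock();           // don't hold the lock during disk I/O
                out << line << '\n';
                lock.lock();
            }
        }
    }

The workers never block on the disk; they only block briefly on the queue lock, and the disk sees one sequential stream of writes.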

Mark Wilkins
When I write about 2MB of data to a text file sequentially, it takes about 1 second on my machine, which is too slow for me. As for reading: in order to build the in-memory structure of the file, I need to read the data in first, which is also too slow to meet my requirements. However, I will investigate overlapped I/O and memory-mapped files to see if they help.
lz_prgmr
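For reference, memory-mapping a file for reading on Win32 looks roughly like this (error handling omitted; "data.txt" is a placeholder):

    #include <windows.h>

    int main()
    {
        HANDLE file = CreateFileA("data.txt", GENERIC_READ, FILE_SHARE_READ,
                                  NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
        const char* data = static_cast<const char*>(
            MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0)); // map the whole file

        DWORD size = GetFileSize(file, NULL);
        // ... scan data[0]..data[size-1] directly in memory,
        //     e.g. split it into lines without any explicit read calls ...

        UnmapViewOfFile(data);
        CloseHandle(mapping);
        CloseHandle(file);
        return 0;
    }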
1 second to write 2MB? That seems amazingly slow. I just ran a test that writes 10MB to a file in about 100ms, and my PC is no real speed machine (3.2GHz and, I *think*, a 7200rpm drive). What APIs are you using to open and write the file?
Mark Wilkins
I am using std::ofstream to save lots of separate pieces of data in a loop, like "for (...) { streamOut << x; streamOut << y; }", and I also have a 7200rpm drive, with a dual-core 2.16GHz CPU.
lz_prgmr
That is interesting. If I get time, I may have to test that on my PC out of curiosity. I was simply using the Win32 APIs (CreateFile, WriteFile). But in reality, I would expect the stream I/O to go through those APIs on Win32, or if not, it would still go through some kind of buffered I/O. The average latency of a 7200rpm disk should be under 5ms, which should allow for a lot of buffered writes. I suppose that if the disk were completely fragmented into 4096-byte chunks, it could come out to 1 second/MB.
Mark Wilkins
MarkW, it turns out that most of the time was being spent on string formatting when calling "streamOut << x << " " << y << " " << z << endl". I changed the code to format all the data into a string first and then write it out to the file all at once; now it takes about 24ms to write 2MB of data. Then, by parallelizing the string formatting, there is a noticeable performance gain. Thanks so much.
lz_prgmr
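For reference, the change described above looks roughly like this (a sketch with placeholder names; std::ostringstream serves as the in-memory buffer):

    #include <cstddef>
    #include <fstream>
    #include <sstream>
    #include <string>

    // Format everything into an in-memory buffer first, then hand the
    // whole block to the stream in one call. Using '\n' instead of
    // std::endl also avoids a flush per line.
    void writeAllAtOnce(std::ofstream& out, const double* x, const double* y,
                        const double* z, std::size_t n)
    {
        std::ostringstream buf;
        for (std::size_t i = 0; i < n; ++i)
            buf << x[i] << ' ' << y[i] << ' ' << z[i] << '\n';
        const std::string s = buf.str();
        out.write(s.data(), static_cast<std::streamsize>(s.size()));
    }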
Cool. I'm glad you got it sped up. Those numbers make a lot more sense. Thanks for reporting back on it. I was curious but never had time to test it.
Mark Wilkins
+1  A: 

Queuing multiple reads and writes won't help when you're running against one disk. If your app also performs a lot of CPU work, you could do your reads and writes asynchronously and let the CPU work while the disk I/O happens in the background. Alternatively, get a second physical hard drive: read from one, write to the other. For modestly sized data sets that's often effective and quite a bit cheaper than writing code.
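A sketch of the read side of that, using std::async as a stand-in for whatever async mechanism you pick (overlapped I/O on Win32 would be the same idea; readBatch and the batch size are placeholders):

    #include <cstddef>
    #include <fstream>
    #include <future>
    #include <string>
    #include <vector>

    // Read the next batch of up to n lines from the stream.
    std::vector<std::string> readBatch(std::ifstream& in, std::size_t n)
    {
        std::vector<std::string> lines;
        std::string line;
        while (lines.size() < n && std::getline(in, line))
            lines.push_back(line);
        return lines;
    }

    void processAll(std::ifstream& in)
    {
        const std::size_t batchSize = 10000; // tuning knob
        auto next = std::async(std::launch::async, readBatch,
                               std::ref(in), batchSize);
        for (;;) {
            std::vector<std::string> current = next.get();
            if (current.empty())
                break;
            // Kick off the next read; it overlaps with the loop below.
            next = std::async(std::launch::async, readBatch,
                              std::ref(in), batchSize);
            for (std::size_t i = 0; i < current.size(); ++i) {
                // ... CPU-bound processing of current[i] goes here ...
            }
        }
    }

Only one read is ever in flight at a time, so the stream is never touched concurrently; the win comes from overlapping that read with the processing of the previous batch.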

Curt Nichols
Using a background thread to write the output data gradually while the CPU is busy computing is a good idea. But as for reading, not much work can be done in parallel, since the data is not ready yet.
lz_prgmr
Dbger, it depends on the nature of your data. If you're able to queue a second asynchronous fetch to be satisfied while processing the first fetch's data, you're in business. Again, it's most effective if the disk isn't busy with other I/O, so it's possibly not applicable to your immediate situation.
Curt Nichols