views: 79
answers: 4
I have some gigantic (several gigabyte) ASCII text files that I need to read in line-by-line, convert certain columns to floating point, and do a few simple operations on these numbers. It's pretty straightforward stuff, except that I'm thinking that there has to be a way to speed it up a whole bunch. The program never uses the equivalent of 100% of a CPU core because it spends so much time waiting on I/O. At the same time, it spends enough time doing computations instead of I/O that it only does ~8-10 MB/sec of raw disk I/O. I've seen my hard drive do a lot better than that.

Would it likely help to do the I/O and processing in separate threads? If so, what's an efficient way of implementing this? An important issue is what to do with memory allocation for holding each line so that I don't bottleneck on that.

Edit: I'm using the D programming language, version 2 standard lib., mostly the higher level functions, for most of this stuff right now. The buffer size used by std.stdio.File is 16 KB.

A: 

If you've got enough RAM, you could read the whole file into a string, tokenize it on line delimiters and process the tokens however you want.

In Java you could read the file contents into a StringBuilder. You'd also want to launch the JVM with a sufficient memory limit (2 GB in this example) using something like:

java -Xmx2048m -Xms2048m -jar MyMemoryHungryApp.jar

If you don't want to read the whole file into a string you could iteratively read it in batches and process the batches.

In fact, depending on the details of your file format, you could probably use CSVReader, an open-source Java package (project page), to read your file into memory via its readAll() method; you'll end up with a List<String[]> and can go to town on it :).
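As a hedged illustration of the read-everything-then-tokenize approach (sticking to the standard library rather than CSVReader; the file name and the choice of column to convert are invented for the example):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class WholeFileExample {
        public static void main(String[] args) throws IOException {
            StringBuilder sb = new StringBuilder();
            // Pull the entire file into memory through a large read buffer.
            try (BufferedReader in = new BufferedReader(new FileReader("data.txt"), 1 << 20)) {
                char[] chunk = new char[1 << 20];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    sb.append(chunk, 0, n);
                }
            }
            double sum = 0;
            // Tokenize on line delimiters and convert, say, the third whitespace-separated column.
            for (String line : sb.toString().split("\n")) {
                String[] cols = line.trim().split("\\s+");
                if (cols.length > 2) {
                    sum += Double.parseDouble(cols[2]);
                }
            }
            System.out.println("sum of column 3 = " + sum);
        }
    }

Note that a single StringBuilder tops out around two billion characters, so for the multi-gigabyte files in the question you would fall back to the batched variant described above.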

vicatcu
+1  A: 

If you're not hitting 100% CPU then you're I/O bound, and you won't see much (if any) improvement from multithreading; you'll just have several threads sitting around waiting for I/O. Indeed, if they are accessing different parts of the file, you could introduce disk seeking and make things much worse.

Look first at the simpler things: can you increase the amount of buffer RAM available for the I/O? (For example, in C++ the standard I/O buffers for FILE objects are tiny, often 4 kB; setting a larger buffer, say 64 kB, can make a massive difference to throughput.)

Can you use larger buffer sizes in your I/O requests? For example, read 64 KB of raw data into a large buffer and then process it yourself, rather than reading one line or one byte at a time.
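To make the raw-chunk idea concrete, here is a minimal sketch in Java (continuing the language of the first answer); the file name and the 64 KB chunk size are placeholders, and the actual column parsing is left as a comment:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class ChunkedReadExample {
        public static void main(String[] args) throws IOException {
            // Read the file in large raw chunks (64 KB here) instead of a line or a byte at a time.
            try (FileChannel ch = FileChannel.open(Paths.get("data.txt"), StandardOpenOption.READ)) {
                ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
                long total = 0;
                while (ch.read(buf) != -1) {
                    buf.flip();
                    total += buf.remaining();
                    // ... scan buf for newlines and parse the interesting columns here ...
                    buf.clear();
                }
                System.out.println("bytes read: " + total);
            }
        }
    }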

Are you outputting any data? By caching it in RAM instead of writing it back to disk immediately, you can limit your I/O to purely reading the input file and help things go much faster.

You may find that once you are loading large buffers of data you start to become CPU bound, at which point you can think about multithreading: one thread to read the data and other thread(s) to process it.
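One common shape for that split is a reader thread feeding a bounded queue that the parsing code drains. A minimal Java sketch, with the file name, queue size, and column choice invented for illustration:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class PipelineExample {
        // Unique sentinel object marking end of input (compared by reference, not equals).
        private static final String EOF = new String("EOF");

        public static void main(String[] args) throws Exception {
            // Bounded queue so the reader cannot race arbitrarily far ahead of the parser.
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

            Thread reader = new Thread(() -> {
                try (BufferedReader in = new BufferedReader(new FileReader("data.txt"), 1 << 20)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        queue.put(line);
                    }
                    queue.put(EOF);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            reader.start();

            double sum = 0;
            // Main thread does the number crunching while the reader thread does the I/O.
            for (String line = queue.take(); line != EOF; line = queue.take()) {
                String[] cols = line.trim().split("\\s+");
                if (cols.length > 2) {
                    sum += Double.parseDouble(cols[2]);
                }
            }
            reader.join();
            System.out.println("sum of column 3 = " + sum);
        }
    }

The bounded queue also speaks to the memory-allocation worry in the question: the reader blocks when the parser falls behind, so only a fixed number of lines is ever in flight.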

Jason Williams
A: 

First of all, I would take the program you've got and get stackshots of it. That will tell you for certain how much time is spent in I/O, and how much in CPU.

Then, if I/O is dominant, I would make sure I'm reading buffers as large as possible, to minimize disk head motions.

Then, if I'm seeing I/O waiting on CPU, followed by CPU waiting on I/O, I would try to do asynchronous I/O, so that one buffer could be loading while the CPU runs on the other. (Or you could do that with a reader thread, reading into alternate buffers.)
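The answer does not prescribe a particular API, but one way to overlap loading one buffer with processing the other is java.nio's AsynchronousFileChannel. A hedged sketch, with a placeholder process() step, an invented file name, and arbitrary 1 MB buffers:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.AsynchronousFileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.Future;

    public class DoubleBufferExample {
        public static void main(String[] args) throws IOException, InterruptedException, ExecutionException {
            try (AsynchronousFileChannel ch =
                     AsynchronousFileChannel.open(Paths.get("data.txt"), StandardOpenOption.READ)) {
                ByteBuffer[] bufs = { ByteBuffer.allocate(1 << 20), ByteBuffer.allocate(1 << 20) };
                long pos = 0;
                int current = 0;
                // Kick off the first read.
                Future<Integer> pending = ch.read(bufs[current], pos);
                while (true) {
                    int n = pending.get();              // wait for the outstanding read
                    if (n == -1) break;                 // end of file
                    pos += n;
                    int next = 1 - current;
                    bufs[next].clear();
                    pending = ch.read(bufs[next], pos); // start loading the other buffer...
                    bufs[current].flip();
                    process(bufs[current]);             // ...while the CPU works on this one
                    current = next;
                }
            }
        }

        private static void process(ByteBuffer buf) {
            // Placeholder: scan for newlines and parse the interesting columns here.
        }
    }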

If I/O is not dominant and CPU is dominant, then I would see what stackshots tell me about the CPU activity. If an inordinate percent of time is being spent in the de-formatting of floating point numbers, and if the numbers are of fairly simple format, I would consider parsing them myself, because I can take advantage of the simpler format.
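If the numbers really are in a simple fixed format, a hand-rolled parser can be considerably cheaper than the general-purpose one. A sketch in Java that assumes unsigned decimals with no exponent and few enough digits to fit in a long (extracting the column from the line is not shown):

    public class SimpleFloatParser {
        // Parse a non-negative decimal such as "123.456" from a character range,
        // skipping the generality (and cost) of Double.parseDouble.
        // Assumes plain digits with at most one '.', no sign, no exponent.
        static double parseSimple(CharSequence s, int start, int end) {
            long mantissa = 0;
            int scale = 0;
            boolean seenDot = false;
            for (int i = start; i < end; i++) {
                char c = s.charAt(i);
                if (c == '.') {
                    seenDot = true;
                } else {
                    mantissa = mantissa * 10 + (c - '0');
                    if (seenDot) scale++;   // count digits after the decimal point
                }
            }
            return mantissa / Math.pow(10, scale);
        }

        public static void main(String[] args) {
            System.out.println(parseSimple("3.25", 0, 4));   // 3.25
            System.out.println(parseSimple("1000", 0, 4));   // 1000.0
        }
    }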

Does that help?

Mike Dunlavey
A: 

Normally the OS will try to read ahead, and you should get speeds near the hard disk's limit if you are not CPU bound.

The cause can be:

  • Large file is fragmented (you might defragment the volume and check if things work better)
  • OS does not use read-ahead (as a solution: under Windows you can pass CreateFile a flag indicating that you will be scanning the file sequentially, i.e. FILE_FLAG_SEQUENTIAL_SCAN)
  • You do not use efficient buffering (e.g. if you read only a few bytes at a time from an OS file handle, things will be slow; try reading larger chunks at once)

Only once you are CPU bound should you start looking at more efficient parsing of the data.

Ritsaert Hornstra