I understand that both Java and Perl try quite hard to find a one-size-fits-all default buffer size when reading files, but I find their choices increasingly antiquated, and I am having a problem changing the default in Perl.
In the case of Perl, which I believe uses 8K buffers by default, similar to Java's choice, I can't find a reference using the perldoc website search engine (really Google) on how to increase the default file input buffer size to, say, 64K.
From the above link, to show how 8K buffers don't scale:
If lines typically have about 60 characters each, then the 10,000-line file has about 610,000 characters in it. Reading the file line-by-line with buffering only requires 75 system calls and 75 waits for the disk, instead of 10,001.
So for a 50,000,000-line file with 60 characters per line (including the newline at the end), reading with an 8K buffer takes 366,211 system calls to get through the 2.8 GiB file. As an aside, you can confirm this behaviour by watching the disk I/O read delta in the Task Manager process list on Windows (I'm sure top on *nix can show something similar) while your Perl program takes 10 minutes to read in a text file :)
Someone asked on perlmonks about increasing the Perl input buffer size, and someone replied here that you could increase the size of "$/" and thus increase the buffer size. However, from the perldoc:
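The arithmetic above can be checked with a small sketch. The class and method names here are hypothetical, and it only counts buffer refills, assuming every refill is exactly one full-buffer read (a real run also makes one final read that reports end of file):

```java
// Hypothetical helper for checking the numbers quoted above; not from any library.
final class ReadCallEstimate {
    /** Number of buffer refills needed to cover totalBytes, i.e. ceil(totalBytes / bufferBytes). */
    static long refills(long totalBytes, long bufferBytes) {
        return (totalBytes + bufferBytes - 1) / bufferBytes;
    }

    public static void main(String[] args) {
        long small = 610000L;           // the 10,000-line file from the quote
        long large = 50000000L * 60L;   // 50,000,000 lines of 60 bytes each

        System.out.println(refills(small, 8 * 1024));   // 75
        System.out.println(refills(large, 8 * 1024));   // 366211
        System.out.println(refills(large, 64 * 1024));  // 45777
    }
}
```

Under the same assumption, a 64K buffer would cut the number of reads for the large file by a factor of eight.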
Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer.
So I assume that this does not actually increase the buffer size that Perl uses to read ahead from the disk when using the typical:
while (<>) {
    # do something with $_ here
    ...
}
"line-by-line" idiom.
Now it could be that a different "read a record at a time and then parse it into lines" version of the above code would be faster in general, and would bypass the underlying problem of not being able to change the default buffer size in the standard idiom (if that's indeed impossible). You could set the "record size" to anything you wanted, parse each record into individual lines, and hope that Perl does the right thing and ends up doing one system call per record. But that adds complexity, and all I really want is an easy performance gain from increasing the buffer used in the above example to a reasonably large size, say 64K, or even tuning it to the optimal size for long reads using a test script on my system, without extra hassle.
Things are much better in Java as far as straightforward support for increasing the buffer size goes.
In Java, I believe the current default buffer size that java.io.BufferedReader uses is also 8192 (counted in chars rather than bytes, since the buffer is a char array), although up-to-date references in the JDK docs are equivocal, e.g., the 1.5 docs say only:
The buffer size may be specified, or the default size may be accepted. The default is large enough for most purposes.
Luckily with Java you do not have to trust the JDK developers to have made the right decision for your application and can set your own buffer size (64K in this example):
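import java.io.BufferedReader;
import java.io.InputStreamReader;
[...]
// The second constructor argument is the buffer size: 65536 chars (64K) here.
reader = new BufferedReader(new InputStreamReader(fileInputStream, "UTF-8"), 65536);
[...]
while (true) {
    String line = reader.readLine();
    if (line == null) {
        break; // end of input
    }
    /* do something with the line here */
    foo(line);
}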
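For tuning the buffer size with a test script, a minimal sketch along these lines could be used. The class name, the list of sizes, and taking the file path from `args[0]` are all assumptions for illustration, and it sticks to JDK 1.5 features:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

// Hypothetical timing harness; names and buffer sizes are illustrative only.
final class BufferSizeTrial {
    /** Reads the file line by line with the given buffer size (in chars) and returns the line count. */
    static long countLines(File file, int bufferChars) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"), bufferChars);
        try {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = new File(args[0]); // path to a large text file
        int[] sizes = { 8192, 65536, 262144, 1048576 };
        for (int i = 0; i < sizes.length; i++) {
            long start = System.nanoTime();
            long lines = countLines(file, sizes[i]);
            long millis = (System.nanoTime() - start) / 1000000L;
            System.out.println(sizes[i] + " chars: " + lines + " lines in " + millis + " ms");
        }
    }
}
```

Timings from a run like this are only a rough guide: the operating system's file cache tends to make later passes over the same file faster, and the size passed to BufferedReader is not necessarily the size of each underlying read, so it is worth repeating the runs and varying the order of sizes.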
There's only so much performance you can squeeze out of parsing one line at a time, even with a huge buffer and modern hardware. I'm sure there are ways to get every ounce of performance out of reading a file by reading big many-line records, breaking each into tokens, and processing those tokens once per record, but they add complexity and edge cases. (If there's an elegant solution in pure Java, using only the features present in JDK 1.5, that would be cool to know about.) Increasing the buffer size in Perl would solve 80% of the performance problem for Perl at least, while keeping things straightforward.
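To make the "read a big record, then break it into lines" idea concrete, here is a rough sketch using only JDK 1.5 features. `ChunkedLines` and `LineHandler` are hypothetical names, with the handler standing in for `foo(line)` in the Java loop above. It splits on '\n' only (tolerating "\r\n"), so unlike `readLine()` it does not treat a lone '\r' in the middle of the input as a line ending, which is exactly the kind of edge case mentioned above. Whether it is any faster than `readLine()` is something that would have to be measured.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of chunked reading; not a drop-in replacement for readLine().
final class ChunkedLines {
    /** Hypothetical callback; stands in for foo(line). */
    interface LineHandler {
        void handle(String line);
    }

    /** Reads up to chunkChars at a time and hands each '\n'-terminated line to the handler. */
    static void forEachLine(Reader in, int chunkChars, LineHandler handler) throws IOException {
        char[] chunk = new char[chunkChars];
        StringBuilder carry = new StringBuilder(); // partial line left over from the previous chunk
        int n;
        while ((n = in.read(chunk, 0, chunk.length)) != -1) {
            int start = 0;
            for (int i = 0; i < n; i++) {
                if (chunk[i] == '\n') {
                    carry.append(chunk, start, i - start);
                    emit(carry, handler);
                    start = i + 1;
                }
            }
            carry.append(chunk, start, n - start); // keep the unfinished tail for the next chunk
        }
        if (carry.length() > 0) {
            emit(carry, handler); // last line without a trailing newline
        }
    }

    private static void emit(StringBuilder line, LineHandler handler) {
        int len = line.length();
        if (len > 0 && line.charAt(len - 1) == '\r') {
            len--; // tolerate "\r\n" line endings
        }
        handler.handle(line.substring(0, len));
        line.setLength(0);
    }

    public static void main(String[] args) throws IOException {
        final List<String> seen = new ArrayList<String>();
        // A tiny chunk size forces lines to straddle chunk boundaries.
        forEachLine(new StringReader("alpha\nbeta\r\ngamma"), 4, new LineHandler() {
            public void handle(String line) {
                seen.add(line);
            }
        });
        System.out.println(seen); // [alpha, beta, gamma]
    }
}
```

In real use the `Reader` would be an `InputStreamReader` over the file and the chunk size something like 65536; the carry-over buffer is what lets a line span two chunks.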
My question is:
Is there a way to adjust that buffer size in Perl for the above typical "line-by-line" idiom, similar to how the buffer size was increased in the Java example?