Say I want to extract the first word (or floating-point number) that follows a given string in some text file (see http://stackoverflow.com/questions/3549877/how-to-extract-the-first-word-that-follows-a-string). I know you can do it with Perl or sed, and probably in many other ways. I am looking for performance: what is the fastest way to parse?
As with most similar questions, I would implement your particular case and benchmark it with a selection of the applicable tools.
You may find that other factors skew your figures (e.g. the time spent opening and reading the file), in which case a blanket technology recommendation would be worthless.
I'd bet that if it's really a simple case (no real regular expression involved), C would win (memchr() and friends); otherwise Tcl would be a strong contender, followed by Perl (the latter requires good regular-expression crafting skills).
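To give a feel for the memchr()-and-friends approach, here is a minimal C++ sketch (the function name and the scan strategy are mine, just an illustration, not a reference implementation): memchr() jumps to each candidate first byte of the needle, and memcmp() verifies the rest.

#include <cstddef>
#include <cstring>

// Find the first occurrence of needle (nlen bytes) in haystack (hlen bytes),
// or return nullptr. memchr() locates candidate first bytes; memcmp() confirms.
const char* find_fixed(const char* hay, std::size_t hlen,
                       const char* nee, std::size_t nlen) {
    if (nlen == 0 || nlen > hlen) return nlen == 0 ? hay : nullptr;
    const char* p = hay;
    const char* end = hay + hlen - nlen + 1;  // last possible match start + 1
    while ((p = static_cast<const char*>(memchr(p, nee[0], end - p)))) {
        if (memcmp(p, nee, nlen) == 0) return p;
        ++p;
    }
    return nullptr;
}

On glibc, memchr() is heavily vectorized, so this kind of loop often beats a naive byte-by-byte scan by a wide margin.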
Addendum: rereading your description (and the accompanying link), I'd say you should test some tools and compare the results. Can you provide an input file for us (hosted somewhere) and specify the exact problem based on that file?
Addendum 2 (for Feynman): I saved this page (~44 KB, as 'sof.dat') and ran a repeated-search test in Perl for the string you proposed ("site design /"). This is the code used:
...
use Benchmark qw' timeit timestr ';
{
    open my $fh, '<', 'sof.dat' or die $!;
    undef $/;            # slurp mode: read the whole file at once
    our $data = <$fh>;   # package variable, visible to the eval'd test code
    close $fh;
    my $count = 100000;
    my $t = timeit($count, ' $data =~ m|site design /| ? 1 : 0 ' );
    print "$count loops of code took:", timestr($t),"\n";
}
...
With my setup (2.4 GHz E6600, Windows 7, ActivePerl), the following timings came out:
100000 loops of code took: 1 wallclock secs ( 1.36 usr + 0.00 sys = 1.36 CPU) @ 73746.31/s (n=100000)
which means I can find this text in this page about 74 thousand times per second.
If I include the full file processing in the timed code:
use Benchmark qw' timeit timestr ';
{
    my $count = 10000;
    my $t = timeit($count, q{
        open my $fh, '<', 'sof.dat' or die $!;
        undef $/;
        our $data = <$fh>;
        close $fh;
        $data =~ m|site design /| ? 1 : 0
    });
    print "$count loops of code took:", timestr($t),"\n";
}
we'll get only:
10000 loops of code took: 3 wallclock secs ( 1.70 usr + 1.51 sys = 3.21 CPU) @ 3110.42/s (n=10000)
that is, only about 3 thousand full open-read-search passes per second.
Regards
rbo
Most of the time you will be reading the string from disk, or even worse, over the network, and that is already orders of magnitude slower than any sane search method you might choose. Having said that, if your string is already in memory and you know your search term up front, regular expressions (real FSM-based ones, not Perl's backtracking flavor) are guaranteed to run in time linear in the input size, so anything faster is only faster by a constant factor. There is a fast regular expression engine library called RE2.
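RE2 is a C++ library; as a sketch of how it would apply to the original question (the pattern and the stand-in input are made up for illustration):

#include <re2/re2.h>
#include <iostream>
#include <string>

int main() {
    // Stand-in for the file contents; in practice, slurp the file first.
    std::string data = "header text ... site design / logo (c) 2010 more text";
    std::string word;

    // RE2 compiles the pattern to an automaton, so matching is linear
    // in the length of data regardless of the pattern.
    if (RE2::PartialMatch(data, R"(site design /\s*(\S+))", &word))
        std::cout << "first word after the marker: " << word << "\n";
}

Link against the library when building, e.g. g++ scan.cc -lre2.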
If you're looking for a fixed string, you probably want to search for it using something like Boyer-Moore or Boyer-Moore-Horspool (for the latter, I'd recommend Ray Gardner's implementation). Note that B-M and B-M-H are both sublinear on typical inputs: they skip over parts of the text entirely. Regular expressions, by contrast, are linear at best¹, and many implementations (those that use backtracking) are quadratic or worse.
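For flavor, here is a minimal Boyer-Moore-Horspool sketch in C++ (my own illustration, not Ray Gardner's code). The idea is to precompute, for every byte value, how far the search window may slide when that byte appears under the last position of the pattern.

#include <climits>
#include <cstddef>
#include <cstring>

// Boyer-Moore-Horspool: find the first occurrence of nee (nlen bytes)
// in hay (hlen bytes), or return nullptr.
const char* bmh_find(const char* hay, std::size_t hlen,
                     const char* nee, std::size_t nlen) {
    if (nlen == 0 || nlen > hlen) return nlen == 0 ? hay : nullptr;

    // Shift table: distance the window slides when a given byte sits
    // under the pattern's last position.
    std::size_t shift[UCHAR_MAX + 1];
    for (std::size_t i = 0; i <= UCHAR_MAX; ++i) shift[i] = nlen;
    for (std::size_t i = 0; i + 1 < nlen; ++i)
        shift[static_cast<unsigned char>(nee[i])] = nlen - 1 - i;

    for (std::size_t pos = 0; pos + nlen <= hlen;
         pos += shift[static_cast<unsigned char>(hay[pos + nlen - 1])]) {
        if (memcmp(hay + pos, nee, nlen) == 0) return hay + pos;
    }
    return nullptr;
}

The sublinear behavior comes from the shift table: when the byte under the pattern's last cell does not occur in the pattern at all, the window jumps a full nlen bytes without examining anything in between.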
The next step is to ensure you're reading the data into memory as quickly as possible; in reality, this is typically the bottleneck. Unfortunately, dealing well with it usually requires non-portable code. Under Linux, mmap tends to be your best bet, while under Windows you're usually better off reading large chunks at a time and opening the file via CreateFile with the FILE_FLAG_NO_BUFFERING flag. It's also worth using I/O completion ports (IOCP) to carry out the reading, so you can carry out the searching and the reading in parallel.
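Here is a pared-down POSIX sketch of the mmap route (my illustration, Linux-only; memmem() is a glibc extension, and real code would need fuller error handling, e.g. for empty files):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main(int argc, char** argv) {
    if (argc != 3) { fprintf(stderr, "usage: %s <file> <needle>\n", argv[0]); return 2; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

    // Map the file read-only: the kernel pages it in on demand,
    // avoiding an extra copy into a userspace buffer.
    char* data = static_cast<char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    // memmem() is a glibc extension (not portable C++); any of the
    // fixed-string searchers above could be dropped in instead.
    void* hit = memmem(data, st.st_size, argv[2], strlen(argv[2]));
    if (hit) printf("found at offset %td\n", static_cast<char*>(hit) - data);
    else puts("not found");

    munmap(data, st.st_size);
    close(fd);
}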
¹ In theory it would be possible to write an RE engine that did sublinear searching for the right kinds of patterns, but if any engine actually does, I'm not aware of it.