Say I want to extract the first word (or floating-point number) that follows a given string found in some text file (see http://stackoverflow.com/questions/3549877/how-to-extract-the-first-word-that-follows-a-string). I know you can do it with Perl or sed, and probably many other ways. I am looking for performance: what is the fastest way to parse?

+1  A: 

As with most similar questions, I would implement your particular case and measure it using a selection of applicable tools.

You may find that other factors skew your figures, e.g. the time to open/read the file, and so an abstract technology recommendation would be worthless.

Brian Agnew
A: 

I'd bet that if it's really a simple case (no real regular expression involved), C would win (memchr() and friends); otherwise Tcl would be a strong contender, followed by Perl (the latter requires good crafting skills for regular-expression design).
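To make the C suggestion concrete, here is a minimal sketch (my own illustration, not rbo's code) that uses strstr() to locate a fixed marker and then copies out the whitespace-delimited token that follows it; the marker and input text are placeholders:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative helper: return a malloc'd copy of the first
     * whitespace-delimited word following `marker` in `text`,
     * or NULL if the marker doesn't occur. */
    static char *word_after(const char *text, const char *marker)
    {
        const char *p = strstr(text, marker);   /* fixed-string search */
        if (p == NULL)
            return NULL;
        p += strlen(marker);
        p += strspn(p, " \t");                  /* skip separating blanks */
        size_t len = strcspn(p, " \t\r\n");     /* token ends at whitespace */
        char *word = malloc(len + 1);
        if (word != NULL) {
            memcpy(word, p, len);
            word[len] = '\0';
        }
        return word;
    }

    int main(void)
    {
        char *w = word_after("final energy = -76.4089 hartree", "energy =");
        if (w != NULL) {
            puts(w);       /* prints "-76.4089" */
            free(w);
        }
        return 0;
    }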

Addendum: reading your description again (and the accompanying link), I'd say you should test some tools and compare the results. Can you provide an input file for us (somewhere) and specify the exact problem based on that file?

Addendum 2 (for Feynman): I saved this site (~44KB, 'sof.dat') and performed a repeated search test (as you proposed: "site design /") in Perl. This is the code used:

 ...
 use Benchmark qw' timeit timestr ';

 {
     open my $fh, '<', 'sof.dat' or die $!;
     undef $/;
     our $data = <$fh>;
     close $fh;

     my $count = 100000;
     my $t = timeit($count, ' $data =~ m|site design /| ? 1 : 0 ');
     print "$count loops of code took:", timestr($t),"\n";
 }
 ...

With my setup (2.4 GHz E6600, Win7, ActivePerl), the following timings came out:

100000 loops of code took: 1 wallclock secs ( 1.36 usr + 0.00 sys = 1.36 CPU) @ 73746.31/s (n=100000)

which means I can find this text in this page about 74 thousand times per second.

If I include the full file processing:

use Benchmark qw' timeit timestr ';

{
    my $count = 10000;
    my $t = timeit($count, q{
        open my $fh, '<', 'sof.dat' or die $!;
        undef $/;
        our $data = <$fh>;
        close $fh;
        $data =~ m|site design /| ? 1 : 0
    });
    print "$count loops of code took:", timestr($t),"\n";
}

we'll get only:

10000 loops of code took: 3 wallclock secs ( 1.70 usr + 1.51 sys = 3.21 CPU) @ 3110.42/s (n=10000)

that is, only about 3 thousand open/read/search passes per second.

Regards

rbo

rubber boots
My project is still in its infancy. I wanted to pick a good method from the start. I do not have any unusually long data files at this point. How would I attach them if I could make some up?
Feynman
OK, how about this: use the HTML source of this page. Find the text following, say, "site design /" (no quotes).
Feynman
@Feynman, I made another addendum
rubber boots
wow. Thanks! So I guess I'd better write an equivalent program in C, bash, and Tcl. I have a bash program, but I just copied the code from a thread on another forum. I really do not know these languages very well (or, for Tcl and Perl, at all). I am trying to automate generating input files for some other software and extracting lines from the output. I think conducting this experiment is beyond me. I just learn the basics and get the rest from Google. Thanks for the input.
Feynman
+1  A: 

Most of the time you will be reading the string from disk, or even worse, reading it from the network, and that is already orders of magnitude slower than any sane way of searching you choose. Having said that, if your string is already in memory and you know your search term up front, regular expressions (real FSM-based ones, not Perl's) are guaranteed to run in time linear in the input size -- so anything faster is only going to be faster by a constant factor. There is a fast regular expression engine library called re2.
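re2 itself exposes a C++ API, so as a stand-in here is the general "compile once, scan many" shape using POSIX <regex.h> in C (my illustration; note that POSIX implementations are not necessarily FSM-based and carry no linear-time guarantee, unlike re2):

    #include <regex.h>
    #include <stdio.h>

    int main(void)
    {
        regex_t re;

        /* Compile once, up front; REG_EXTENDED selects ERE syntax. */
        if (regcomp(&re, "[0-9]+\\.[0-9]+", REG_EXTENDED) != 0)
            return 1;

        const char *lines[] = { "temp = 36.6 C", "no number here", "pi ~ 3.14" };
        for (int i = 0; i < 3; i++) {
            regmatch_t m;
            if (regexec(&re, lines[i], 1, &m, 0) == 0)
                printf("match in \"%s\" at [%d, %d)\n",
                       lines[i], (int)m.rm_so, (int)m.rm_eo);
        }

        regfree(&re);
        return 0;
    }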

Radomir Dopieralski
See my answer -- the "only faster by a constant factor" part isn't necessarily true.
Jerry Coffin
You are right, if the whole string is already in memory you can be faster. I didn't think about it because most of the time you need to read it first, and that is going to be linear at best. Well, unless the string you are looking for is really long, so that you can skip substantial parts of the input and not read/download it at all...
Radomir Dopieralski
@Radomir: right -- realistically, it probably doesn't mean much, and your primary focus should be on fast reading.
Jerry Coffin
+3  A: 

If you're looking for a fixed string, you probably want to search for it using something like Boyer-Moore or Boyer-Moore-Horspool (for the latter, I'd recommend Ray Gardner's implementation). Note that B-M and B-M-H are both sublinear on average: they can skip over large parts of the input without examining them. Regular expressions, by contrast, are linear at best[1], and many implementations (those that use backtracking) are quadratic or worse.
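As a reference point, here is a minimal Horspool sketch in C (my own illustration, not Ray Gardner's implementation); the bad-character table is what lets the search shift ahead by up to the full pattern length without looking at the skipped bytes:

    #include <limits.h>
    #include <stddef.h>
    #include <string.h>

    /* Horspool search: return a pointer to the first occurrence of
     * needle[0..nlen) inside haystack[0..hlen), or NULL if absent. */
    const char *bmh_search(const char *haystack, size_t hlen,
                           const char *needle, size_t nlen)
    {
        size_t skip[UCHAR_MAX + 1];
        size_t i;

        if (nlen == 0 || nlen > hlen)
            return nlen == 0 ? haystack : NULL;

        /* Bad-character table: how far the window may shift when the
         * byte aligned with the needle's last position is c. */
        for (i = 0; i <= UCHAR_MAX; i++)
            skip[i] = nlen;
        for (i = 0; i + 1 < nlen; i++)
            skip[(unsigned char)needle[i]] = nlen - 1 - i;

        for (i = 0; i + nlen <= hlen; ) {
            if (memcmp(haystack + i, needle, nlen) == 0)
                return haystack + i;
            i += skip[(unsigned char)haystack[i + nlen - 1]];
        }
        return NULL;
    }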

The next step is to ensure you're reading the data into memory as quickly as possible. In reality, this is typically going to be the bottleneck. Unfortunately, dealing well with it usually requires some non-portable code. Under Linux, mmap tends to be about your best bet, while under Windows you're usually better off reading large chunks at a time and invoking CreateFile with the FILE_FLAG_NO_BUFFERING flag. It's also worth using I/O completion ports (IOCP) for the reads, so the searching and the reading can proceed in parallel.
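A minimal sketch of the Linux side, reusing the file name and fixed pattern from the benchmarks above (memmem() is a GNU/BSD extension; the Horspool routine above could be dropped in instead):

    #define _GNU_SOURCE              /* for memmem() on glibc */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("sof.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file; pages are faulted in on demand. */
        char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        /* Tell the kernel we'll scan front to back, so it reads ahead. */
        madvise(data, st.st_size, MADV_SEQUENTIAL);

        const char *pat = "site design /";
        char *hit = memmem(data, st.st_size, pat, strlen(pat));
        if (hit != NULL)
            printf("found at offset %ld\n", (long)(hit - data));
        else
            puts("not found");

        munmap(data, st.st_size);
        close(fd);
        return 0;
    }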

[1] In theory it would be possible to write an RE engine that did sublinear searching for the right kinds of patterns -- but if there's any that actually does, I'm not aware of it.

Jerry Coffin
Why use `FILE_FLAG_NO_BUFFERING`? Won't that disable the read-ahead, causing *worse* throughput?
Gabe
It disables the read-ahead, hence the recommendation to use IOCP (where you can control the read-ahead explicitly). The problem is that caching presupposes that you'll use the same data again soon, but in this case that's unlikely. At the same time, it can/will flush data from the cache that really does stand a good chance of being used -- and if that data is dirty, flushing it will require writing it out to disk...
Jerry Coffin
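For completeness, a minimal sketch of the chunked, unbuffered Windows read described above (my illustration, synchronous for brevity; the IOCP variant Jerry suggests would issue the next read before searching the current chunk). FILE_FLAG_NO_BUFFERING requires sector-aligned buffers and read sizes, which page-aligned memory from VirtualAlloc and a 64 KiB chunk satisfy on typical volumes:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Read size must be a multiple of the volume sector size when
         * FILE_FLAG_NO_BUFFERING is set; 64 KiB is a safe choice. */
        const DWORD CHUNK = 64 * 1024;

        HANDLE h = CreateFileA("sof.dat", GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING,
                               FILE_FLAG_NO_BUFFERING, NULL);
        if (h == INVALID_HANDLE_VALUE)
            return 1;

        /* The buffer must be sector-aligned as well; VirtualAlloc
         * returns page-aligned memory, which is sufficient. */
        char *buf = VirtualAlloc(NULL, CHUNK, MEM_COMMIT | MEM_RESERVE,
                                 PAGE_READWRITE);
        DWORD got = 0;
        while (ReadFile(h, buf, CHUNK, &got, NULL) && got > 0) {
            /* search buf[0..got) here; keep an overlap of (pattern
             * length - 1) bytes between chunks so a match straddling
             * a chunk boundary can't be missed */
        }

        VirtualFree(buf, 0, MEM_RELEASE);
        CloseHandle(h);
        return 0;
    }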