views:

192

answers:

4

I am creating a Perl script which will have to process the markup of millions of Wikipedia articles - so speed is an issue.

One of the things I'm looking for are occurrences of templates, which always look like this: {{template}}. Because these can be complicated and nested, I need to find the start and end tags separately, and know the character indexes where they are found.

So here is some simple code (assume $text is the text with the templates in it):

my $matchIndex ;

my $startCount = 0 ;
my $endCount = 0 ;

# find all occurrences of template start and template end tags
while($text =~ m/(\{\{)|(\}\})/gs) {

    $matchIndex = $+[0] ;

    if (defined $1) {
     #this is the start of a template
     $startCount ++ ;
    } else {
     #this is the end of a template
     $endCount++ ;
    }
 }

The really weird thing about this code is that the $matchIndex = $+[0] ; line makes a huge difference to efficiency, even though it is just looking up a value in an array. Without this commented out, a complex Wikipedia article (containing 2000 templates - crazy but it happens) is processed in 0m0.080s. Keeping it there bumps the time up to 0m2.646s. What the hell?

Maybe it sounds like I'm splitting hairs, but this is the difference between processing Wikipedia in hours, or processing it in weeks.

+4  A: 

Update:

Your timings are a little suspect:

#!/usr/bin/perl

use strict;
use warnings;

my $text = '{{abcdefg}}' x 100_000;

my @match_pos;
my ($start_count, $end_count);

while ( $text =~ /({{)|(}})/g ) {
    push @match_pos, $-[0];
    if ( defined $1 ) {
        ++$start_count;
    }
    else {
        ++$end_count;
    }
}

Let's time it:

C:\Temp> timethis zxc.pl

TimeThis :  Command Line :  zxc.pl
TimeThis :  Elapsed Time :  00:00:00.985

Replacing $-[0] with length $` takes too long to complete (I pressed CTRL-C after a minute).

If I make 2_000 copies of the simple pattern above, the timings end up being the same (about .2 seconds). So, I would recommend using $-[0] for scalability.

Previous discussion

From perldoc perlvar:

# @LAST_MATCH_START
# @-

$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.

See also @+.

The s option in your regex is unnecessary because there is no . in the pattern.

Have you looked at Text::Balanced?

You could also use pos, although I am not sure if it will satisfy your performance requirements.

#!/usr/bin/perl

use strict;
use warnings;

use File::Slurp;

my $text = read_file \*DATA;

my @match_pos;
my ($start_count, $end_count);

while ( $text =~ /({{)|(}})/g ) {
    push @match_pos, pos($text) - 2;
    # push @match_pos, $-[0]; # seems to be slightly faster
    if ( defined $1 ) {
        ++$start_count;
    }
    else {
        ++$end_count;
    }
}

for my $i ( @match_pos ) {
    print substr($text, $i, 2), "\n";
}

__DATA__
Copy & paste the source of the complicated Wikipedia page here to test.
Sinan Ünür
Umm, thanks, that copy paste job was real helpful... The if(defined $1) thing works fine, and is very quick. If I am in the loop, it means the match succeeded, so that isn't what I'm testing. There are two groups in the regex - one for the start of the template, one for the end. If $1 is defined, I've matched the first group, so it's the start of the template. Otherwise it's the end.
@David Milne Feel free to downvote unhelpful answers but your sarcasm is not helpful either. I misread the patter because of the unnecessary clutter created by escaping the `{` and `}` characters. They are not special and need not be escaped.
Sinan Ünür
Actually, `{x}` (where x is a number or number pair) is used to specify repetitions of the previous item in regex - for instance, `\s{2,5}` matches 2-5 whitespace characters.
Amber
@Dav How does that apply to the code above?
Sinan Ünür
You stated that `{` and `}` are not special and need not be escaped. This is not a true statement.
Amber
@Dav **in the context of the code given by the OP**. The OP did not need to escape them. There is a difference between `{{` and `\s{2,5}`.
Sinan Ünür
@Dav Try: `my $x = '{{{{{{{';``while ( $x =~ /({)/g ) {`` print "$1\n";``}`
Sinan Ünür
Indeed, even \s{,5} is not special; it matches a space char, left curly, comma, digit 5, right curly. Nevertheless, if you have arbitrary input, you should escape curlies.
ysth
aha, now try with "{{abc\x{10f}efg}}" instead of "{{abcdefg}}" (but be prepared to wait a *long* time)
ysth
@David Thank you. Now, using my code, both pos and $-[0] give similar timings, roughly 0.3s. On the other hand, using length $\` takes twice as long. As I mentioned in my updated answer, the code slows down a lot for larger inputs if $\` is used, so avoid that.
Sinan Ünür
@ysth I did (using `$-[0]`) and `timethis` gave me `1.2` seconds with `100_000` copies. Could you expand on your comment please?
Sinan Ünür
If I use your code exactly Sinan, then yes, I get similar times for both approaches: 0m0.181s for $-[0], and 0m0.185s for pos.BUT, if I change the encoding of the input to utf8, then the times go back to the way I showed earlier: 0m2.659s for $-[0] and 0m0.175s for pos. Try it! weird, huh?Ah well, thanks for your help. I guess I will stick with pos (Wikipedia has a lot of special characters).
Sinan Ünür
If you are still interested, try modifying the way your data is read with <code>binmode(DATA, ':utf8');</code>
@David Nope, does not make any difference. Same time with or without 'utf8'. Same time with `$-[0]` and `pos`. Maybe its the version of perl I am using ActiveState Perl 5.10.1004 on Windows.
Sinan Ünür
@Brian: the nesting thing is exactly WHY I am doing what I am doing. I am tracking both opening and closing brackets, so I can use a FILO stack to keep track of nesting (push openings onto the stack, pop them off when we find their closings). If nesting wasn't a problem, I would just try to capture each whole template with a single regex. Stacks are perfect for this job - it's how all parsers go about this kind of problem.
@Brian: and by the way, the benchmark document I posted has a LOT of nesting.
nope, all you need are the character indexes. On an open tag, push the character index onto the stack. On a close tag, pop it off. At this point you know exactly where your template is. And if there is anything left on the stack you know it is nested inside another template. Simple.
+3  A: 

$+[0] is not just an array lookup; it uses the magic interface to delve into the regex result structure to look up the desired value. But I have trouble believing that 2000 iterations of that are taking 2 seconds. Can you post an actual benchmark?

Did you try using pos, as suggested by Sinan Ünür?

Update: it occurs to me that the translation between byte offset and character offset (which should be efficiently cached) may be degrading your performance. Try running utf8::encode() on your string initially, then utf8::decode on individual pieces of captured text if needed.

ysth
@ysth Inserting `\x{10f}` in the middle seems to cause a 20% slowdown with `100_000` copies of the simple string I had which, I presume, is related to the byte versus character offset issue you mention above. Still, it is not an order of magnitude slowdown.
Sinan Ünür
@ysth It really does seem like this difference between byte offset and character offset is the problem, because the slow down only occurs when I read the document in as utf8. All I am doing is using binmode to set the encoding of the input stream to utf8. By my understanding that is all I need to do, but should I be doing something to the resulting string as well?
+5  A: 

Why are you using a regex? You're looking for the position of the literal text {{ or }}. Perl has a built-in that does exactly that: index.

Since you are trying to parse Wikipedia entries, you need to handle nested template directives. This means that, for instance, the second set of closing curlies you found doesn't necessarily go with the second set of open curlies. In this bit from the Perl entry, the first closing curly goes with the second opening one:

{{Infobox programming language
| latest_release_version = 5.10.0
| latest_release_date    = {{release date|mf=yes|2007|12|18}}
| turing-complete        = Yes
}}

Perl 5.10 regexes can handle this for you since they can match balanced text recursively, and there are Perl modules to do it as well. That's going to be a bit of work, though. It's difficult to give you any advice until you say what you are trying to accomplish. Surely there is a mediawiki parser out there that can do what you are trying to do.


I was going to code up my index() solution, but I didn't. I can't get your code to be slow enough that it matters. Both the pos() and the @- solutions complete virtually instanteously for me, even when I do all of the stack management and print the contents of each template. I had to try really hard to make it run slow enough to be measurable, and I'm on some old hardware. You might need to tune your application in some other way.

Are you sure that the code you are measuring is slowing down at the point you think it is? Have you profiled it with Devel::NYTProf to see what your real program is doing?

#!/usr/bin/perl
use strict;
use warnings;

use Benchmark;

my $text = do { local $/; <DATA> }; # put the contents after __END__

my %subs = (
 using_pos     => sub {
  my $page = shift;

  my @stack;
  my $found;
  while( $$page =~ m/ ( \{\{ | }} ) /xg ) {   
   if( $1 eq '{{' ) { push @stack, pos($$page) - 2; }
   else             
    { 
    my $start = pop @stack;
    print STDERR "\tFound at $start: ", substr( $$page, $start, pos($$page) - $start ), "\n";
    $found++;
    };
   }

  print " Processed $found templates => ";
  },

 using_special => sub {
  my $page = shift;

  my @stack;
  my $found;
  while( $$page =~ m/ ( \{\{ | }} ) /xg ) {   
   if( $1 eq '{{' ) { push @stack, $-[0]; }
   else             
    { 
    my $start = pop @stack;
    print STDERR "\tFound at $start: ", substr( $$page, $start, $-[0] - $start ), "\n";
    $found++;
    };
   }

  print " Processed $found templates => ";
  },

 );

foreach my $key ( keys %subs )
 {
 printf "%15s => ", $key;

 my $t = timeit( 1, sub{ $subs{$key}->( \$text ) } );
 print timestr($t), "\n";
 }

My perl on my 17" MacBook Pro:

macbookpro_brian[349]$ perl -V
Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=darwin, osvers=8.8.2, archname=darwin-2level
    uname='darwin macbookpro.local 8.8.2 darwin kernel version 8.8.2: thu sep 28 20:43:26 pdt 2006; root:xnu-792.14.14.obj~1release_i386 i386 i386 '
    config_args='-des'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -I/opt/local/include',
    optimize='-O3',
    cppflags='-no-cpp-precomp -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -I/opt/local/include'
    ccversion='', gccversion='4.0.1 (Apple Computer, Inc. build 5363)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags =' -L/usr/local/lib -L/opt/local/lib'
    libpth=/usr/local/lib /opt/local/lib /usr/lib
    libs=-ldbm -ldl -lm -lc
    perllibs=-ldl -lm -lc
    libc=/usr/lib/libc.dylib, so=dylib, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup -L/usr/local/lib -L/opt/local/lib'


Characteristics of this binary (from libperl): 
  Compile-time options: PERL_MALLOC_WRAP USE_LARGE_FILES USE_PERLIO
  Built under darwin
  Compiled at Apr  9 2007 10:36:26
  @INC:
    /usr/local/lib/perl5/5.8.8/darwin-2level
    /usr/local/lib/perl5/5.8.8
    /usr/local/lib/perl5/site_perl/5.8.8/darwin-2level
    /usr/local/lib/perl5/site_perl/5.8.8
    /usr/local/lib/perl5/site_perl
brian d foy
the nesting thing is exactly WHY I am doing what I am doing. I am tracking both opening and closing brackets, so I can use a FILO stack to keep track of nesting (push openings onto the stack, pop them off when we find their closings). If nesting wasn't a problem, I would just try to capture each whole template with a single regex. Stacks are perfect for this job - it's how all parsers go about this kind of problem. The benchmark document I posted has exactly the nesting you are talking about.
The reason I am not using index is that I would like the approach to generalize to other problems - detecting links, html tags, etc. I haven't found any lightweight Perl parsers (MediaWiki describes it's own parser as "laberinthine"), and already have quite a large investment of code thrown at getting what I want from Wikipedia.
arrgh. I cut down the code and context to a minimum to ask one simple question: why did $+[0] take so long and whether there was there a faster alternative. I didn't provide an exhaustive account of what I am trying to do because I didn't come here for help with that. In future, please focus on the specific question, rather than starting a pissing contest with comments like "This technique has no hope of working."
@brian. Umm, ok. We both agree that stacks are a good way to go about this, so that's cool. Once you started coding up your solution you realized the advantage of using regex - you can search for both start and end points simultaneously (good because we don't know which one will come first, and there is no need to inefficiently use index to search for one, and then the other). So that part of my solution is cool too. Where exactly is the point where I got emotionally involved with an unreasonable solution, and needed this "real help"?
"We both agree that stacks are a good way to go about this" "WE BOTH AGREE". So I'm not saying that you were saying that... Argh forget it.
I've deleted all my comments, regret trying to help you, and won't make that mistake again. Sorry for the fuss.
brian d foy
A: 

Unless you are running it on the Wikipedia sever, the network latency will be an order of magnitude more significant than tweaks to your script, and even then it will be marginal.

The MediaWiki API and CPAN JSON module might be of more use to you, depending on what you are trying to do of course.

Chris Huang-Leaver
I'm running these scripts on the Wikipedia xml dumps (locally), so network latency is not an issue.