Why are you using a regex? You're looking for the position of the literal text {{ or }}. Perl has a built-in that does exactly that: index.
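As a minimal sketch of what that could look like (the helper name positions_of and the sample string are made up for illustration, not taken from your code), you can walk the string with the three-argument form of index, restarting the search just past each hit:

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset of every occurrence of $needle in $text.
sub positions_of {
    my( $text, $needle ) = @_;
    my @positions;
    my $from = 0;
    while( ( my $at = index( $text, $needle, $from ) ) != -1 ) {
        push @positions, $at;
        $from = $at + length $needle;   # resume the search after this occurrence
    }
    return @positions;
}

my $sample = 'a {{b}} c {{d}}';         # made-up sample text
my @open   = positions_of( $sample, '{{' );
my @close  = positions_of( $sample, '}}' );
print "open:  @open\n";                 # open:  2 10
print "close: @close\n";                # close: 5 13
```

index returns -1 when there is nothing more to find, so the loop stops without any regex machinery involved.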
Since you are trying to parse Wikipedia entries, you need to handle nested template directives. This means that, for instance, the second set of closing curlies you found doesn't necessarily go with the second set of open curlies. In this bit from the Perl entry, the first closing curly goes with the second opening one:
{{Infobox programming language
| latest_release_version = 5.10.0
| latest_release_date = {{release date|mf=yes|2007|12|18}}
| turing-complete = Yes
}}
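One way to pair the delimiters correctly is to keep a stack of open positions and pop it each time you see a closer. The following is only a sketch under some assumptions: find_templates is a hypothetical name, the sample is a shortened copy of the infobox above, and it ignores MediaWiki's triple-brace parameter syntax entirely.

```perl
use strict;
use warnings;

# Hypothetical helper: pair each '}}' with the most recent unmatched '{{'.
# Returns a list of [ start, length ] pairs, innermost templates first.
sub find_templates {
    my( $text ) = @_;
    my( @stack, @spans );
    my $from = 0;
    while( 1 ) {
        my $open  = index( $text, '{{', $from );
        my $close = index( $text, '}}', $from );
        last if $close == -1;                    # no closers left, so we are done
        if( $open != -1 and $open < $close ) {   # the next delimiter is an opener
            push @stack, $open;
            $from = $open + 2;
        }
        else {                                   # the next delimiter is a closer
            my $start = pop @stack;
            # An unbalanced closer has nothing to pair with; skip it.
            push @spans, [ $start, $close + 2 - $start ] if defined $start;
            $from = $close + 2;
        }
    }
    return @spans;
}

my $text = <<'WIKI';
{{Infobox programming language
| latest_release_date = {{release date|mf=yes|2007|12|18}}
}}
WIKI

for my $span ( find_templates( $text ) ) {
    my( $start, $length ) = @$span;
    print "Found at $start: ", substr( $text, $start, $length ), "\n";
}
```

The inner release date template is reported first because its closer is the first one encountered; the enclosing Infobox template follows once its own closer is reached.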
Perl 5.10 regexes can handle this for you since they can match balanced text recursively, and there are Perl modules to do it as well. That's going to be a bit of work, though. It's difficult to give you any advice until you say what you are trying to accomplish. Surely there is a MediaWiki parser out there that can do what you are trying to do.
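For illustration, here is roughly what the recursive approach can look like with the (?1) construct that arrived in Perl 5.10. Treat it as a sketch: the sample string is made up, and the pattern assumes a template body never contains a lone { or }, which real wikitext does not guarantee.

```perl
use strict;
use warnings;
use 5.010;    # recursive patterns and possessive quantifiers need Perl 5.10 or later

my $text = 'x {{a|{{b|1}}|c}} y {{d}}';    # made-up sample text

# Group 1 is one whole template: an opener, then any mix of brace-free runs
# and nested templates (recursing into group 1), then a closer.
my @top_level = $text =~ m/ ( \{\{ (?: [^{}]++ | (?1) )* \}\} ) /xg;

print "$_\n" for @top_level;
```

Each match is a complete top-level template with its nested templates included, so you would run the same pattern again on the inside of a match if you need the inner ones separately.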
I was going to code up my index() solution, but I didn't. I can't get your code to be slow enough that it matters. Both the pos() and the @- solutions complete virtually instantaneously for me, even when I do all of the stack management and print the contents of each template. I had to try really hard to make it run slow enough to be measurable, and I'm on some old hardware. You might need to tune your application in some other way.
Are you sure that the code you are measuring is slowing down at the point you think it is? Have you profiled it with Devel::NYTProf to see what your real program is doing?
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;

# Slurp the page text: add an __END__ line to this file and put the text after it
my $text = do { local $/; <DATA> };

my %subs = (
    # pos() is the offset just past the current match
    using_pos => sub {
        my $page = shift;
        my @stack;
        my $found = 0;
        while( $$page =~ m/ ( \{\{ | \}\} ) /xg ) {
            if( $1 eq '{{' ) { push @stack, pos($$page) - 2; }
            else {
                my $start = pop @stack;
                next unless defined $start;   # unbalanced }}: nothing to pair it with
                print STDERR "\tFound at $start: ", substr( $$page, $start, pos($$page) - $start ), "\n";
                $found++;
            }
        }
        print " Processed $found templates => ";
    },
    # $-[0] is the offset where the current match starts, $+[0] where it ends
    using_special => sub {
        my $page = shift;
        my @stack;
        my $found = 0;
        while( $$page =~ m/ ( \{\{ | \}\} ) /xg ) {
            if( $1 eq '{{' ) { push @stack, $-[0]; }
            else {
                my $start = pop @stack;
                next unless defined $start;   # unbalanced }}: nothing to pair it with
                # use $+[0] so the closing }} is included, as in using_pos
                print STDERR "\tFound at $start: ", substr( $$page, $start, $+[0] - $start ), "\n";
                $found++;
            }
        }
        print " Processed $found templates => ";
    },
);

foreach my $key ( sort keys %subs ) {
    printf "%15s => ", $key;
    my $t = timeit( 1, sub { $subs{$key}->( \$text ) } );
    print timestr($t), "\n";
}
My perl on my 17" MacBook Pro:
macbookpro_brian[349]$ perl -V
Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
Platform:
osname=darwin, osvers=8.8.2, archname=darwin-2level
uname='darwin macbookpro.local 8.8.2 darwin kernel version 8.8.2: thu sep 28 20:43:26 pdt 2006; root:xnu-792.14.14.obj~1release_i386 i386 i386 '
config_args='-des'
hint=recommended, useposix=true, d_sigaction=define
usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -I/opt/local/include',
optimize='-O3',
cppflags='-no-cpp-precomp -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -I/opt/local/include'
ccversion='', gccversion='4.0.1 (Apple Computer, Inc. build 5363)', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags =' -L/usr/local/lib -L/opt/local/lib'
libpth=/usr/local/lib /opt/local/lib /usr/lib
libs=-ldbm -ldl -lm -lc
perllibs=-ldl -lm -lc
libc=/usr/lib/libc.dylib, so=dylib, useshrplib=false, libperl=libperl.a
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup -L/usr/local/lib -L/opt/local/lib'
Characteristics of this binary (from libperl):
Compile-time options: PERL_MALLOC_WRAP USE_LARGE_FILES USE_PERLIO
Built under darwin
Compiled at Apr 9 2007 10:36:26
@INC:
/usr/local/lib/perl5/5.8.8/darwin-2level
/usr/local/lib/perl5/5.8.8
/usr/local/lib/perl5/site_perl/5.8.8/darwin-2level
/usr/local/lib/perl5/site_perl/5.8.8
/usr/local/lib/perl5/site_perl