views:

1064

answers:

6

I'm looking for a performance comparison between perl and boost regular expression.
I need to design a piece of code which relies very heavily on regular expressions, and can choose between:

  1. running it through a boost regex
  2. dispatching a perl interpreter and do the work in perl

I know perl is known for it's optimized string processing. However, I can't find a performance comparison to boost regex library.
Do you know of any such comparison?
Thanks

+6  A: 

If you haven't seen it yet, there's a regexp benchmark in the Great Language Shootout. It doesn't rank Perl very high at all. A Boost implementation using boost::xpressive is ranked first (which pre-compiles the expression at compile time). However, this is a microbenchmark, so probably not representative of general regular expression speed, but still worth a look.

Surprisingly enough, apparently the fastest regular expression engine by far is Google Chrome's V8 JavaScript JIT (almost beats GCC in wall-clock time, utilizing just a single CPU core)

intgr
I seriously doubt that regex benchmark is accurate. I've never seen Python's regex engine outperform Perl's.
Chris Lutz
Someone contributed a Boost benchmark now, and it's in first place. Did someone contribute this today?
Ken Bloom
Also, there are two Perl versions that occur in the "Interesting Alternatives" section that are faster than anything else.
Chris Lutz
@Chris Lutz "Interesting Alternatives" - because those Perl programs split the pattern beforehand - contribute a Perl program that doesn't do that.
igouy
@Ken Bloom "Did someone contribute this today?" - No, that's been around since mid-August but V8 was rebuilt and remeasured. Also sort by CPU secs and see what difference that makes.
igouy
igouy
The ones that are *much* faster than Perl are running multithreaded (mostly functional code, although one Java one manages it). The notable exception is the V8 JavaScript version, which manages almost 10x faster than Perl on one core.
hobbs
@Ken Bloom: Sorry, I didn't look into the C++ program and I didn't realize that it uses Boost. Sorry about misinformation. :(
intgr
+8  A: 

The startup cost of running a Perl interpreter from within your application (via the system function I presume) will outweigh any benefits you gain over using Perl's regex engine. The exception would be if you have a VERY complicated regular expression that Perl's regex implementation happens to be optimised for but boost's regex engine isn't.

The real answer is that I do not know of any such comparison, but Perl's regular expression facilities are not necessarily the fastest. See here for some information about an algorithm that beats Perl's regular expression for some expressions.

EDIT: It is possible to overcome the startup cost of starting a full perl interpreter by linking to libperl or using libPCRE. And using boost will probably give you more flexibility and performance tuning options if you need them.

Final Note: There are no known direct comparisons between boost.regex and Perl's regex in terms of performance. The solution is to try both and see which is more performant for the OP's specific situation.

barkmadley
You can also link to libperl and use Perl's regular expressions without having the overhead of the full Perl interpreter. The API may not be designed for it, but it's certainly possible.
Chris Lutz
There's also libPCRE: http://www.pcre.org/
greyfade
Technically, PCRE has some slight differences from Perl's regexes, but they're edge cases. I would bet Perl's regexes are slightly faster but I doubt it matters.
Chris Lutz
That link is mainly about how basic regular expressions that act as a true FSA have a known performance and that newer features add complexity which can degrade performance. I have used those findings to write more determinant REs using Perl syntax.
Axeman
"that beats Perl's regular expression for some expressions" ... by constructing very specific regular expressions in a peculiar way. It sends the NFA into a sort of worst case scenario. You can write equivalent expressions that perform much better with perl's regular expression engine. The DFA that is used in that biased study doesn't support the advanced, non-regular, and hugely useful features of a perl or PCRE regular expression engine. Let's put this together: With the NFA, you *can* get great performance and great features. With the DFA you only get great performance. Choose wisely.
tsee
A: 

The perl interpreter is going to be a fixed cost. If the time saved by running your data through the interpreter greatly outweighs the interpreter costs(ie, you have a lot of data), you will have a performance boost.

It's probable that you're best of with pure C++ here, just because of the process invocation.

Sorry, I don't have data. Love to see your test results though.

Paul Nathan
+2  A: 

Unless your regex is insanely complex (for which perl's regex engine is incredibly fast by the way) then as other's have said, your overhead is in interpreter startup. On the other hand you could run a persistent perl that provides a regex server quite easily.

singingfish
+6  A: 

If your regular expressions are fixed at compile time, you could also consider Boost.XPressive. It allows one to write regexes as expression templates that are parsed at compile time.

Éric Malenfant
+2  A: 

Start with the simplest solution. Decide how fast it needs to be for your application. Then measure the speed. If it's too slow, try the harder solution. Measure again. Repeat as necessary.

While my gut agrees with most of the other answers saying that starting the interpreter will be more expensive, you'll never know until you measure.

There's "fastest possible" and "fast enough for your application". Don't add complexity to get the former if you already have the latter.

Adrian McCarthy