ansaurus

Question

regex '|' operator vs separate runs for each sub-expression

Answer 1

A:

I'm sure that the one-pass matching will win :)

amartynov 2009-02-24 08:57:21

Answer 2

A:

I suspect that the regex will also do what you are trying to do ... only much better :)

so the "|" would win

Learning 2009-02-24 09:00:54

Answer 3

+6 A:

The two things will give slightly different results, unless it is guaranteed that a match will match one and only one regex. Otherwise if something matches 2 it will be counted twice.

In theory your solution ought to be quicker (if the expression are mutually exclusive) because the regex compiler ought to be able to make a more efficient search state machine, so only one pass is needed. I would expect the difference to be tiny though, unless the expressions are very similar.

Also, if it were a huge string (bigger than 700k) there might be gains from doing one pass, and so a factor of n fewer memory swaps would be needed (to disk or cpu cache).

My bet is in your tests it isn't really noticeable though. I'm interested in the actual result - please do post the results.

Nick Fortescue 2009-02-24 09:03:55

Answer 4

A:

I agree with amartynov but I wanted to add that you also might consider compiling the regex first (re.compile()), esp. in the second variant as then you might save some setup time in the loop. Maybe you can measure this as well while you are on it.

The reason I think the one shot performs better is that I assume that it's fully done in C space and not so much python code needs to be interpreted.

But looking forward to numbers.

MrTopf 2009-02-24 09:05:16

Since every regex in his examples are only used once, you shouldn't get any performance improvement by pre-compiling.

Ferdinand Beyer 2009-02-24 09:12:57

even if you used each regex more than once, python's re module already caches compiled regexes for you, so in the second time onwards it would use the pre-compiled one anyway.

nosklo 2009-02-25 14:07:24

Answer 5

A:

A single compile and search should yield faster results, on a lower scale of expressions the gain could be negligible but the more you run through the greater gain. Think of it as compiling once and matching vs compiling 10 times and matching.

Christian Witts 2009-02-24 09:06:22

Answer 6

A:

I believe your first implementation will be faster:

One of the key principles for Python performance is "move logic to the C level" -- meaning built-in functions (written in C) are faster than pure-Python implementations. So, when the loop is performed by the built-in Regex module, it should be faster
One regex can search for multiple pattens in one pass, meaning it only has to run through your file contents once, whereas multiple regex will have to read the whole file multiple times.

Ferdinand Beyer 2009-02-24 09:10:06

Answer 7

+3 A:

To understand how re module works - compile _sre.c in debug mode (put #define VERBOSE at 103 line in _sre.c and recompile python). After this you ill see something like this:


>>> import re
>>> p = re.compile('(a)|(b)|(c)')
>>> p.search('a'); print '\n\n'; p.search('b')
|0xb7f9ab10|(nil)|SEARCH
prefix = (nil) 0 0
charset = (nil)
|0xb7f9ab1a|0xb7fb75f4|SEARCH
|0xb7f9ab1a|0xb7fb75f4|ENTER
allocating sre_match_context in 0 (32)
allocate/grow stack 1064
|0xb7f9ab1c|0xb7fb75f4|BRANCH
allocating sre_match_context in 32 (32)
|0xb7f9ab20|0xb7fb75f4|MARK 0
|0xb7f9ab24|0xb7fb75f4|LITERAL 97
|0xb7f9ab28|0xb7fb75f5|MARK 1
|0xb7f9ab2c|0xb7fb75f5|JUMP 20
|0xb7f9ab56|0xb7fb75f5|SUCCESS
discard data from 32 (32)
looking up sre_match_context at 0
|0xb7f9ab1c|0xb7fb75f4|JUMP_BRANCH
discard data from 0 (32)
|0xb7f9ab10|0xb7fb75f5|END




|0xb7f9ab10|(nil)|SEARCH
prefix = (nil) 0 0
charset = (nil)
|0xb7f9ab1a|0xb7fb7614|SEARCH
|0xb7f9ab1a|0xb7fb7614|ENTER
allocating sre_match_context in 0 (32)
allocate/grow stack 1064
|0xb7f9ab1c|0xb7fb7614|BRANCH
allocating sre_match_context in 32 (32)
|0xb7f9ab20|0xb7fb7614|MARK 0
|0xb7f9ab24|0xb7fb7614|LITERAL 97
discard data from 32 (32)
looking up sre_match_context at 0
|0xb7f9ab1c|0xb7fb7614|JUMP_BRANCH
allocating sre_match_context in 32 (32)
|0xb7f9ab32|0xb7fb7614|MARK 2
|0xb7f9ab36|0xb7fb7614|LITERAL 98
|0xb7f9ab3a|0xb7fb7615|MARK 3
|0xb7f9ab3e|0xb7fb7615|JUMP 11
|0xb7f9ab56|0xb7fb7615|SUCCESS
discard data from 32 (32)
looking up sre_match_context at 0
|0xb7f9ab2e|0xb7fb7614|JUMP_BRANCH
discard data from 0 (32)
|0xb7f9ab10|0xb7fb7615|END

>>>

Mykola Kharechko 2009-02-24 15:40:35

Replace #undef by #define for VERBOSE and VVERBOSE in http://svn.python.org/view/python/trunk/Modules/_sre.c?view=markup

J.F. Sebastian 2009-03-03 15:55:42

What's that supposed to mean?

ThomasH 2009-10-07 21:16:12

Answer 8

A:

The fewer passes the better: It'll just use more memory, which is typically not an issue.

If anything can be left to the interpreter to handle, it will always find a faster solution (both in time to implement and time to execute) than the typical human counterpart.

Nerdling 2009-02-24 19:59:17

ansaurus

tags:

views:

answers:

regex '|' operator vs separate runs for each sub-expression

related questions