ansaurus

Question

Standard Regex vs python regex discrepancy

Answer 1

+3 A:

1

Use bool(p.match('aa')) to check if the regexp matches or not
p = re.compile('b*(abb*)*a?$')
\b matches border of string; place between \w and \W (word characters and non-word characters)

2

Regexp is quite standard in python. Yet every language has some flavour of them, they are not 100% portable. There are minor differences which you're expected to lookup prior to using regexp in any specific language.

Addition

\epsilon does not have special symbol in python. It is an empty character set.

In your example a|\epsilon is equivalent to (a|) or just a?. After which $ is obligatory to match end of string.

Antony Hatchkins 2010-01-12 14:44:51

I don't think the OP wants a word boundary... you can use epsilon in the middle of a word... it just means the empty string... Also, by "standard" I think OP means the kind of regexps used in theory of computation textbooks... without . or ^$ or \w or [1-9] or {3}, but with \epsilon, \lambda etc.

Brian Postow 2010-01-12 14:55:39

I am not sure what you mean by "That's why in books they invent some special characters which you're expected to lookup prior to using in any specific language.". Please clarify/rewrite and I will accept.

drozzy 2010-01-12 15:47:57

That was a quick guess. It has been long time since I studied theoretical regexp. Removed. Forget it :)

Antony Hatchkins 2010-01-12 16:17:28

Answer 2

+3 A:

I'm not exactly sure how match works in python, but I think you might need to add ^....$ to your RE. RegExp matching usually matches sub-strings, and it finds the largest match, in the case of p.match('aa') that's "a" (probably the first one). ^...$ makes sure that you're matching the ENTIRE string, which is I believe what you want.

Theoretical/standard reg exps assume that you're always matching the whole string, because you're using it to define a language of strings that match, not find a substring in an input string.

Brian Postow 2010-01-12 14:52:08

^ is not necessary here. It is assumed in re.match. In re.search it is not which is the only difference between those two.

Antony Hatchkins 2010-01-12 14:52:57

interesting is $ needed though? because if it isn't you're regexp needs to be ...(a$|$) otherwise it matches anything with an a in it...

Brian Postow 2010-01-12 15:01:31

`$` represents a line end, I don't think this is what you're looking for. `re.match` does that already as with `^` (for line beginning).

jathanism 2010-01-12 15:08:55

Pierre's examples would seem to differ, implying that ^$ are in fact necessary... and ^$ usually don't mean LINE end and beginning, they mean STRING end and beginning...

Brian Postow 2010-01-12 15:14:35

match already forces the match to start at the beggining of the string but does not force the end, so you should use `$`

João Portela 2010-01-12 15:25:48

ah, ok, so $ is necessary, but ^ is not. Got it

Brian Postow 2010-01-12 15:38:39

yep, that's right

Antony Hatchkins 2010-01-12 16:13:38

Answer 3

+1 A:

You're matching because your regex matches any zero-width segment of any specimen text. You need to anchor your regex. Here's one way of doing it, using a zero-width lookahead assertion:

re.compile(r'^(a(?!a)|b)*$')

Jonathan Feinberg 2010-01-12 14:57:04

Answer 4

+5 A:

Actually, the example works just fine ... to a small details. I would write:

>>> p = re.compile('b*(abb*)*a?')
>>> m = p.match('aa')
>>> print m.group(0)
'a'
>>> m = p.match('abbabbabababbabbbbbaaaaa')
>>> print m.group(0)
abbabbabababbabbbbba

Note that the group 0 returns the part of the string matched by the regular expression.

As you can see, the expression matches a succession of a and b without repetition of a. If indeed, you want to check the whole string, you need to changed slightly:

>>> p = re.compile('^b*(abb*)*a?$')
>>> m = p.match('aa')
>>> print m
None

the ^ and $ force recognition of the beginning and end of the string.

At last, you can combine both methods by using the first regular expression, but testing at the end:

>>> len(m.group(0)) == len('aa')

Added: For the second part of the OT, it seems to me there is no discrepancy between the standard regex and the python implementation. Of course, the notation is slightly different, and the python implementation suggest some extensions (as most other packages).

PierreBdR 2010-01-12 15:08:42

+1 for beating me to the answer! :) btw the `^` is not mandatory because re.match() only attempts the pattern at the very start of the string.

João Portela 2010-01-12 15:11:00

oh.. your example is wrong. `p = re.compile('b*(abb)*a?')` does not match 'aba'

João Portela 2010-01-12 15:14:58

oops .. just forgot a star in the first regular expression ... corrected!

PierreBdR 2010-01-12 15:17:52

yet you didn't answer the OP's 'two-fold' question ;)

Antony Hatchkins 2010-01-12 16:12:07

Answer 5

+1 A:

Your second re should be an appropriate replacement for epsilon, as best as I understand it, though I've never seen epsilon in a regex before.

For what it's worth, your pattern is matching 'a'. That is to say, it is matching:

zero or more "b"s (choosing zero)
zero or more "(abb*)"s (choosing zero)
one "a" or word ending (choosing an a).

As Jonathan Feinberg pointed out, if you want to ensure the whole string matches, you have to anchor the beginning ('^') and end ('$') of your regex. You should also use a raw string whenever constructing regexes in python: r'my regex'. That will prevent excessive backslash escaping confusion.

jcdyer 2010-01-12 15:17:22

Answer 6

+1 A:

the problem with your expression is that it matches the empty string, meaning that if you do:

>>> p = re.compile('b*(abb*)*(a|)')
>>> p.match('c').group(0)
''

and since re.match attempts to match the start of the string, you have to tell it to match it until the end of the string. just use $ for that

>>> p = re.compile(r'b*(abb*)*(a|)$')
>>> print p.match('c')
None
>>> p.match('ababababab').group(0)
'ababababab'

ps- you may have noted that i used r'pattern' instead of 'pattern' more on that here (first paragraphs)

João Portela 2010-01-12 15:17:59

Answer 7

+3 A:

drozzy 2010-01-12 17:24:04

Excellent summing up of the answers!

Brian Postow 2010-01-12 21:39:33

"...python matches the shortest substring..." is wrong. It just doesn't necessarily match the longest substring, like a mathematically-correct regular expression would.

Alan Moore 2010-01-12 22:57:11

@Alan: It matches the shortest substring if no ^ or $ is provided.

drozzy 2010-01-16 21:49:24

Consider the regex `a+|a+b*` applied to `aaabb`. The shortest possible match would be `a`, while the longest match is the whole string, but Python matches `aaa`. It tries various ways to find a match, determined by the way the regex is written, and stops as soon as it finds one that works. If you reverse the order of the alternatives - `a+b*|a+` - it will match the whole string. If you add an anchor - `(a+|a+b*)$` - the first alternative fails because it doesn't reach the end, but the second alternative succeeds. Not the shortest possible match, just the first.

Alan Moore 2010-01-17 04:08:33

By the way, this "regex-directed" behavior is not unique to Python. All of the so-called Perl-compatible flavors do the same: Perl, PHP, Ruby JavaScript, Java, .NET, etc.. ( http://www.regular-expressions.info/engine.html )

Alan Moore 2010-01-17 04:21:51

@Alan: Yes python does "match" aaa, but it does NOT match "aaab". What I mean is that python will match the shortest possible match. Errr... or python will match any substring if it matches. But thanks for pointing that out, my wording is not perfect.

drozzy 2010-01-17 19:06:09

actually, a mathematically correct regexp doesn't match the "longest" substring. It either matches the WHOLE STRING, or it doesn't. there's no "it matched most of the string..." which is why I keep getting confused. I would think that in a PL, matching the longest would be more useful than the shortest... I mean, if you want to match 'a*' in babaaaaaaa you don't want to find the empty string before the b... and if you search in aaaabaa you don't want to find the empty string before the 1st a, do you?

Brian Postow 2010-01-19 20:42:40

ansaurus

tags:

views:

answers:

Standard Regex vs python regex discrepancy

related questions