views:

354

answers:

7

I am reading a book and they provide an example of how to match a given string with regular expressions. Here is their example:

b*(abb*)*(a|∊) - Strings of a's and b's with no consecutive a's.

Now I've tried converting it to python like so:

>> p = re.compile(r'b*(abb*)*(a|)') # OR
>> p = re.compile(r'b*(abb*)*(a|\b)')

# BUT it still doesn't work
>>> p.match('aa')
<_sre.SRE_Match object at 0x7fd9ad028c68>

My question is two-fold:

  1. What is the equivalent of epsilon in python to make the above example work?
  2. Can someone explain to me why theoretical or standard way of doing regular expressions does not work in python? Might it have something to do with the longest vs shortest matching?

Clarification: For people asking what standard regex is - it is the formal language theory standard: http://en.wikipedia.org/wiki/Regular_expression#Formal_language_theory

+3  A: 

1

  • Use bool(p.match('aa')) to check if the regexp matches or not

  • p = re.compile('b*(abb*)*a?$')

  • \b matches border of string; place between \w and \W (word characters and non-word characters)

2

Regexp is quite standard in python. Yet every language has some flavour of them, they are not 100% portable. There are minor differences which you're expected to lookup prior to using regexp in any specific language.

Addition

\epsilon does not have special symbol in python. It is an empty character set.

In your example a|\epsilon is equivalent to (a|) or just a?. After which $ is obligatory to match end of string.

Antony Hatchkins
I don't think the OP wants a word boundary... you can use epsilon in the middle of a word... it just means the empty string... Also, by "standard" I think OP means the kind of regexps used in theory of computation textbooks... without . or ^$ or \w or [1-9] or {3}, but with \epsilon, \lambda etc.
Brian Postow
I am not sure what you mean by "That's why in books they invent some special characters which you're expected to lookup prior to using in any specific language.". Please clarify/rewrite and I will accept.
drozzy
That was a quick guess. It has been long time since I studied theoretical regexp. Removed. Forget it :)
Antony Hatchkins
+3  A: 

I'm not exactly sure how match works in python, but I think you might need to add ^....$ to your RE. RegExp matching usually matches sub-strings, and it finds the largest match, in the case of p.match('aa') that's "a" (probably the first one). ^...$ makes sure that you're matching the ENTIRE string, which is I believe what you want.

Theoretical/standard reg exps assume that you're always matching the whole string, because you're using it to define a language of strings that match, not find a substring in an input string.

Brian Postow
^ is not necessary here. It is assumed in re.match. In re.search it is not which is the only difference between those two.
Antony Hatchkins
interesting is $ needed though? because if it isn't you're regexp needs to be ...(a$|$) otherwise it matches anything with an a in it...
Brian Postow
`$` represents a line end, I don't think this is what you're looking for. `re.match` does that already as with `^` (for line beginning).
jathanism
Pierre's examples would seem to differ, implying that ^$ are in fact necessary... and ^$ usually don't mean LINE end and beginning, they mean STRING end and beginning...
Brian Postow
match already forces the match to start at the beggining of the string but does not force the end, so you should use `$`
João Portela
ah, ok, so $ is necessary, but ^ is not. Got it
Brian Postow
yep, that's right
Antony Hatchkins
+1  A: 

You're matching because your regex matches any zero-width segment of any specimen text. You need to anchor your regex. Here's one way of doing it, using a zero-width lookahead assertion:

re.compile(r'^(a(?!a)|b)*$')
Jonathan Feinberg
+5  A: 

Actually, the example works just fine ... to a small details. I would write:

>>> p = re.compile('b*(abb*)*a?')
>>> m = p.match('aa')
>>> print m.group(0)
'a'
>>> m = p.match('abbabbabababbabbbbbaaaaa')
>>> print m.group(0)
abbabbabababbabbbbba

Note that the group 0 returns the part of the string matched by the regular expression.

As you can see, the expression matches a succession of a and b without repetition of a. If indeed, you want to check the whole string, you need to changed slightly:

>>> p = re.compile('^b*(abb*)*a?$')
>>> m = p.match('aa')
>>> print m
None

the ^ and $ force recognition of the beginning and end of the string.

At last, you can combine both methods by using the first regular expression, but testing at the end:

>>> len(m.group(0)) == len('aa')

Added: For the second part of the OT, it seems to me there is no discrepancy between the standard regex and the python implementation. Of course, the notation is slightly different, and the python implementation suggest some extensions (as most other packages).

PierreBdR
+1 for beating me to the answer! :) btw the `^` is not mandatory because re.match() only attempts the pattern at the very start of the string.
João Portela
oh.. your example is wrong. `p = re.compile('b*(abb)*a?')` does not match 'aba'
João Portela
oops .. just forgot a star in the first regular expression ... corrected!
PierreBdR
yet you didn't answer the OP's 'two-fold' question ;)
Antony Hatchkins
+1  A: 

Your second re should be an appropriate replacement for epsilon, as best as I understand it, though I've never seen epsilon in a regex before.

For what it's worth, your pattern is matching 'a'. That is to say, it is matching:

  • zero or more "b"s (choosing zero)
  • zero or more "(abb*)"s (choosing zero)
  • one "a" or word ending (choosing an a).

As Jonathan Feinberg pointed out, if you want to ensure the whole string matches, you have to anchor the beginning ('^') and end ('$') of your regex. You should also use a raw string whenever constructing regexes in python: r'my regex'. That will prevent excessive backslash escaping confusion.

jcdyer
+1  A: 

the problem with your expression is that it matches the empty string, meaning that if you do:

>>> p = re.compile('b*(abb*)*(a|)')
>>> p.match('c').group(0)
''

and since re.match attempts to match the start of the string, you have to tell it to match it until the end of the string. just use $ for that

>>> p = re.compile(r'b*(abb*)*(a|)$')
>>> print p.match('c')
None
>>> p.match('ababababab').group(0)
'ababababab'

ps- you may have noted that i used r'pattern' instead of 'pattern' more on that here (first paragraphs)

João Portela
+3  A: 
drozzy
Excellent summing up of the answers!
Brian Postow
"...python matches the shortest substring..." is wrong. It just doesn't necessarily match the longest substring, like a mathematically-correct regular expression would.
Alan Moore
@Alan: It matches the shortest substring if no ^ or $ is provided.
drozzy
Consider the regex `a+|a+b*` applied to `aaabb`. The shortest possible match would be `a`, while the longest match is the whole string, but Python matches `aaa`. It tries various ways to find a match, determined by the way the regex is written, and stops as soon as it finds one that works. If you reverse the order of the alternatives - `a+b*|a+` - it will match the whole string. If you add an anchor - `(a+|a+b*)$` - the first alternative fails because it doesn't reach the end, but the second alternative succeeds. Not the shortest possible match, just the first.
Alan Moore
By the way, this "regex-directed" behavior is not unique to Python. All of the so-called Perl-compatible flavors do the same: Perl, PHP, Ruby JavaScript, Java, .NET, etc.. ( http://www.regular-expressions.info/engine.html )
Alan Moore
@Alan: Yes python does "match" aaa, but it does NOT match "aaab". What I mean is that python will match the shortest possible match. Errr... or python will match any substring if it matches. But thanks for pointing that out, my wording is not perfect.
drozzy
actually, a mathematically correct regexp doesn't match the "longest" substring. It either matches the WHOLE STRING, or it doesn't. there's no "it matched most of the string..." which is why I keep getting confused. I would think that in a PL, matching the longest would be more useful than the shortest... I mean, if you want to match 'a*' in babaaaaaaa you don't want to find the empty string before the b... and if you search in aaaabaa you don't want to find the empty string before the 1st a, do you?
Brian Postow