Please help me to discover whether this is a bug in Python (2.6.5), in my competence at writing regexes, or in my understanding of pattern matching.

(I accept that a possible answer is "Upgrade your Python".)

I'm trying to parse a Yubikey token, allowing for the optional extras.

When I use this regex to match a token without any optional extras (that is, containing only the stuff that matches the two capture groups), the match fails:

r'^\t?[^a-z0-9]?([cbdefghijklnrtuv1-8]{0,32})\t?([cbdefghijklnrtuv1-8]{32})\t?\r?\n?$'

However, if I make the first group non-greedy:

r'^\t?[^a-z0-9]?([cbdefghijklnrtuv1-8]{0,32}?)\t?([cbdefghijklnrtuv1-8]{32})\t?\r?\n?$'

it succeeds.

So, OK, it's working, but I would have thought that the only difference in end result between these two regexes would be performance.

Both Expresso and Regex Coach like both patterns.

What have I missed?


Here are two of the strings I'm testing with.

No optional extras (the ones that can fail):

"vvbrentlnccnhgfgrtetilbvckjcegblehfvbihrdcui"

With optional extras (haven't failed so far; actual tabs are shown here as "_"):

"_!_8R5Gkruvfgheufhcnhllchgrfiutujfh_"
"_!1U4Knivdgvkfthrd_brvejhudrdnbunellrjjkkccfnggbdng_"

I've tried to reproduce it using the suggestion from Alex Martelli, and it doesn't fail in the raw Python environment, so I'm going to revisit my code (I'm actually hacking on yubikey-python); I'll report back in a day or so.


My apologies to everyone. I cannot reproduce the problem. When it occurred, I was reading input via getpass; I suspect that an accidental foreign keystroke got in the way.

I am going to close the question. If whoever upvoted the question wishes to remove their vote, that is fair.

Very sorry.
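For anyone who lands here with a similar "works in the shell, fails in my program" symptom, one way to test the stray-keystroke theory is to look at the raw input before it reaches the regex. This is only a rough sketch: `ALLOWED`, `unexpected_chars`, and the sample string with an injected escape character are made up for illustration, and the check ignores the optional tabs, prefix character, and case-insensitive flag that the real pattern permits.

```python
# Alphabet copied from the character class in the question's regex
# ([cbdefghijklnrtuv1-8]); everything else here is hypothetical.
ALLOWED = set("cbdefghijklnrtuv12345678")

def unexpected_chars(token):
    """Return (index, char) for every character outside ALLOWED."""
    return [(i, ch) for i, ch in enumerate(token) if ch not in ALLOWED]

# Invented sample: a token fragment with a stray ESC keystroke in the middle.
sample = "vvbrentlnccn\x1bhgfg"

# repr() makes invisible characters visible, unlike a plain print of the string.
print(repr(sample))
print(unexpected_chars(sample))
```

Printing `repr()` of whatever getpass returned would show a control character that a normal print silently hides.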

+3  A: 

I'd recommend using yubikey-python for Python interfacing to yubikey -- but that's a side (and strictly pragmatic) issue ;-).

In theory, there should be no cases where a choice between greedy and non-greedy causes a RE to match in one case and fail in another -- it should only affect what gets matched (and, as you mention, performance), not whether the match succeeds at all, since REs are supposed to backtrack for the purpose.

Problem is, I cannot reproduce the problem -- I don't have a yubikey at hand and the tests in this file show no differences between the two REs' match/no-match behavior.

Could you please post a couple of failing examples (where one matches and the other one doesn't), ideally by editing your question, so I can reproduce the problem and try to cut it down to its minimum? Sounds like there may be a RE bug, but without reproducible cases I can't check if and when it's been fixed, already reported, or what. Thanks!

Edit: the OP has now posted one failing example, but I still can't reproduce it:

$ py26
Python 2.6.5 (r265:79359, Mar 24 2010, 01:32:55) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> r1 = re.compile(r'^\t?[^a-z0-9]?([cbdefghijklnrtuv1-8]{0,32})\t?([cbdefghijklnrtuv1-8]{32})\t?\r?\n?$')
>>> r2 = re.compile(r'^\t?[^a-z0-9]?([cbdefghijklnrtuv1-8]{0,32}?)\t?([cbdefghijklnrtuv1-8]{32})\t?\r?\n?$'
... )
>>> nox="vvbrentlnccnhgfgrtetilbvckjcegblehfvbihrdcui"
>>> r1.match(nox)
<_sre.SRE_Match object at 0xcc458>
>>> r2.match(nox)
<_sre.SRE_Match object at 0xcc920>
>>> 

i.e., the match succeeds in both cases, as it should -- and that's exactly the same 2.6.5 Python version as the OP is using. OP, please show the results of this simple sequence of commands on your platform and tell us exactly what the platform is, since it looks like a weird platform-dependent bug... thanks!

Alex Martelli
@FM, yep, tx, fixing now.
Alex Martelli
Alex, even though mine was a non-question, I've accepted your answer as being the most thoughtful and informative. No reflection on other answers, though!
Brent.Longborough
A: 

You're right: simply switching from greedy to non-greedy quantifiers should not cause a regex to stop working. It can change how quickly the regex matches (or fails to match), how much it matches, and which parts get captured in which groups, but that's all.
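A toy illustration of that point (the patterns and the test string below are invented for the example, not taken from the question): both versions match the whole string, and only the split between the two groups changes.

```python
import re

# Toy patterns: identical except for the lazy "?" on the first group.
greedy = re.compile(r'^(a*)(a*)$')
lazy = re.compile(r'^(a*?)(a*)$')

text = 'aaa'
greedy_match = greedy.match(text)
lazy_match = lazy.match(text)

# Both succeed. The greedy first group grabs everything it can;
# the lazy one takes as little as possible and leaves the rest to group 2.
print(greedy_match.groups())  # ('aaa', '')
print(lazy_match.groups())    # ('', 'aaa')
```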

(The following "solution" is not applicable, but the question still doesn't indicate that a case-insensitive match is being performed, so I'll leave it.)

Your problem is that the strings with the optional extras also have uppercase letters in them, and your regex only allows for lowercase letters. Stick a (?i) on the front of the regex and it works just fine.
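A quick sketch of the effect, using the greedy pattern and one of the "with optional extras" strings from the question. The variable names are mine, and I'm assuming each "_" in the posted string stands for exactly one tab, as the question says. Passing re.I to re.compile is equivalent to putting (?i) at the front of the pattern.

```python
import re

# Greedy pattern copied from the question.
PATTERN = (r'^\t?[^a-z0-9]?([cbdefghijklnrtuv1-8]{0,32})\t?'
           r'([cbdefghijklnrtuv1-8]{32})\t?\r?\n?$')

case_sensitive = re.compile(PATTERN)
case_insensitive = re.compile(PATTERN, re.I)

# Sample from the question, with each "_" turned back into a real tab.
with_extras = "_!1U4Knivdgvkfthrd_brvejhudrdnbunellrjjkkccfnggbdng_".replace("_", "\t")

# The uppercase letters in the first group defeat the case-sensitive pattern.
print(case_sensitive.match(with_extras))

# With re.I, both groups are captured.
m = case_insensitive.match(with_extras)
print(m.groups())
```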

Alan Moore
@Alan, but the OP said he's experiencing failures only on strings **without** the optional extras, so the presence of uppercase **in** the optional extras seems irrelevant to his reported problem.
Alex Martelli
@Alex: So he did, but when I tested it, the "without" string *matched*, and the "with" strings didn't. It seems I unconsciously revised the question to fit the observed behavior. Are we being scammed, or what? ;)
Alan Moore
@Alan: Sorry, in the interests of keeping it (too) simple, I omitted the ",re.I".
Brent.Longborough