This question is similar to "How to concisely cascade through multiple regex statements in Python", except that instead of matching one regular expression and then doing something, I need to make sure I do not match any of several regular expressions, and only do something if no matches are found (i.e. the data is valid). I have found one way to do it, but I suspect there must be a better way, especially if I end up with many regular expressions.

Basically I am filtering URLs for bad stuff ("", \\", etc.) that occurs when I yank what looks like a valid URL out of an HTML document but it turns out to be part of a piece of JavaScript (and thus needs to be evaluated, hence the escape characters). I can't use BeautifulSoup to process these pages since they are far too mangled (actually I use BeautifulSoup first, then fall back to my ugly but workable parser).

So far I have found that the following works relatively well: I compile a dict of regular expressions outside the main loop (so I only pay the compilation cost once but benefit from the speed increase every time I use it), then run each URL through that dict. If any pattern matches, the URL is bad; if none match, the URL is good:

regex_bad_url = {"1" :   re.compile('\"\"'),
                 "2" :   re.compile('\\\"')}

Followed by:

url_state = "good"

for key, pattern in regex_bad_url_components.items():
    match = re.search(pattern, url)
    if (match):
        url_state = "bad"

if (url_state == "good"):
# do stuff here ...
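
For reference, one way to keep the per-pattern bookkeeping but stop scanning at the first hit is to wrap the check in a small helper that reuses the regex_bad_url dict from above (just a sketch; the function name is made up):

def first_bad_match(url, patterns=regex_bad_url):
    # Return the key of the first pattern that matches, or None if the URL is clean.
    for key, pattern in patterns.items():
        if pattern.search(url):
            return key   # stop scanning as soon as one pattern hits
    return None

Calling first_bad_match(url) then gives back either the offending key (handy for the debugging print) or None when the URL is good.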

Now the obvious thought is to use regex "or" ("|"), i.e.:

re.compile('(\"\"|\\\")')

This reduces the number of compares and so on, but makes it much harder to troubleshoot, since with one expression per compare I can easily add a print statement like:

print "URL: ", url, " matched by key ", key

So is there some way to get the best of both worlds (i.e. a minimal number of compares) while still being able to print out which regex matched the URL? Or do I simply need to bite the bullet, keep the slower but easier-to-troubleshoot code for debugging, and then squoosh all the regexes together into one expression for production (which means one more step of programming and code maintenance, and possible problems)?

Update:

Good answer by Dave Webb, so the actual code for this would look like:

match = re.search(r'(?P<double_quotes>\"\")|(?P<slash_quote>\\\")', fullurl)
if match is None:
    # do stuff here ...
else:
    # optional for debugging
    print "url matched by", match.lastgroup
+2  A: 

"Squoosh" all the regexes into one line but put each in a named group using (?P<name>...) then use MatchOjbect.lastgroup to find which matched.

Dave Webb
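
For concreteness, one way to put this into practice without hand-maintaining a separate combined expression is to build it from a dict of patterns, so each key becomes a group name. This is only a sketch of that idea, not code from the answer; group names have to be valid Python identifiers, so the numeric keys "1" and "2" from the question would need renaming:

import re

# Patterns keyed by a label that doubles as the named-group name.
bad_parts = {"double_quotes": r'""',
             "slash_quote":   r'\\"'}

combined = re.compile("|".join("(?P<%s>%s)" % (name, pattern)
                               for name, pattern in bad_parts.items()))

url = 'http://example.com/\\"broken'   # made-up candidate URL
match = combined.search(url)
if match is None:
    pass  # do stuff here ...
else:
    print "url matched by", match.lastgroup

Each key then shows up directly in lastgroup, so the debugging print from the question keeps working.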