This question is similar to "How to concisely cascade through multiple regex statements in Python", except that instead of matching one regular expression and then doing something, I need to make sure I do not match any of several regular expressions, and only do something if no matches are found (i.e. the data is valid). I have found one way to do it, but I suspect there must be a better way, especially if I end up with many regular expressions.

Basically I am filtering URLs for bad stuff ("", \\", etc.) that occurs when I yank what looks like a valid URL out of an HTML document but it turns out to be part of a piece of JavaScript (and thus needs to be evaluated, hence the escape characters). I can't use BeautifulSoup to process these pages since they are far too mangled (actually I use BeautifulSoup first, then fall back to my ugly but workable parser).

So far I have found that the following works relatively well: I compile a dict of regular expressions outside the main loop (so I only pay the compilation cost once but benefit from the speed increase every time I use it), then run each URL through that dict. If any pattern matches, the URL is bad; if none match, the URL is good:

regex_bad_url = {"1" :   re.compile('\"\"'),
                 "2" :   re.compile('\\\"')}

Followed by:

url_state = "good"

for key, pattern in regex_bad_url_components.items():
    match = re.search(pattern, url)
    if (match):
        url_state = "bad"

if (url_state == "good"):
# do stuff here ...
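
For reference, one way to keep the per-pattern bookkeeping but stop scanning at the first hit is to wrap the check in a small helper that reuses the regex_bad_url dict from above (just a sketch; the function name is made up):

def first_bad_match(url, patterns=regex_bad_url):
    # Return the key of the first pattern that matches, or None if the URL is clean.
    for key, pattern in patterns.items():
        if pattern.search(url):
            return key   # stop scanning as soon as one pattern hits
    return None

Calling first_bad_match(url) then gives back either the offending key (handy for the debugging print) or None when the URL is good.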

Now the obvious thought is to use regex "or" ("|"), i.e.:

re.compile('(\"\"|\\\")')

This reduces the number of compares and so on, but makes it much harder to troubleshoot, since with one expression per compare I can easily add a print statement like:

print "URL: ", url, " matched by key ", key

So is there some way to get the best of both worlds (i.e. a minimal number of compares) while still being able to print out which regex matched the URL? Or do I simply need to bite the bullet, keep the slower but easier-to-troubleshoot code for debugging, and then squoosh all the regexes together into one expression for production (which means one more step of programming and code maintenance, and possible problems)?

Update:

Good answer by Dave Webb, so the actual code for this would look like:

match = re.search(r'(?P<double_quotes>\"\")|(?P<slash_quote>\\\")', fullurl)
if match is None:
    # do stuff here ...
else:
    # optional for debugging
    print "url matched by", match.lastgroup
+2  A: 

"Squoosh" all the regexes into one line but put each in a named group using (?P<name>...) then use MatchOjbect.lastgroup to find which matched.

Dave Webb
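
For concreteness, one way to put this into practice without hand-maintaining a separate combined expression is to build it from a dict of patterns, so each key becomes a group name. This is only a sketch of that idea, not code from the answer; group names have to be valid Python identifiers, so the numeric keys "1" and "2" from the question would need renaming:

import re

# Patterns keyed by a label that doubles as the named-group name.
bad_parts = {"double_quotes": r'""',
             "slash_quote":   r'\\"'}

combined = re.compile("|".join("(?P<%s>%s)" % (name, pattern)
                               for name, pattern in bad_parts.items()))

url = 'http://example.com/\\"broken'   # made-up candidate URL
match = combined.search(url)
if match is None:
    pass  # do stuff here ...
else:
    print "url matched by", match.lastgroup

Each key then shows up directly in lastgroup, so the debugging print from the question keeps working.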