tags:

views:

219

answers:

4

This is a great regular expression for dates... However it hangs indefinitely on this one page I tried... I wanted to try this page ( http://pleac.sourceforge.net/pleac%5Fpython/datesandtimes.html ) for the fact that it does have lots of dates on it and I want to grab all of them. I don't understand why it is hanging when it doesn't on other pages... Why is my regexp hanging and/or how could I clean it up to make it better/efficient ?

Python Code:

monthnames = "(?:Jan\w*|Feb\w*|Mar\w*|Apr\w*|May|Jun\w?|Jul\w?|Aug\w*|Sep\w*|Oct\w*|Nov(?:ember)?|Dec\w*)"

pattern1 = re.compile(r"(\d{1,4}[\/\\\-]+\d{1,2}[\/\\\-]+\d{2,4})")

pattern4 = re.compile(r"(?:[\d]*[\,\.\ \-]+)*%s(?:[\,\.\ \-]+[\d]+[stndrh]*)+[:\d]*[\ ]?(PM)?(AM)?([\ \-\+\d]{4,7}|[UTCESTGMT\ ]{2,4})*"%monthnames, re.I)

patterns = [pattern4, pattern1]

for pattern in patterns:
    print re.findall(pattern, s)

btw... when i say im trying it against this site.. I'm trying it against the webpage source.

A: 

The Python regular expression evaluator can take a long time. Unfortunately, it runs in a worst case exponential time.

I assume your "s" contains a copy of the entire page. If so, that could cause very long backtracks in the regular expression evaluator. Perhaps you should break the page up into smaller chunks, and run them individually through your regular expression. You could use an HTML parser like beautifulsoup and run the regular expression on each text node individually. This might reduce the runtime.

McPherrinM
hm. good advice sir. i even changed my regex to `pattern4 = re.compile(r"(?:[\d]*[\,\.\ \-]+)*%s(?:[\,\.\ \-]+[\d]+[stndrh]*)+?[:\d]*[\ ]?(?:PM)?(?:AM)?(?:[\ \-\+\d]{4,7})?(?:[UTCESTGMT\ ]{2,4})?"%monthnames, re.I)`but still the same... I'll give this a try tho, ill first try line by line and then Bsoup..
Line by Line is a reasonable solution, however, what if a date is spread across two lines? If you're not worried about that, then line-by-line will be faster.
McPherrinM
A: 

The way the regex is written will cause a lot of backtracking. In addition to the tip about running it on smaller chunks of text, you can also have a simpler (and thus faster) regex that filters out text that cannot match.

Alex Brasetvik
A: 

First, you should read up on what an r"" string means: you only have to put in backslashes where you really want backslashes, so your regex should really just be:

monthnames = "(?:Jan\w*|Feb\w*|Mar\w*|Apr\w*|May|Jun\w?|Jul\w?|Aug\w*|Sep\w*|Oct\w*|Nov(?:ember)?|Dec\w*)"

pattern1 = re.compile(r"(\d{1,4}[-/]+\d{1,2}[-/]+\d{2,4})")

pattern4 = re.compile(r"(?:\d*[,. -]+)*%s(?:[,. -]+\d+[stndrh]*)+[:\d]*[ ]?(PM)?(AM)?([ -+\d]{4,7}|[UTCESTGMT ]{2,4})*"%monthnames, re.I)

As to your real problem, Python doesn't do well with a * nested inside a *. Change pattern4 to this (the first \d* becomes \d+):

pattern4 = re.compile(r"(?:\d+[,. -]+)*%s(?:[,. -]+\d+[stndrh]*)+[:\d]*[ ]?(PM)?(AM)?([ -+\d]{4,7}|[UTCESTGMT ]{2,4})*"%monthnames, re.I)

and the regex returns quickly, printing this:

[('', '', '2003'), ('', '', '1973'), ('', 'AM', ' 1973'), ('', '', '1981"')]
['2003-08-06', '2003-08-07', '2003-07-23', '1973-01-18', '3/14/1973', '16/6/1981', '16/6/1981', '16/6/1981', '16/6/1981'
, '08/08/2003']

though I don't know if that's what you wanted.

Ned Batchelder
Wow, thats pretty good advice ... it no longer hangs with your suggestions. however it isn't in the format i wanted nor returning the matches i need... but I might be able to do alittle tweaking to to do so... Before the changes it would "strictly" return matches based on a month name (Feb or November) being in present... But now it returns any 4 digit year now.. (pattern4 i mean...)I'll c what I can do to get it right... thanks !
essentially... pattern4 = re.compile(r"(?:[\d]+[,. -]+)*%s(?:[,. -]+[\d]+[stndrh]*)+?[:\d]*[ ]?(?:PM)?(?:AM)?(?:[ -+\d]{4,7})?(?:[UTCESTGMT ]{2,4})?"%monthnames, re.I)wow, changing that one nexted * changed everything... This one works kinda well...
+2  A: 

You should read Mastering Regular Expressions. The problem is:

(?:[\d]*[\,\.\ \-]+)*

which takes exponential time. Try using:

(?:[\d,. \-]*[,. \-])?

which should match the same things but take linear time. Having checked your example, this does indeed speed things up.

You also appear to have accidentally introduced capturing groups into your pattern at some point: change e.g. (AM) to (?:AM) to fix that. This gets us the following output from your example above:

[' Aug  6 20:43:20 2003', ' Mar 14 06:02:55 1973', ' March 14 06:02:55 AM 1973', ' Jun 16 20:18:03 1981']
['2003-08-06', '2003-08-07', '2003-07-23', '1973-01-18', '3/14/1973', '16/6/1981', '16/6/1981', '16/6/1981', '16/6/1981', '08/08/2003']

To get into details (which the book I reference is very good at), *s and +s work (in NFAs like python's re) rather like a loop. The original pattern's inner loop will match against a long string of whitespace characters, but when the subsequent pattern fails to match, it will "give them up" one at a time. The outer loop will then rerun the inner loop on the remaining pattern, and of course it will immediately grab the whitespace characters again. Each time one instance of the inner loop relinquishes a whitespace character, a new copy will be summoned to grab it again. Eventually, once the engine has gone through every possible way of splitting that whitespace string (an exponential number of possibilities), it will move the starting point forwards one space...and try again.

On another note, your pattern looks a bit bonkers ;)

chrispy
thanks.. great... running pattern4 now returns [' Aug 6 ', ' Mar 14 ', ' March 14 ', ' Jun 16 '] .. thanks
pattern4 = re.compile(r"(?:[\d,. \-]*[,. \-])?%s(?:[,. -]+[\d]+[stndrh]*)+[:\d]*[ ]?(?:PM)?(?:AM)?(?:[ -+\d]{4,7})?(?:[UTCESTGMT ]{2,4})?"%monthnames, re.I)Returns what I want... Thanks!
oh yea.. and it's bonkers because of the WIDE variety of ways to represent a date... versatility was the goal :p thanks