ansaurus

Question

python re problem

Answer 1

+7 A:

The expression [\s\S]*? can match any amount of anything. This can potentially cause an enormous amount of backtracking in the case that the match fails. If you are more specific about what you can and can't match then it will allow the match to fail faster.

Also, I'd advise you to use an HTML parser instead of regular expressions for this. Beautiful Soup is an excellent library that is easy to use.

Mark Byers 2010-10-03 23:38:40

mlzboy 2010-10-03 23:46:01

@mlzboy: Perhaps I wasn't clear enough? Your regular expression is **too slow**. The server doesn't respond because it hasn't finished searching for a match.

Mark Byers 2010-10-04 00:00:57

@i dont think add a character 'd' will much slower than before

mlzboy 2010-10-04 01:09:37

@mlzboy: The fact that you don't understand why a regular expression can suddenly become many orders of magnitude slower by adding a single character shows that you don't *really* know how regular expressions work. Writing regular expressions that work correctly and perform well is not simple. Especially so when you are trying to use them to parse a language that is not regular. BeautifulSoup is very simple and easy to use. In the one hour you have wasted trying to parse HTML with regex you could have learned how to use Beautiful Soup correctly.

Mark Byers 2010-10-04 01:19:58

I suggest also you read this question and the top voted answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

Mark Byers 2010-10-04 01:21:31

@Mark Byers since i have use pyquery or lxml,i'll never use beautifulsoup again,another reason is i didn't want in my project use many way to extract information,also regular expression has gui tools help,it's more suit for me

mlzboy 2010-10-04 02:14:59

Answer 2

+4 A:

Your regex is suffering from catastrophic backtracking. If it can find a match it's fine, but if it can't, it has to try a virtually infinite number of possibilities before it gives up. Every one of those [\s\S]*? constructs ends up trying to match all the way to the end of the document, and the interaction between them creates a staggering amount of useless work.

Python doesn't support atomic groups, but here's a little trick you can use to imitate them:

a=re.findall(r"""(?=(<ul>[\s\S]*?<li><a href="(?P<link>[\s\S]*?)"[\s\S]*?<img src="(?P<img>[\s\S]*?)"[\s\S]*?<br/>[\s\S]*?</li>[\s\S]*?</li>[\s\S]*?</li>[\s\S]*?</ul>))\1d""",html)
print a

If the lookahead succeeds, the whole <UL> element is captured in group #1, the match position resets to the beginning of the element, then the \1 backreference consumes the element. But if the next character is not d, it does not go back and muck about with all those [\s\S]*? constructs again, like your regex does.

Instead, the regex engine goes straight back to the beginning of the <UL> element, then bumps ahead one position (so it's between the < and the u) and tries the lookahead again from the beginning. It keeps doing that until it finds another match for the lookahead, or it reaches the end of the document. In this way, it will fail (the expected result) in about the same time your first regex took to succeed.

Note that I'm not presenting this trick as a solution, just trying to answer your question as to why your regex seems to hang. If I were offering a solution, I would say to stop using [\s\S]*? (or [\s\S]*, or .*, or .*?); you're relying on that too much. Try to be as specific as you reasonably can--for example, instead of:

<a href="(?P<link>[\s\S]*?)"[\s\S]*?<img src="(?P<img>[\s\S]*?)"[\s\S]*?

...use:

<a href="(?P<link>[^"]*)"[^>]*><img src="(?P<img>[^"]*)"[^>]*>

But even that has serious problems. You should seriously consider using an HTML parser for this job. I love regexes too, but you're asking too much from them.

Alan Moore 2010-10-04 03:07:20

mlzboy 2010-10-04 09:31:00

@mlzboy: Glad I could help. If you really want to master regular expressions, the best thing you can do is read The Book: http://www.oreilly.com/catalog/regex3/index.html. Very accessible, it can lead you to an amazing grasp of regexes, without forcing you to become a computer scientist. :P

Alan Moore 2010-10-04 18:34:50

ansaurus

tags:

views:

answers:

python re problem

related questions