Your regex is suffering from catastrophic backtracking. If it can find a match it's fine, but if it can't, it has to try an enormous number of possibilities before it gives up. Every one of those [\s\S]*? constructs ends up trying to match all the way to the end of the document, and the interaction between them creates a staggering amount of useless work.
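To get a feel for how quickly the work grows, here is a rough sketch. The pattern and the generated document are made up for illustration (a cut-down cousin of your regex, not the real thing), and the timings will vary by machine. The point is only that a failed search gets disproportionately slower as the document grows.

```python
import re
import time

# Hypothetical pattern for illustration: four lazy wildcards chained together.
PATTERN = re.compile(r"<ul>[\s\S]*?<li>[\s\S]*?</li>[\s\S]*?</li>[\s\S]*?</ul>")


def time_failed_search(item_count):
    """Search a document that can never match; return (matched, seconds)."""
    # The closing </ul> is deliberately missing, so every attempt must fail.
    doc = "<ul>" + "<li>x</li>" * item_count
    start = time.perf_counter()
    matched = PATTERN.search(doc) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    # Doubling the number of list items should much more than double the time.
    for item_count in (10, 20, 40):
        matched, seconds = time_failed_search(item_count)
        print(f"{item_count:3d} items: matched={matched}, {seconds:.4f}s")
```

Each lazy wildcard can stop at any later <li> or </li>, so the engine ends up trying every combination of stopping points before it can report failure.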
Python's re module doesn't support atomic groups (not before Python 3.11, anyway, which added (?>...) and possessive quantifiers), but here's a little trick you can use to imitate them:
import re
# html is the document text being searched
a = re.findall(r"""(?=(<ul>[\s\S]*?<li><a href="(?P<link>[\s\S]*?)"[\s\S]*?<img src="(?P<img>[\s\S]*?)"[\s\S]*?<br/>[\s\S]*?</li>[\s\S]*?</li>[\s\S]*?</li>[\s\S]*?</ul>))\1d""", html)
print(a)
If the lookahead succeeds, the whole <UL> element is captured in group #1, the match position resets to the beginning of the element, then the \1 backreference consumes the element. But if the next character is not d, it does not go back and muck about with all those [\s\S]*? constructs again, like your regex does.
Instead, the regex engine goes straight back to the beginning of the <UL> element, then bumps ahead one position (so it's between the < and the u) and tries the lookahead again from the beginning. It keeps doing that until it finds another match for the lookahead, or it reaches the end of the document. In this way, it will fail (the expected result) in about the same time your first regex took to succeed.
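The behaviour described above is easier to see on a toy input. This is a minimal sketch with a made-up pattern and string (a ... b followed by d), not your real regex. It shows that the lookahead-plus-backreference version refuses to stretch the lazy wildcard, while the plain version happily backtracks into it.

```python
import re

text = "a1b2bd"

# Plain version: "d" does not follow the first "b", so the engine backtracks
# into the lazy wildcard and stretches it to the second "b".
plain = re.search(r"a[\s\S]*?bd", text)
print(plain.group())  # a1b2bd

# Imitated atomic group: the lookahead captures "a1b", \1 consumes it, and
# "d" fails to match "2". The engine cannot re-enter the lookahead to try a
# longer capture, so it gives up at this position and moves on.
atomic_like = re.search(r"(?=(a[\s\S]*?b))\1d", text)
print(atomic_like)  # None

# When "d" does follow the first "b", both versions succeed.
print(re.search(r"(?=(a[\s\S]*?b))\1d", "a1bd").group())  # a1bd
```

Note that the two versions are not equivalent: the second one rejects inputs the first one accepts, and that is exactly what stops the runaway backtracking.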
Note that I'm not presenting this trick as a solution, just trying to answer your question as to why your regex seems to hang. If I were offering a solution, I would say to stop using [\s\S]*? (or [\s\S]*, or .*, or .*?); you're relying on that too much. Try to be as specific as you reasonably can. For example, instead of:
<a href="(?P<link>[\s\S]*?)"[\s\S]*?<img src="(?P<img>[\s\S]*?)"[\s\S]*?
...use:
<a href="(?P<link>[^"]*)"[^>]*><img src="(?P<img>[^"]*)"[^>]*>
But even that has serious problems. You should seriously consider using an HTML parser for this job. I love regexes too, but you're asking too much from them.
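As a rough idea of what the parser route could look like, here is a sketch using the standard library's html.parser (Python 3). The class name, the sample markup, and the assumption that you want each link paired with the image inside it are all mine, since I'm guessing at your page structure. Treat it as a starting point rather than a drop-in replacement.

```python
from html.parser import HTMLParser


class LinkImageCollector(HTMLParser):
    """Collect (href, src) pairs for every <img> nested inside an <a>."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href_stack = []  # hrefs of the <a> elements currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._href_stack.append(attrs.get("href"))
        elif tag == "img" and self._href_stack:
            # Pair the image with the innermost enclosing link.
            self.pairs.append((self._href_stack[-1], attrs.get("src")))

    def handle_endtag(self, tag):
        if tag == "a" and self._href_stack:
            self._href_stack.pop()


# Hypothetical sample markup, loosely shaped like the regex above expects.
sample = (
    '<ul>'
    '<li><a href="/item/1"><img src="/img/1.png" alt="one"><br/>One</a></li>'
    '<li><a href="/item/2"><img src="/img/2.png"><br/>Two</a></li>'
    '</ul>'
)

collector = LinkImageCollector()
collector.feed(sample)
print(collector.pairs)  # [('/item/1', '/img/1.png'), ('/item/2', '/img/2.png')]
```

A parser like this doesn't care about attribute order, extra whitespace, or how many list items there are, and it fails quickly on input that doesn't contain what you're looking for.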