I have a regular expression:

links = re.compile('<a(.+?)href=(?:"|\')?((?:https?://|/)[^\'"]+)(?:"|\')?(.*?)>(.+?)</a>', re.I).findall(data)

to find links in some HTML. It is taking a very long time on certain pages; any optimization advice?

One page that it chokes on is http://freeyourmindonline.net/Blog/

+2  A: 

Is there any reason you aren't using an HTML parser? With something like BeautifulSoup you can get all the links without resorting to an ugly regex like that.
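
For example, a minimal sketch (assuming the page HTML is already in a string called data, and that the bs4 package is installed):

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

# Each anchor tag exposes the href, the anchor text, and every other
# attribute, i.e. the same pieces the regex was trying to capture.
for a in soup.find_all('a', href=True):
    print(a['href'], a.get_text(), a.attrs)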

Daenyth
Is it possible to get all of the data that the regex gets? The link, the anchor text, and the bits between <a and href and after href until the end of the tag?
Matt
@Matt: I find it very difficult to understand what your regex is doing, but the general idea of HTML parsers is that they make it easy to parse HTML. I'm sure whatever you're trying to do is quite straightforward once you've read the documentation.
Mark Byers
Yes, very much so. This appears to be a duplicate of your question, and it has already been answered: http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup
Daenyth
+2  A: 

I'd suggest using BeautifulSoup for this task.

Mark Byers
A: 

How about more direct handling of hrefs?

re_href = re.compile(r"""<\s*a(?:[^>]+?)href=("[^"]*(\\"[^"]*)*"|'[^']*(\\'[^']*)*'|[^\s>]*)[^>]*>""", re.I)

That takes about 0.007 seconds, in comparison with your findall, which takes 38.694 seconds on my computer.
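
The slowdown in the original pattern most likely comes from backtracking: its several unanchored (.+?) groups can each expand across large stretches of the document whenever an anchor has no nearby closing tag. As a rough sketch of how to reproduce the comparison (assuming the page HTML is already in a string called data; absolute timings will of course vary by machine):

import re
import timeit

slow = re.compile('<a(.+?)href=(?:"|\')?((?:https?://|/)[^\'"]+)(?:"|\')?(.*?)>(.+?)</a>', re.I)
fast = re.compile(r"""<\s*a(?:[^>]+?)href=("[^"]*(\\"[^"]*)*"|'[^']*(\\'[^']*)*'|[^\s>]*)[^>]*>""", re.I)

# One pass of each pattern over the same document.
print(timeit.timeit(lambda: slow.findall(data), number=1))
print(timeit.timeit(lambda: fast.findall(data), number=1))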

ony