I have a regular expression:

links = re.compile('<a(.+?)href=(?:"|\')?((?:https?://|/)[^\'"]+)(?:"|\')?(.*?)>(.+?)</a>', re.I).findall(data)

to find links in some HTML. It is taking a very long time on certain pages; any optimization advice?

One page that it chokes on is http://freeyourmindonline.net/Blog/

+2  A: 

Is there any reason you aren't using an HTML parser? With something like BeautifulSoup you can get all the links without resorting to an ugly regex like that.
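
For example, a minimal sketch (assuming the page HTML is already in a string called data, and that the bs4 package is installed):

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

# Each anchor tag exposes the href, the anchor text, and every other
# attribute, i.e. the same pieces the regex was trying to capture.
for a in soup.find_all('a', href=True):
    print(a['href'], a.get_text(), a.attrs)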

Daenyth
Is it possible to get all of the data that the regex gets? The link, the anchor text, and the bits between <a and href and after href until the end of the tag?
Matt
@Matt: I find it very difficult to understand what your regex is doing, but the general idea of HTML parsers is that they make it easy to parse HTML. I'm sure whatever you're trying to do is quite straightforward once you've read the documentation.
Mark Byers
Yes, very much so. This appears to be a duplicate of your question, and it has already been answered: http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup
Daenyth
+2  A: 

I'd suggest using BeautifulSoup for this task.

Mark Byers
A: 

How about more direct handling of hrefs?

re_href = re.compile(r"""<\s*a(?:[^>]+?)href=("[^"]*(\\"[^"]*)*"|'[^']*(\\'[^']*)*'|[^\s>]*)[^>]*>""", re.I)

That takes about 0.007 seconds, in comparison with your findall, which takes 38.694 seconds on my computer.
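
The slowdown in the original pattern most likely comes from backtracking: its several unanchored (.+?) groups can each expand across large stretches of the document whenever an anchor has no nearby closing tag. As a rough sketch of how to reproduce the comparison (assuming the page HTML is already in a string called data; absolute timings will of course vary by machine):

import re
import timeit

slow = re.compile('<a(.+?)href=(?:"|\')?((?:https?://|/)[^\'"]+)(?:"|\')?(.*?)>(.+?)</a>', re.I)
fast = re.compile(r"""<\s*a(?:[^>]+?)href=("[^"]*(\\"[^"]*)*"|'[^']*(\\'[^']*)*'|[^\s>]*)[^>]*>""", re.I)

# One pass of each pattern over the same document.
print(timeit.timeit(lambda: slow.findall(data), number=1))
print(timeit.timeit(lambda: fast.findall(data), number=1))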

ony