tags:

views:

154

answers:

4

I want to find the text between a pair of <a> tags that link to a given site

Here's the re string that I'm using to find the content:

r'''(<a([^<>]*)href=("|')(http://)?(www\.)?%s([^'"]*)("|')([^&lt;&gt;]*)&gt;([^&lt;]*))&lt;/a&gt;''' % our_url

The result will be something like this:

r'''(<a([^<>]*)href=("|')(http://)?(www\.)?stackoverflow.com([^'"]*)("|')([^&lt;&gt;]*)&gt;([^&lt;]*))&lt;/a&gt;'''

This is great for most links but it errors with a link with tags within it. I tried changing the final part of the regex from:

([^<]*))</a>'''

to:

(.*))</a>'''

But that just got everything on the page after the link, which I don't want. Are there any suggestions on what I can do to solve this?

+2  A: 

I would not use a regex - use an HTML parser like Beautiful Soup.

Andrew Hare
Seems a bit heavyweight for such a simple problem
Teifion
Never. HTML is highly irregular -- browsers are require to tolerate a large number of errors. Beautiful Soup can better processor irregular HTML than regexes can.
S.Lott
+1  A: 

Do a non greedy search i.e.

(.*?)
Andrew Cox
It matches only until the tag within the anchor text
Teifion
+2  A: 

Instead of:

[^<>]*

Try:

((?!</a).)*

In other words, match any character that isn't the start of a </a sequence.

MarkusQ
Thanks very much for the help :)
Teifion
+1  A: 
>>> import re
>>> pattern = re.compile(r'<a.+href=[\'|\"](.+)[\'|\"].*?>(.+)</a>', re.IGNORECASE)
>>> link = '<a href="http://stackoverflow.com/questions/603199/finding-anchor-text-when-there-are-tags-there"&gt;Finding anchor text when there are tags there</a>'
>>> re.match(pattern, link).group(1)
'http://stackoverflow.com/questions/603199/finding-anchor-text-when-there-are-tags-there'
>>> re.match(pattern, link).group(2)
'Finding anchor text when there are tags there'
Selinap