ansaurus

Question

Finding anchor text when there are tags there

Answer 1

+2 A:

I would not use a regex - use an HTML parser like Beautiful Soup.

Andrew Hare 2009-03-02 17:32:17

Seems a bit heavyweight for such a simple problem

Teifion 2009-03-02 17:37:09

Never. HTML is highly irregular -- browsers are require to tolerate a large number of errors. Beautiful Soup can better processor irregular HTML than regexes can.

S.Lott 2009-03-02 18:04:05

Answer 2

+1 A:

Do a non greedy search i.e.

(.*?)

Andrew Cox 2009-03-02 17:32:35

It matches only until the tag within the anchor text

Teifion 2009-03-02 17:35:56

Answer 3

+2 A:

Instead of:

[^<>]*

Try:

((?!</a).)*

In other words, match any character that isn't the start of a </a sequence.

MarkusQ 2009-03-02 17:37:13

Thanks very much for the help :)

Teifion 2009-03-02 17:45:18

Answer 4

+1 A:

>>> import re
>>> pattern = re.compile(r'<a.+href=[\'|\"](.+)[\'|\"].*?>(.+)</a>', re.IGNORECASE)
>>> link = '<a href="http://stackoverflow.com/questions/603199/finding-anchor-text-when-there-are-tags-there"&gt;Finding anchor text when there are tags there</a>'
>>> re.match(pattern, link).group(1)
'http://stackoverflow.com/questions/603199/finding-anchor-text-when-there-are-tags-there'
>>> re.match(pattern, link).group(2)
'Finding anchor text when there are tags there'

Selinap 2009-03-03 00:13:46

ansaurus

tags:

views:

answers:

Finding anchor text when there are tags there

related questions