ansaurus

Question

regexp for finding everything between <a> and </a> tags

Answer 1

A:

Regex, the black magic, again :)

I found a nice question about common regexes. It has some interesting links where you will find very common regular expressions like yours.

Grabbing HTML Tags

<TAG\b[^>]*>(.*?)</TAG> matches the opening and closing pair of a specific HTML tag. Anything between the tags is captured into the first backreference. The question mark in the regex makes the star lazy, to make sure it stops before the first closing tag rather than before the last, like a greedy star would do. This regex will not properly match tags nested inside themselves, like in <TAG>one<TAG>two</TAG>one</TAG>.

<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1> will match the opening and closing pair of any HTML tag. Be sure to turn off case sensitivity. The key in this solution is the use of the backreference \1 in the regex. Anything between the tags is captured into the second backreference. This solution will also not match tags nested in themselves.
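As a rough illustration of the two patterns described above, here is a minimal Python sketch. Python's `re` module, the sample string, and the variable names are assumptions for the example, not part of the quoted source; `a` stands in for TAG.

```python
import re

# Specific tag: "a" stands in for TAG. (.*?) is lazy, so each match stops
# at the first closing tag instead of the last one.
SPECIFIC = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)

# Any tag: the backreference \1 must repeat the tag name captured by group 1.
ANY_TAG = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>",
                     re.IGNORECASE | re.DOTALL)

# Made-up input for the example.
sample = '<p>See <a href="/x">one</a> and <a href="/y">two</a>.</p>'

link_texts = SPECIFIC.findall(sample)
any_tag_matches = [(m.group(1), m.group(2)) for m in ANY_TAG.finditer(sample)]

print(link_texts)       # ['one', 'two']
print(any_tag_matches)  # one match: tag name 'p' plus everything inside it
```

Note that the any-tag pattern consumes the outer `<p>` element in one match, so the links inside it are not reported separately.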

Otherwise: Browse this link: keyword "link". There are some interesting approaches to filter links.

I hope this helps :)

Good luck!

furtelwart 2008-12-05 07:33:28

Answer 2

+6 A:

The standard disclaimer applies: Parsing HTML with regular expressions is not ideal. Success depends on the well-formedness of the input on a character-by-character level. If you cannot guarantee this, the regex will fail to do the Right Thing at some point.

Having said that:

<a\b[^>]*>(.*?)</a>   // match group one will contain the link text
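To see what the `\b` in this pattern buys you, here is a small Python sketch (Python and the sample markup are assumptions for illustration). It compares the pattern with and without the word boundary on input that contains an `<abbr>` tag.

```python
import re

WITH_BOUNDARY = re.compile(r"<a\b[^>]*>(.*?)</a>")
WITHOUT_BOUNDARY = re.compile(r"<a[^>]*>(.*?)</a>")

# Made-up input: an <abbr> element followed by a real link.
html = '<abbr title="x">HTML</abbr> <a href="/faq">FAQ</a>'

with_b = WITH_BOUNDARY.findall(html)
without_b = WITHOUT_BOUNDARY.findall(html)

print(with_b)     # ['FAQ']
print(without_b)  # starts matching at <abbr ...> and runs up to the first </a>
```

Without `\b`, the match begins at `<abbr` and swallows everything up to the first `</a>`, which is almost never what you want.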

Tomalak 2008-12-05 08:10:15

+1. Would <a[^>]*>([^<]*?)</a> be even better?

e-satis 2008-12-05 09:11:14

This will match on any tag starting with "a", up to a /a. <a\s*(.*)>(.*)</a> will single out a tags

Xetius 2008-12-05 09:17:45

HTML 4.01 / XHTML 1.0 defines a, abbr, acronym, address, applet and area tags which will all match

Xetius 2008-12-05 09:22:38

That's right. I added \b to avoid this scenario. @e-satis: non greedy matching is not necessary here, "[^<]*" will stop at the first "<" encountered.

Tomalak 2008-12-05 10:49:29

If regex isn't the best way to find everything between <a> and </a> what is?

gaoshan88 2008-12-05 19:55:19

Thanks for your help. I realized that regexp isn't the best way to do it. I'm trying out the PHP html parser suggested by slim.

Vikram Haer 2008-12-05 23:44:49

Then you should accept his answer.

Tomalak 2008-12-06 14:48:59

@Tomalak: I agree, but I think it's faster for the regex engine to evaluate "not <" than to match "any character at all".

e-satis 2008-12-07 11:19:38

@e-satis: From a performance point of view, "[^<]*?" is worse than "[^<]*", because the former forces the engine to do a "do I *really* have to match again?"-check every time, whereas the latter produces the equivalent result right away. The non-greedy quantifier is adding complexity at no benefit.
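The equivalence claimed here can be checked with a short Python sketch (Python and the sample input are assumptions for illustration; it says nothing about speed). Both variants return the same captures, and both skip a link whose content contains another tag, because `[^<]` cannot cross a `<`.

```python
import re

GREEDY = re.compile(r"<a\b[^>]*>([^<]*)</a>")
LAZY = re.compile(r"<a\b[^>]*>([^<]*?)</a>")

# Made-up input; the third link contains nested markup.
html = ('<a href="/a">first</a> <a href="/b">second</a> '
        '<a href="/c">with <b>bold</b></a>')

greedy_result = GREEDY.findall(html)
lazy_result = LAZY.findall(html)

print(greedy_result)  # ['first', 'second']
print(lazy_result)    # ['first', 'second']
```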

Tomalak 2008-12-07 12:54:47

Answer 3

A:

Well, using regular expressions is not perfect, but as a Perl regexp,

m!<a .*?>(.*?)</a>!i

should give you the name of the first link on that line in match group one, ignoring case.

Limitations:

Does not handle multiple links on one line
Does not handle links going over several lines.
Will also match on anchor tags.

You could work around this by joining all lines into one line and then split it into an array (or multiple lines) using the link start as separator.
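A different way around the first two limitations, sketched in Python rather than Perl (the language, the sample page, and the names are assumptions): let `.` match newlines and collect every match instead of only the first. This is a swapped-in technique, not the join-and-split approach described above, and it still matches `<a name="...">` anchors.

```python
import re

# Same pattern as above, with DOTALL so "." also matches newlines.
LINK = re.compile(r"<a .*?>(.*?)</a>", re.IGNORECASE | re.DOTALL)

# Made-up input: two links, the second spanning two lines.
page = """<p><a href="/one">One</a> and <A HREF="/two">Two
spans lines</A></p>"""

# Collapse internal whitespace so multi-line link text reads as one line.
names = [" ".join(text.split()) for text in LINK.findall(page)]

print(names)  # ['One', 'Two spans lines']
```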

Jørn Jensen 2008-12-05 08:49:04

Answer 4

+1 A:

<a\s*(.*)\>(.*)</a>

<a href="http://www.stackoverflow.com">Go to stackoverflow.com</a>

$1 = href="http://www.stackoverflow.com"

$2 = Go to stackoverflow.com

I answered a similar question to strip everything except a tags here

Xetius 2008-12-05 09:13:54

I changed my answer to account for this scenario, thanks for the hint. Nevertheless, your "(.*)" is wrong because of the greedy star.
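The greedy-star problem is easy to reproduce. The Python sketch below (language, sample line, and the alternative pattern are assumptions for illustration) runs the pattern from this answer on a line with two links, then a variant using `[^>]*` and a lazy `.*?`.

```python
import re

GREEDY = re.compile(r"<a\s*(.*)>(.*)</a>")
# One possible fix: attributes cannot cross ">", link text is lazy.
TIGHTER = re.compile(r"<a\s+([^>]*)>(.*?)</a>")

# Made-up input with two links on one line.
line = '<a href="/one">One</a> <a href="/two">Two</a>'

greedy_groups = GREEDY.search(line).groups()
tighter_groups = TIGHTER.findall(line)

print(greedy_groups)   # group 1 runs across the first link into the second
print(tighter_groups)  # [('href="/one"', 'One'), ('href="/two"', 'Two')]
```

With the greedy version, group 1 captures `href="/one">One</a> <a href="/two"` and only `Two` ends up in group 2.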

Tomalak 2008-12-05 11:18:28

Answer 5

A:

Hey guys, I've found a very useful tool for generating regular expressions. You should take a look!

Expresso

Greetz, ZappeL

2008-12-05 10:45:13

Very nice tool...

Bruno 2008-12-05 10:46:38

Answer 6

+2 A:

I'm a big fan of regexes, but this is not the right place to use them.

Use a real HTML parser.

Your code will be clearer
It will be more likely to work

I Googled for a PHP HTML parser, and found this one.

If you know you're working with XHTML, then you could use PHP's standard XML parser.
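For comparison, here is what the parser approach can look like. This sketch uses Python's standard `html.parser` module instead of a PHP library, purely as an illustration; the class and function names are made up for the example. It collects the text inside each `<a>` element, dropping any nested markup.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collects the text content of every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished link texts, in document order
        self._depth = 0    # > 0 while inside an <a> element
        self._buffer = []  # text fragments of the link being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <A> and <a> both arrive as "a".
        if tag == "a":
            self._depth += 1
            if self._depth == 1:
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "a" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.links.append("".join(self._buffer))

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def link_texts(html):
    """Return the text of each link in the given HTML string."""
    collector = LinkTextCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(link_texts('<a href="/x">Go to <b>stackoverflow</b>.com</a> <abbr>x</abbr>'))
# ['Go to stackoverflow.com']
```

Unlike the regex versions, this does not confuse `<abbr>` with `<a>` and does not care about case, attribute order, or line breaks inside the tag.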

slim 2008-12-05 11:00:55
