Hi,

I'm trying to build a list of the text between each pair of <a> and </a> tags. I have a list of links and I want the names of the links (what they're called on the page, not where they point). Any help would be appreciated.

Currently I have this:

$lines = preg_split("/\r?\n|\r/", $content);  // content is the given page
foreach ($lines as $val) {
  if (preg_match("/(<A(.*)>)(<\/A>)/", $val, $alink)) {     
    $newurl = $alink[1];

    // put in array of found links
    $links[$index] = $newurl;
    $index++;
    $is_href = true;
  }
}
A: 

Regex, the black magic, again :)

I found a nice question about common regexes. It has some interesting links where you will find very common regular expressions like yours.

Grabbing HTML Tags

<TAG\b[^>]*>(.*?)</TAG> matches the opening and closing pair of a specific HTML tag. Anything between the tags is captured into the first backreference. The question mark in the regex makes the star lazy, so that it stops before the first closing tag rather than before the last, as a greedy star would. This regex will not properly match tags nested inside themselves, as in <TAG>one<TAG>two</TAG>one</TAG>.

<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1> will match the opening and closing pair of any HTML tag. Be sure to turn off case sensitivity. The key in this solution is the use of the backreference \1 in the regex. Anything between the tags is captured into the second backreference. This solution will also not match tags nested in themselves.

Otherwise: Browse this link: keyword "link". There are some interesting approaches to filter links.

I hope this helps :)

Good luck!

furtelwart
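For what it's worth, here is a minimal PHP sketch of the generic "any tag" pattern quoted above. The sample string is made up for illustration, and the `i` modifier is how PHP turns off case sensitivity; treat it as a rough demonstration, not a robust HTML matcher.

```php
<?php
// Hypothetical sample input; not taken from the original question.
$html = 'See <a href="/faq">the FAQ</a> and <b>bold text</b>.';

// Group 1 is the tag name, \1 requires the same name in the closing tag,
// and group 2 is the lazily matched content between the tags.
$pattern = '~<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>~i';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], ': ', $m[2], "\n";
}
```

Using `~` as the delimiter avoids having to escape the `/` in the closing tag.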
+6  A: 

The standard disclaimer applies: Parsing HTML with regular expressions is not ideal. Success depends on the well-formedness of the input on a character-by-character level. If you cannot guarantee this, the regex will fail to do the Right Thing at some point.

Having said that:

<a\b[^>]*>(.*?)</a>   // match group one will contain the link text
Tomalak
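A possible way to plug the pattern above into the question's PHP, sketched under the assumption that the whole page is available as one string. The markup in `$content` is invented for the example; `preg_match_all` with the `i` modifier replaces the line-by-line loop.

```php
<?php
// $content stands in for the fetched page; the markup here is made up.
$content = '<ul><li><A HREF="one.html">First</A></li>'
         . '<li><a href="two.html" class="x">Second</a></li></ul>';

// The "i" modifier makes <A ...> and <a ...> both match.
preg_match_all('~<a\b[^>]*>(.*?)</a>~i', $content, $m);

$links = $m[1];   // group one: the text of each link
print_r($links);
```

The same caveat applies: this only behaves on reasonably well-formed input.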
+1. Would <a[^>]*>([^<]*?)</a> be even better?
e-satis
This will match on any tag starting with "a", up to a /a. <a\s*(.*)>(.*)</a> will single out a tags
Xetius
HTML 4.01 / XHTML 1.0 defines a, abbr, acronym, address, applet and area tags which will all match
Xetius
That's right. I added \b to avoid this scenario. @e-satis: non greedy matching is not necessary here, "[^<]*" will stop at the first "<" encountered.
Tomalak
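To illustrate the `\b` point from the comments above, here is a small sketch comparing the pattern with and without the word boundary. The input string is made up.

```php
<?php
// Made-up input containing an <abbr> element and a real link.
$html = '<abbr title="HyperText">HT</abbr> <a href="/x">link</a>';

// Without \b, "<a" is also the prefix of "<abbr", so the match starts there
// and runs on to the first </a>.
preg_match('~<a[^>]*>(.*?)</a>~', $html, $loose);

// With \b, a word boundary must follow the "a", so <abbr> is skipped.
preg_match('~<a\b[^>]*>(.*?)</a>~', $html, $strict);

echo $loose[1], "\n";   // starts inside the <abbr> element
echo $strict[1], "\n";  // just the link text
```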
If regex isn't the best way to find everything between <a> and </a> what is?
gaoshan88
Thanks for your help. I realized that regexp isn't the best way to do it. I'm trying out the PHP html parser suggested by slim.
Vikram Haer
Then you should accept his answer.
Tomalak
@Tomalak: I agree, but I think it's faster for the regex engine to evaluate "not <" than "anything at all".
e-satis
@e-satis: From a performance point of view, "[^<]*?" is worse than "[^<]*", because the former forces the engine to do a "do I *really* have to match again?"-check every time, whereas the latter produces the equivalent result right away. The non-greedy quantifier is adding complexity at no benefit.
Tomalak
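A quick sketch of the equivalence described above: on a link whose text contains no nested tags, the greedy and lazy forms of `[^<]*` capture the same thing. The input is invented, and this only shows the results match; it does not measure performance.

```php
<?php
// Made-up input.
$html = '<a href="/x">plain text</a>';

preg_match('~<a\b[^>]*>([^<]*)</a>~', $html, $greedy);
preg_match('~<a\b[^>]*>([^<]*?)</a>~', $html, $lazy);

// Both captures are identical, so the "?" buys nothing here.
var_dump($greedy[1] === $lazy[1]);
```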
A: 

Well, using regular expressions is not perfect, but as a Perl regex,

m!<a .*?>(.*?)</a>!i

should give you the name of the first link on that line in match group one, ignoring case.

Limitations:

  • Does not handle multiple links on one line
  • Does not handle links going over several lines.
  • Will also match on anchor tags.

You could work around this by joining all lines into one line and then split it into an array (or multiple lines) using the link start as separator.

Jørn Jensen
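As an alternative to joining and re-splitting lines, the first two limitations above can also be sidestepped in PHP by running the pattern over the whole page with the `s` modifier and `preg_match_all`. This is a different workaround from the one described, sketched on made-up input; named anchors would still match.

```php
<?php
// Made-up page fragment: two links on one line, one link split across lines.
$content = "<a href=\"1\">one</a> <a href=\"2\">two</a>\n"
         . "<a href=\"3\">three\nfour</a>\n";

// "s" lets "." match newlines; preg_match_all keeps going after the first hit.
preg_match_all('!<a .*?>(.*?)</a>!is', $content, $m);

// Collapse whitespace inside each name so a wrapped name reads as one line.
$names = array_map(function ($text) {
    return preg_replace('/\s+/', ' ', trim($text));
}, $m[1]);

print_r($names);
```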
+1  A: 
<a\s*(.*)\>(.*)</a>

<a href="http://www.stackoverflow.com">Go to stackoverflow.com</a>

$1 = href="http://www.stackoverflow.com"

$2 = Go to stackoverflow.com

I answered a similar question to strip everything except a tags here

Xetius
I changed my answer to account for this scenario, thanks for the hint. Nevertheless, your "(.*)" is wrong because of the greedy star.
Tomalak
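The greedy-star problem mentioned in the comment above shows up as soon as a line holds two links. A small sketch on an invented line, comparing `(.*)` with `(.*?)`:

```php
<?php
// Made-up line with two links.
$line = '<a href="a.html">First</a> and <a href="b.html">Second</a>';

// Greedy: the first (.*) runs as far as it can, so it swallows the first
// link and the opening of the second one.
preg_match('~<a\s*(.*)>(.*)</a>~', $line, $greedy);

// Lazy: both groups stop at the first ">" and the first "</a>".
preg_match('~<a\s*(.*?)>(.*?)</a>~', $line, $lazy);

echo $greedy[2], "\n";  // text of the last link, not the first
echo $lazy[2], "\n";    // text of the first link
```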
A: 

Hey guys, I've found a very useful tool for generating regular expressions. You should take a look!

Expresso

Greetz, ZappeL

Very nice tool...
Bruno
+2  A: 

I'm a big fan of regexes, but this is not the right place to use them.

Use a real HTML parser.

  • Your code will be clearer
  • It will be more likely to work

I Googled for a PHP HTML parser, and found this one.

If you know you're working with XHTML, then you could use PHP's standard XML parser.

slim
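As one illustration of the parser approach, here is a minimal sketch using PHP's bundled DOM extension (`DOMDocument`). That is a swapped-in choice, not the parser linked above, and it assumes the dom extension is enabled. The sample markup is made up, and skipping elements without an `href` is an assumption about what counts as a link.

```php
<?php
// Made-up input; in the question this would be the fetched page in $content.
$content = '<html><body><p><a href="/a">First <b>link</b></a></p>'
         . '<a name="top"></a><A HREF="/b">Second</A></body></html>';

$doc = new DOMDocument();
// Real-world HTML is often invalid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($content);
libxml_clear_errors();

$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    // Skip named anchors, which have no href.
    if ($a->hasAttribute('href')) {
        $links[] = trim($a->textContent);
    }
}
print_r($links);
```

Nested markup inside a link, upper-case tags, and links spanning several lines are handled by the parser, so none of the regex caveats discussed above apply here.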