Hi there

Trying to find the links on a page.

My regex is:

/<a\s[^>]*href=(\"\'??)([^\"\' >]*?)[^>]*>(.*)<\/a>/

but it seems to fail on

<a title="this" href="that">what?</a>

How would I change my regex to handle an href that is not the first attribute in the a tag?

Thanks!

A: 

Why don't you just match

"<a.*?href\s*=\s*['"](.*?)['"]"

<?php

$str = '<a title="this" href="that">what?</a>';

$res = array();

preg_match_all("/<a.*?href\s*=\s*['\"](.*?)['\"]/", $str, $res);

var_dump($res);

?>

then

$ php test.php
array(2) {
  [0]=>
  array(1) {
    [0]=>
    string(27) "<a title="this" href="that""
  }
  [1]=>
  array(1) {
    [0]=>
    string(4) "that"
  }
}

which works. I've just removed the first capturing group.

Aif
Doesn't work, sorry.
bergin
Editing answer.
Aif
A: 

The pattern you want is the link anchor pattern, something like:

$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";
Alexander.Plutov
A: 

I made a test here: http://www.spaweditor.com/scripts/regex/index.php. Your regex finds the link; just remove the semicolon. What is your intention?

Gadolin
Oops, sorry, the " "; was a mistake.
bergin
A: 

Quick test: <a\s+[^>]*href=(\"\'??)([^\1]+)(?:\1)>(.*)<\/a> seems to do the trick, with the first match being " or ', the second the href value 'that', and the third the link text 'what?'.

I left the first capture group (the " or ') in place so that you can backreference it later and require the closing quote to be the same character.

See live example on: http://www.rubular.com/r/jsKyK2b6do

CharlesLeaf
Doesn't work, sorry.
bergin
@bergin please specify: what doesn't work? I get the exact value from the href in your test HTML. What are you expecting that this doesn't do? I see you use a different site for testing; there I also get the href value successfully from your example. http://www.myregextester.com/?r=d966dd6b
CharlesLeaf
A: 

I'm not sure what you're trying to do here, but if you're trying to validate the link, look at PHP's filter_var().

If you really need to use a regular expression then check out this tool, it may help: http://regex.larsolavtorvik.com/

Adam
A: 

Using your regex, I modified it a bit to suit your need.

<a.*?href=("|')(.*?)("|').*?>(.*)<\/a>

I personally suggest you use an HTML parser.

EDIT: Tested

Ruel
Using myregextester.com: sorry, it doesn't find the links.
bergin
@bergin, hi, I modified my answer and it works now.
Ruel
it says: NO MATCHES. CHECK FOR DELIMITER COLLISION.
bergin
Can you please tell me the text to match? I use: `<a title="this" href="that">what?</a>`
Ruel
+6  A: 

HTML is not regular. Here is how to do it with DOM:

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach( $dom->getElementsByTagName('a') as $node ) {
    echo $dom->saveXML($node), PHP_EOL;
}

The above would find and output the "outerHTML" of all A elements in the $html string.

To get the href attribute you'd do

echo $node->getAttribute( 'href' );

See the linked blog post for an alternate approach with XPath and further information on how to handle any errors stemming from invalid markup.
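
As a rough illustration of the XPath route and the error handling mentioned above, here is a minimal sketch (not the code from the linked post). The $html sample and the $links array are made up for the example, and silencing libxml warnings this way is just one assumed way to cope with invalid markup:

```php
<?php
// Hypothetical input; any HTML string would do.
$html = '<p><a title="this" href="that">what?</a> <a href=\'other\'>more</a></p>';

// Collect libxml parse warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);

// Discard the collected warnings and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$links = array();

// Select only A elements that actually carry an href attribute.
foreach ($xpath->query('//a[@href]') as $node) {
    $links[] = array(
        'href' => $node->getAttribute('href'),
        'text' => $node->textContent,
    );
}

print_r($links);
```

Because the attribute is read from the parsed element, it does not matter where href sits inside the tag or which quote style is used.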

On a side note: I am sure this is a duplicate and you can find the answer somewhere in here.

Gordon
I'm upvoting you because apparently the working regexes (for his example) don't work, and you are right that you should avoid regex for parsing HTML. Although your example doesn't deliver what he asks for (hint: getAttribute), it's a good step in the right direction.
CharlesLeaf
+1  A: 

I agree with Gordon, you MUST use an HTML parser to parse HTML. But if you really want a regex you can try this one:

/^<a.*?href=(["\'])(.*?)\1.*$/

This matches <a at the beginning of the string, followed by any number of any char (non-greedy) .*?, then href= followed by the link surrounded by either " or '. The backreference \1 requires the closing quote to be the same character as the opening one.

$str = '<a title="this" href="that">what?</a>';
preg_match('/^<a.*?href=(["\'])(.*?)\1.*$/', $str, $m);
var_dump($m);

Output:

array(3) {
  [0]=>
  string(37) "<a title="this" href="that">what?</a>"
  [1]=>
  string(1) """
  [2]=>
  string(4) "that"
}
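
If the input holds more than one link, the same backreference idea could be adapted roughly as follows. This is only a sketch under assumptions: the ^ and $ anchors are dropped, [^>]* is used to stay inside one tag, and the sample $html and the $links array are invented for the example. It will still break on nested or unusual markup, which is why a parser is the safer choice.

```php
<?php
// Hypothetical input with two links and both quote styles.
$html = '<a title="this" href="that">what?</a> <a href=\'other\' class="x">more</a>';

// (["\']) captures the opening quote; \1 requires the same quote to close the value.
// [^>]*? keeps the search for href inside the current opening tag.
$pattern = '/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1[^>]*>(.*?)<\/a>/is';

// PREG_SET_ORDER yields one array per matched link.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$links = array();
foreach ($matches as $m) {
    // $m[1] is the quote character, $m[2] the href value, $m[3] the link text.
    $links[] = array('href' => $m[2], 'text' => $m[3]);
}

print_r($links);
```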
M42