I need a regex that will give me the string inside the quotes of an href attribute.

For example, I need to extract theurltoget.com from the following:

<a href="theurltoget.com">URL</a>

Additionally, I only want the base URL part, i.e. from http://www.mydomain.com/page.html I only want http://www.mydomain.com/

+1  A: 
/href="(https?:\/\/[^\/"]*)/

I think you should be able to handle the rest.
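
As a rough sketch of "the rest", assuming PHP (which the other answers use) and a made-up sample string: the pattern can be fed to preg_match_all, using `#` as the delimiter so the slashes in `://` need no escaping. The variable names here are illustrative only.

```php
<?php
// Hypothetical sample input with two absolute links.
$html = '<a href="http://www.mydomain.com/page.html">URL</a> '
      . '<a href="https://example.org/a/b?c=d">Other</a>';

// Capture scheme plus host, stopping at the first slash or closing quote.
preg_match_all('#href="(https?://[^/"]+)#i', $html, $matches);

// $matches[1] holds the captured base URLs; append the trailing slash the question asks for.
$baseUrls = array_map(function ($u) { return $u . '/'; }, $matches[1]);

print_r($baseUrls);
```

This only finds double-quoted, absolute http/https links; relative or unquoted hrefs are skipped.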

Adam Byrtek
+1  A: 

This will also handle the case where there are no quotes around the URL. Note that it expects href to be the last attribute in the tag.

/<a [^>]*href="?([^">]+)"?>/

But seriously, do not parse HTML with regex. Use DOM or a proper parsing library.
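
A minimal sketch of the DOM route, assuming PHP's bundled DOM extension is available. The function name extractBaseUrls and the sample markup are made up for illustration; the base URL is rebuilt from parse_url() in the form the question asks for.

```php
<?php
// Hypothetical helper: returns scheme://host/ for every absolute link in the markup.
function extractBaseUrls(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $baseUrls = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $parts = parse_url($anchor->getAttribute('href'));
        // Relative links have no scheme or host, so skip them.
        if (isset($parts['scheme'], $parts['host'])) {
            $baseUrls[] = $parts['scheme'] . '://' . $parts['host'] . '/';
        }
    }
    return $baseUrls;
}

// Sample markup (an assumption): one absolute link with an extra attribute, one relative link.
$sample = '<a class="x" href="http://www.mydomain.com/page.html">URL</a><a href="/local">Local</a>';
print_r(extractBaseUrls($sample));
```

Unlike the regex, this does not care about attribute order or which quote style is used.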

kijin
A: 

http://www.the-art-of-web.com/php/parse-links/

Let's start with the simplest case - a well formatted link with no extra attributes:

/<a href=\"([^\"]*)\">(.*)<\/a>/iU
jnpcl
+1  A: 
$html = '<a href="http://www.mydomain.com/page.html">URL</a>';

// [^"]+ stops at the closing quote; a greedy .+ could run past it
if (preg_match('/<a href="([^"]+)">/', $html, $match)) {
    $info = parse_url($match[1]);

    echo $info['scheme'].'://'.$info['host']; // http://www.mydomain.com
}
Alec
A: 

Don't use regex for this. You can use XPath and built-in PHP functions to get what you want:

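    // Note: simplexml_load_string() needs well-formed XML; for real-world HTML use DOMDocument::loadHTML()
    $xml = simplexml_load_string($myHtml);
    $list = $xml->xpath("//@href");

    $preparedUrls = array();
    foreach($list as $item) {
        $parts = parse_url((string) $item);
        // skip relative links, which have no scheme or host
        if (isset($parts['scheme'], $parts['host'])) {
            $preparedUrls[] = $parts['scheme'] . '://' . $parts['host'] . '/';
        }
    }
    print_r($preparedUrls);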
Drew