I am trying to scrape img src attributes with PHP. I can get the src fine, but if the src does not include the full path then I can't really reuse it. Is there a way to grab the full path of the image using PHP? (Browsers can get it if you use the right-click menu.)

i.e., how do I get a FULL path, including the domain, in either of the following two examples?

src="../foo/logo.png"
src="/images/logo.png"

Thanks,

Allan

+2  A: 

Unless you have the site URL you're starting with (in which case you can prepend it to the value of the src attribute) it seems like all you're left with there is a string.

I'm assuming you don't have access to any additional information of course. If you're parsing HTML, I'd assume you must be able to access an absolute URL to at least the HTML page, but perhaps not.

James Inman
Yeah, someone enters a URL into a form, which is biffed into this script, which chucks things into a DB, which is called from another page, so I could prepend the domain, but I was wondering if there were a more elegant solution. Regex is not my favorite pastime.
Allansideas
+2  A: 

You don't need a regex... just some patience. I don't really want to write the code for you, but just check if the src starts with http:// (or https://), and if not, you have three different cases.

  1. If it begins with a / then prepend http://domain.com
  2. If it begins with .. you'll have to split the full URL and hack off pieces until the src no longer starts with ..
  3. Else (it begins with a letter), take the full URL of the page, strip it down to the last slash, then append the src.
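The cases above can be sketched as a small classifier. This is only an illustration: classifySrc and its return labels are hypothetical names, not part of any library, and protocol-relative values (starting with //) are assumed not to occur.

```php
<?php
// Hypothetical helper: names which of the cases above a src value falls into.
function classifySrc(string $src): string {
    // parse_url() yields a scheme only for values like http://... or https://...
    if (parse_url($src, PHP_URL_SCHEME)) {
        return 'absolute';           // nothing to do, the src is already complete
    }
    if (strncmp($src, '/', 1) === 0) {
        return 'root-relative';      // case 1: prepend scheme and host
    }
    if ($src === '..' || strncmp($src, '../', 3) === 0) {
        return 'parent-relative';    // case 2: walk up from the page's directory
    }
    return 'document-relative';      // case 3: append to the page's directory
}
```

No regex is involved; the scheme check relies on PHP's built-in parse_url().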

Or... be lazy and steal this script:

$url = "http://www.goat.com/money/dave.html";
$rel = "../images/cheese.jpg";

$com = InternetCombineUrl($url, $rel);

//  Returns http://www.goat.com/images/cheese.jpg

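function InternetCombineUrl($absolute, $relative) {
    // A relative URL that already has a scheme is absolute; return it unchanged.
    $p = parse_url($relative);
    if (!empty($p["scheme"])) return $relative;

    // Read the base URL's parts explicitly; extract() leaves missing ones undefined.
    $base   = parse_url($absolute);
    $scheme = $base["scheme"] ?? "";
    $user   = $base["user"] ?? "";
    $pass   = $base["pass"] ?? "";
    $host   = $base["host"] ?? "";
    $port   = $base["port"] ?? "";
    $path   = $base["path"] ?? "/";

    // Directory of the base document: everything before the last slash.
    $dir = substr($path, 0, (int) strrpos($path, "/"));

    if ($relative !== "" && $relative[0] == '/') {
        $parts = explode("/", $relative);
    }
    else {
        $parts = array_merge(explode("/", $dir), explode("/", $relative));
    }

    // Resolve "." and ".." with a stack so that each ".." removes exactly one directory.
    $cparts = array();
    foreach ($parts as $part) {
        if ($part === "" || $part === '.') {
            continue;
        }
        if ($part === '..') {
            array_pop($cparts);
            continue;
        }
        $cparts[] = $part;
    }
    $path = implode("/", $cparts);
    $url = "";
    if ($scheme) {
        $url = "$scheme://";
    }
    if ($user) {
        $url .= "$user";
        if ($pass) {
            $url .= ":$pass";
        }
        $url .= "@";
    }
    if ($host) {
        $url .= $host;
        if ($port) {
            $url .= ":$port";
        }
        $url .= "/";
    }
    $url .= $path;
    return $url;
}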

From http://www.web-max.ca/PHP/misc_24.php

Mark
Perfect Thanks!
Allansideas
You have not considered the case of the BASE tag: http://www.w3.org/TR/html401/struct/links.html#h-12.4
Viet
@Viet: Good point. Not too hard to factor in though.
Mark
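One way the BASE tag might be factored in is sketched below, assuming the page's HTML is already available as a string and PHP's DOM extension is enabled. effectiveBaseUrl is a hypothetical helper name, and the sketch assumes any BASE href is an absolute URL, as HTML 4.01 requires.

```php
<?php
// Hypothetical helper: returns the URL that relative links in $html
// should be resolved against (the BASE href if present, else the page URL).
function effectiveBaseUrl(string $html, string $pageUrl): string {
    $doc = new DOMDocument();
    // Scraped HTML is rarely valid, so parser warnings are silenced with @.
    if ($html === '' || !@$doc->loadHTML($html)) {
        return $pageUrl;
    }
    foreach ($doc->getElementsByTagName('base') as $base) {
        $href = trim($base->getAttribute('href'));
        if ($href !== '') {
            return $href;   // assumption: the href is absolute
        }
    }
    return $pageUrl;        // no usable BASE tag found
}
```

The returned value would then be used in place of the page URL when resolving each src.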