I need a way to take a block of HTML code and make all URLs absolute. I've tried to adopt various regex examples out there but had no luck. These are the requirements:

  • Replace both HREF and SRC urls
  • If URL is already absolute, leave it
  • If URL is relative, replace it with the absolute version

Each block of HTML comes from a known URL (example.com/folder/file.html), which can be used to build the absolute URLs. For example:

  src="image.png" becomes src="http://example.com/folder/image.png"
  href="/home.html" becomes href="http://example.com/home.html"

I have found a function which does exactly what I need:

http://nashruddin.com/PHP_Script_for_Converting_Relative_to_Absolute_URL

But I can't figure out how to do it in bulk, for all URLs in a block of code.

Any help would be great!

Cheers.

+1  A: 

Don't use regular expressions to parse (X)HTML. What you want to do is use an SGML or XML parser to find the relevant element attributes, and then use a regular expression on those attribute values only.
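
As a rough illustration of the parser route, here is a minimal sketch using PHP's bundled DOM extension. The function name rewrite_urls_with_dom and the stand-in converter are made up for this example; the real converter would be whatever relative-to-absolute function you already have.

```php
<?php
// Hypothetical helper: runs $convert over every href and src attribute value.
function rewrite_urls_with_dom(string $html, callable $convert): string
{
    $doc = new DOMDocument();
    // Collect libxml warnings instead of printing them; real-world HTML is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Visit every element that carries an href or a src attribute.
    foreach ($xpath->query('//*[@href or @src]') as $element) {
        foreach (['href', 'src'] as $name) {
            if ($element->hasAttribute($name)) {
                $element->setAttribute($name, $convert($element->getAttribute($name)));
            }
        }
    }
    return $doc->saveHTML();
}

// Usage with a stand-in converter that only tags each URL, to show where the hook goes.
$out = rewrite_urls_with_dom(
    '<a href="/home.html">Home</a> <img src="image.png">',
    function (string $url): string {
        return 'CONVERTED:' . $url;
    }
);
echo $out;
```

Note that loadHTML() wraps a bare fragment in a doctype plus html and body elements, so saveHTML() returns a full document rather than the original snippet. It also assumes ISO-8859-1 unless the markup declares a charset, so non-ASCII text may need extra care.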

You
Hehe. Point taken. At the moment, this is an experimental hack, but I'll definitely be considering a proper parser in the future. Thanks.
Peter Watts
+1  A: 

Something like this may work:

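  // Match href="..." or src="..." (either quote style) and hand each hit to replace().
  $html = preg_replace_callback(
      '~((href|src)\s*=\s*[\"\'])([^\"\']+)~i',
      'replace',
      $html);

  // $x[1] is the attribute name, "=" and opening quote; $x[3] is the URL itself.
  function replace($x) {
     $url = $x[3];
     $url = your_url_conversion_function($url); // placeholder for your own converter
     return $x[1] . $url;
  }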

This will fail if your HTML contains "href" or "src" outside tags, as in <h1> how to use "src=" </h1>: the pattern matches there too, and the text after the quote gets treated as a URL. That's why people usually suggest dedicated parsers, and not regexps, for HTML.
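
To show how the base URL can reach the callback, here is a sketch that fills in the conversion step with a simplified resolver. The name make_absolute is hypothetical, and it is not the function from the linked page. It assumes the base URL includes a scheme and host (http://example.com/folder/file.html rather than example.com/folder/file.html) and does not try to cover every corner of URL resolution.

```php
<?php
// Hypothetical helper: resolves $url against $base, the page the HTML came from.
// Assumes $base is an absolute URL with scheme and host.
function make_absolute(string $url, string $base): string
{
    // Already absolute (has a scheme such as http:, https:, mailto:): leave it.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $url)) {
        return $url;
    }
    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative URL (//host/path): only the scheme is missing.
    if (strpos($url, '//') === 0) {
        return $parts['scheme'] . ':' . $url;
    }
    // Root-relative URL (/path): attach it to scheme and host.
    if (strpos($url, '/') === 0) {
        return $root . $url;
    }

    // Everything else is relative to the folder of the base document.
    $basePath = $parts['path'] ?? '/';
    preg_match('~^([^?#]*)(.*)$~s', $url, $m);
    $path = $m[1];   // path part of the relative URL
    $tail = $m[2];   // query string and/or fragment, kept untouched
    if ($path === '') {
        return $root . $basePath . $tail;   // e.g. "#top" or "?page=2"
    }
    $dir = substr($basePath, 0, strrpos($basePath, '/') + 1);
    $combined = $dir . $path;

    // Collapse "." and ".." segments.
    $segments = [];
    foreach (explode('/', $combined) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '' && $segment !== '.') {
            $segments[] = $segment;
        }
    }
    $resolved = '/' . implode('/', $segments);
    if (substr($combined, -1) === '/' && $resolved !== '/') {
        $resolved .= '/';   // keep a trailing slash such as "sub/"
    }
    return $root . $resolved . $tail;
}

// A closure carries $base into the callback, so no global variable is needed.
$base = 'http://example.com/folder/file.html';
$html = '<img src="image.png"> <a href="/home.html">Home</a>';
$html = preg_replace_callback(
    '~((?:href|src)\s*=\s*["\'])([^"\']+)~i',
    function (array $m) use ($base) {
        return $m[1] . make_absolute($m[2], $base);
    },
    $html
);
echo $html, "\n";
// prints: <img src="http://example.com/folder/image.png"> <a href="http://example.com/home.html">Home</a>
```

The same closure can be passed to a DOM-based walker instead of preg_replace_callback if you later switch to a parser.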

stereofrog
As far as I can see, this works a treat. I know that regex in html will never be perfect, but this will do the trick for the time being. Thanks for the speedy reply!
Peter Watts