I have the following code on my website. It finds the images in a block of HTML whose src doesn't start with http:// or /, and prepends the website URL to the image source in that case.

For example:

<img src="http://domain.com/image.jpg"&gt; will stay the same
<img src="/image.jpg"> will stay the same
<img src="image.jpg"> will be changed to <img src="http://domain.com/image.jpg"&gt;

I feel my code is really inefficient... Any ideas on how I could make it run with less code?

preg_match_all('/<img[\s]+[^>]*src\s*=\s*[\"\']?([^\'\" >]+)[\'\" >]/i', $content_text, $matches);
if (isset($matches[1])) {
  foreach($matches[1] AS $link) {
    if (!preg_match("/^(https?|ftp)\:\/\//i", $link) && !preg_match("/^\//", $link)) {
      $full_link = get_option('siteurl') . '/' . $link;
      $content_text = str_replace($link, $full_link, $content_text);
    }
  }
}
+4  A: 

For a start you could stop using regular expressions to process HTML, particularly when what you're doing is so easily done with an HTML parser (of which PHP has at least 3). For example:

$dom = new DOMDocument;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
  $src = $image->getAttribute('src');
  $url = parse_url($src);
  // http_build_url() is provided by the pecl_http extension, not core PHP
  $image->setAttribute('src', http_build_url('http://www.mydomain.com', $url));
}
$html = $dom->saveHTML();

Problem solved. Well, almost. The case where you add the hostname to relative URLs but not to those beginning with / is a little puzzling and not handled in this snippet, but it's a relatively minor change (it involves checking $url['path']).
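As a rough sketch of that minor change, something like the following might work. The function name absolutize_img_src and the $base parameter are made up for illustration, and it concatenates strings instead of calling http_build_url() so that it only needs core PHP. Note that loadHTML() wraps a fragment in a full document, so saveHTML() returns the doctype, html and body tags as well.

```php
<?php
// Hypothetical helper: prefixes only relative image paths with $base.
function absolutize_img_src(string $html, string $base): string
{
    // Collect libxml parse warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('img') as $image) {
        $src = $image->getAttribute('src');
        $url = parse_url($src);
        // Skip values that are empty or unparsable, that carry a scheme or
        // host (http://..., //cdn...), or whose path starts with "/".
        if ($src === '' || $url === false
            || isset($url['scheme']) || isset($url['host'])
            || !isset($url['path']) || substr($url['path'], 0, 1) === '/') {
            continue;
        }
        $image->setAttribute('src', rtrim($base, '/') . '/' . $src);
    }
    return $dom->saveHTML();
}
```

Calling absolutize_img_src($html, 'http://domain.com') should leave the first two examples from the question alone and rewrite only the third.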

See Parse HTML With PHP And DOM, the Document Object Model, parse_url() and http_build_url(). PHP has much better tools for this than regular expressions.

Oh and for good measure read Parsing Html The Cthulhu Way.

cletus
A: 

Trying to match HTML with regular expressions is very difficult.

Even though your code may seem to work, there is a good chance that some IMG tags will slip through as they are not in the exact format you have described.

Jon Winstanley
A: 

This isn't tested, but I'm thinking something like this...

preg_match_all('/<img\b[^>]*\bsrc\s*=\s*[\'"]?([^\'">]*)/i', $content_text, $matches);
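If you stay with regex, one way to avoid the separate loop and the document-wide str_replace is to rewrite each match in place with preg_replace_callback. This is only a sketch: the function name prefix_relative_img_src and the $base parameter are invented here, the pattern is a variant of the one above with the text before the value captured too, and it inherits that pattern's limits (unquoted values containing spaces, attributes such as data-src).

```php
<?php
// Hypothetical helper; $base is an assumed site URL such as http://domain.com
function prefix_relative_img_src(string $html, string $base): string
{
    $pattern = '/(<img\b[^>]*\bsrc\s*=\s*[\'"]?)([^\'">]*)/i';
    $result = preg_replace_callback($pattern, function (array $m) use ($base) {
        // $m[1] is everything up to the src value, $m[2] is the value itself.
        if ($m[2] === '' || preg_match('~^(?:/|(?:https?|ftp)://)~i', $m[2])) {
            return $m[0]; // empty, root-relative, or absolute: leave untouched
        }
        return $m[1] . rtrim($base, '/') . '/' . $m[2];
    }, $html);
    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}
```

Because each src is replaced where it was matched, an image name that also appears elsewhere in the text is not touched.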
Matt Huggins
+4  A: 

Maybe a completely different approach may work, too:

<base href="http://domain.com/" />

Martin
Oh man. I never knew about this tag. Thanks for posting a reference to it.
Platinum Azure
A: 

Now, all the cool kids are going to tell you not to use regex to parse html. This is mostly because of HTML's tree context. While I usually agree with the cool kids, a simple replace like what you're doing is perfectly fine for regex. In fact I would consider it a waste of resources to bother throwing DomDocument (or any other parser) at this problem.

Here's an easy one-liner for what you want:

// (?!...) skips src values that already start with / or http(s)://
$output = preg_replace('/(<img[^>]*)src="(?!\/|https?:\/\/)([^"]+")/', '$1src="http://domain.com/$2', $input);
Matt