I'm looking for a PHP preg_replace() solution to find links to images and replace them with their respective image tags.

Find:

<a href="http://www.domain.tld/any/valid/path/to/imagefile.ext">This will be ignored.</a>

Replace with:

<img src="http://www.domain.tld/any/valid/path/to/imagefile.ext" alt="imagefile" />

Where the protocol MUST be http://, the .ext MUST be a valid image format (.jpg, .jpeg, .gif, .png, .tif), and the base file name becomes the alt="" value.

I know preg_replace() is the right function for the job, but I suck with regex, so any help is greatly appreciated! THANKS!

+7  A: 

Ahh, my daily DOM practice. You should use DOM to parse HTML, and regex to parse strings such as HTML attribute values.

Note: I have some basic regexes that could surely be improved upon by some wizards :)

Note #2: Though it adds overhead, you could use something like curl to check thoroughly whether the href really points to an image, by sending a HEAD request and looking at the Content-Type. The extension check below should still work in 80-90% of cases.

<?php

$content = '

<a href="http://www.domain.tld/any/valid/path/to/imagefile.ext">This will be ignored.</a>
<br>

<a href="http://col.stb.s-msn.com/i/43/A4711309495C88F8CD154C99FCE.jpg">this will not be ignored</a>

<br>

<a href="http://col.stb.s-msn.com/i/A0/8E9A454F701E4F5F89E58E14B532C.jpg">bah</a>
';

$dom = new DOMDocument();
$dom->loadHTML($content);

$anchors = $dom->getElementsByTagName('a');

// Walk the list backwards: getElementsByTagName() returns a live list,
// so replacing nodes while iterating forwards would skip elements.
$i = $anchors->length - 1;

$protocol = '/^http:\/\//';
$ext = '/([\w+]+)\.(?:gif|jpg|jpeg|png)$/';

while ( $i > -1 ) {
    $anchor = $anchors->item($i);
    if ( $anchor->hasAttribute('href') ) {
        $link = $anchor->getAttribute('href');

        // One preg_match() on $ext both tests the extension
        // and captures the base file name in $matches[1].
        if ( preg_match( $protocol, $link ) && preg_match( $ext, $link, $matches ) ) {
            $image = $dom->createElement('img');
            $image->setAttribute('alt', $matches[1]);
            $image->setAttribute('src', $link);
            $anchor->parentNode->replaceChild( $image, $anchor );
        }
    }
    $i--;
}

echo $dom->saveHTML();
meder
Too long... this can be done with a preg_replace. Look at my answer.
Seb
A regular expression solution is too prone to failing; I'll stick with DOM, but thanks.
meder
Plus, a DOM solution is far more flexible: you can do any DOM operation you want, whereas a regex replacement limits you.
meder
+9  A: 

Congratulations, you are the one millionth customer to ask Stack Overflow how to parse HTML with regex!

[X][HT]ML is not a regular language and cannot reliably be parsed with regex. Use an HTML parser. PHP itself gives you DOMDocument, or you may prefer simplehtmldom.

Incidentally, you cannot tell what type a file is by looking at its URL. There is no reason a JPEG has to have ‘.jpeg’ as its extension — and indeed, no guarantee that a file with ‘.jpeg’ extension will actually be JPEG. The only way to be certain is to fetch the resource (eg. using a HEAD request) and look at the Content-Type header.
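For what it's worth, the HEAD-request check could look roughly like the sketch below. The function names isImageContentType() and urlServesImage() are made up for illustration, the 5-second timeout is an arbitrary assumption, and the cURL extension is assumed to be installed. Treat it as a starting point rather than a hardened implementation.

```php
<?php
// Hypothetical helper: does a Content-Type header value name an image type?
function isImageContentType(?string $contentType): bool
{
    if ($contentType === null) {
        return false;
    }
    // Drop parameters such as "; charset=binary" and normalise case.
    $mime = strtolower(trim(explode(';', $contentType)[0]));
    return strncmp($mime, 'image/', 6) === 0;
}

// Hypothetical helper: send a HEAD request and inspect the Content-Type.
// Assumes the cURL extension is available; the timeout is an arbitrary choice.
function urlServesImage(string $url): bool
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,  // HEAD: ask for headers only
        CURLOPT_RETURNTRANSFER => true,  // do not echo the response
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects to the final resource
        CURLOPT_TIMEOUT        => 5,
    ]);
    $ok   = curl_exec($ch) !== false;
    $type = $ok ? curl_getinfo($ch, CURLINFO_CONTENT_TYPE) : null;
    return isImageContentType(is_string($type) ? $type : null);
}
```

You would call urlServesImage($link) on each candidate href before swapping the anchor for an image. Bear in mind it costs one network round trip per link, and some servers answer HEAD requests incorrectly.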

bobince
-1 This doesn't solve the problem; and no one cares about parsing HTML with a regex - if you're the one validating the images and creating the markup, then you can be fairly sure everything will work just fine.
Seb
Indeed. However, the questioner has not stated that the format of the markup is under their control.
bobince
Nor did he say it's not. You don't know anything about the context, so this should not be an answer but a comment to the question.
Seb
The formatted markup is under my control. This answer is almost irrelevant.
Dolph
+1  A: 

I would suggest using this more flexible non-greedy regex:

<a[^>]+?href=\"(http:\/\/[^\"]+?\/([^\"\/]*?)\.(jpg|jpeg|png|gif))[^>]*?>[^<]*?<\/a>
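As a rough sketch of how a pattern along these lines plugs into preg_replace(): the variant below is my own simplification, not a drop-in copy. It assumes double-quoted href values, uses ~ as the delimiter to avoid escaping slashes, keeps `/` out of the file-name group so that alt receives only the base name, requires the closing quote right after the extension, and adds tif from the question's list. The $html sample is made up.

```php
<?php
// Made-up sample input: one image link and one ordinary link.
$html = '<p><a href="http://www.domain.tld/any/valid/path/to/imagefile.jpg">This will be ignored.</a></p>'
      . '<p><a href="http://www.domain.tld/page.html">A normal link.</a></p>';

// $1 = full URL, $2 = base file name (no slashes allowed inside it).
// The extension must be followed directly by the closing quote.
$pattern = '~<a[^>]+?href="(http://[^"]+/([^"/]+)\.(?:jpe?g|gif|png|tif))"[^>]*>[^<]*</a>~i';

$result = preg_replace($pattern, '<img src="$1" alt="$2" />', $html);

echo $result;
```

The image link becomes an img tag with alt="imagefile", and the link to page.html is left alone. Note that the i flag also lets HTTP:// and .JPG through; drop it if the match has to be case-sensitive.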

And a more complex regex (including PHP test code) to hopefully please Gumbo :)

<?php
$test_data = <<<END
<a blabla="asldlsaj" alksjada="aslkdj" href="http://www.domain.tld/any/valid/path/to/imagefile.jpg" lkjasd=""asdlaskjd>This will be ignored.</a>
Lorem ipsum..
<a    blabla=asldlsaj alksjada="aslkdj" href="http://www.domain.tld/any/valid/path/to/imagefile.jpg" lkjasd=""asdlaskjd>This will be ignored.</a>
<a lkjafs='asdsa> ' blabla="asldlksjada=>"aslkdj" href="http://www.domain.tld/any/valid/path/to/imagefile.jpg" lkjasd=""asdlaskjd>This will be ignored.</a>
<a    blabla="ajada="aslk href="http://www.domain.tld/any/valid/path&gt;/to/imagefile.jpg" lkjasd>asdlaskjd>This will be ignored.</a>
<a    blabla="asldlsaj>" aslkdj href="http://www.domain.tld/any/valid/path/ to/imagefile.jpg" lkjasd=""asdlaskjd>This will be ignored.</a>
Something:
<a    blabla='asldls<ajslkdj' href="http://www.domain.tld/any/valid'/path/to/imagefile.jpg" lkjasd=""asdlaskjd>This will be ignored.</a>
<a    blabla=  asldlsadj href="http://www.domain.tld/any/valid/path/to/imagefile.jpg" lkjasd>This will be ignored.</a>
<a blabla="asldlsaj" alksjslkdj" href='http://www.domain.tld/any/valid/path/to/imagefile.jpg' lkjasdskjd>This will be ignored.</a>
Something else...
<a    blabla="asldlsaj" alksjslkdj" href='http://www.domain.tld/any/valid/path/to/imagefile.jpg' lkjasdskjd>This will be ignored.</a>
<a    blabla="asldlsaj" alksjada="aslkdj" href=http://www.domain.tld/any/valid/path/to/imagefile.jpg lkjdlaskjdll> be ignored.</a>
END;
$regex = "/<a\s(\s*\w+(\s*=\s*(\".*?\"|'.*?'|[^'\">\s]+))?)+?\s+href\s*=\s*(\"(http:\/\/[^\"]+\/(.*?)\.(jpg|jpeg|png|gif))\"|'(http:\/\/[^']+\/(.*?)\.(jpg|jpeg|png|gif))'|(http:\/\/[^'\">\s]+\/([^'\">\s]+)\.(jpg|jpeg|png|gif)))\s(\s*\w+(\s*=\s*(\".*?\"|'.*?'|[^'\">\s]+))?)+>[^<]*?<\/a>/i";
$replaced = preg_replace($regex, '<img src="$5$8$11" alt="$6$9$12" />', $test_data);

echo '<pre>'.htmlentities($replaced);
?>
allanmc
Attribute values are allowed to contain a literal `>`.
Gumbo