I want to use a regex to replace the src attributes of img tags. The HTML is not malformed and, fortunately, takes the same form on every page in the database, i.e.

<img src="http://x.y/z/1.png" />

I have code that works fine if there's only one image in the page. I want to know the best way to replace multiple images, as this one will replace all the image tags with the same string.

$result = $s->db_query("SELECT reviewFullText as f FROM reviews WHERE reviewsID = 155");
while($row = mysql_fetch_array($result))
{
    $body = stripslashes(html_entity_decode($row['f'], ENT_NOQUOTES, "UTF-8"));
    preg_match_all('/<img.*?(src\=[\'|"]{0,1}.*?[\'|"]{0,1})[\s|>]{1}/i', $body, $matches);
    for($i=0;$i<count($matches[0]);$i++)
    {
     $number = preg_replace("/[^0-9]/", '', $matches[0][$i]);
     echo preg_replace('/<img.*?(src\=[\'|"]{0,1}.*?[\'|"]{0,1})[\s|>]{1}/i', '<img src="http://x.y/a/' . $number . '.png"', $matches[0][$i]);
    }
}

So if the page contains two files, one called 1.png and one called 2.png, the script should parse the numbers out and point each image at a different URL, such as http://x.y/a/1.png and http://x.y/a/2.png.

I've heard preg_replace_callback is the best way to do this but I have no idea how to get this working... Help!

+14  A: 

Don't use regular expressions for irregular languages like HTML. Use a parser instead. It will save you a lot of time and pain.

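# Untested code:
$xml = new SimpleXMLElement($xmlString);
foreach ($xml->xpath('//img') as $imgNode) {
    // Overwrite the existing src; addAttribute() would fail because src is already set.
    // basename() keeps only the file name, e.g. "1.png".
    $imgNode['src'] = 'http://x.y/a/' . basename((string) $imgNode['src']);
}
echo $xml->asXML();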

Note that you will need something like DOMDocument::loadHTML() if your HTML is not XHTML (i.e. valid XML), but the idea remains the same.
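For the non-XHTML case, the same idea could look roughly like the sketch below. It assumes the DOM extension is available and that every src has the form shown in the question; the sample `$html` string and the wrapper `<div>` are made up for illustration.

```php
<?php
// Hypothetical sample input in the form the question describes.
$html = '<p>One</p><img src="http://x.y/z/1.png" /><p>Two</p><img src="http://x.y/z/2.png" />';

$doc = new DOMDocument();
// Wrap the fragment in a div so it can be pulled back out after parsing;
// loadHTML() adds the surrounding html/body elements on its own.
$doc->loadHTML('<div>' . $html . '</div>');

foreach ($doc->getElementsByTagName('img') as $img) {
    // basename() keeps only the file name ("1.png"), which is re-rooted under /a/.
    $img->setAttribute('src', 'http://x.y/a/' . basename($img->getAttribute('src')));
}

// Serialize only the children of the wrapper div, not the whole document.
$wrapper = $doc->getElementsByTagName('div')->item(0);
$out = '';
foreach ($wrapper->childNodes as $child) {
    $out .= $doc->saveHTML($child);
}
echo $out;
```

Because only img elements are touched, text that merely looks like an img tag (inside a script block, for instance) is left alone. Note that saveHTML() serializes as HTML rather than XHTML, so the output may not be byte-for-byte identical to the input markup.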

soulmerge
+1 regex is not at all suitable for processing [X][HT]ML. However shouldn't the XPath be `//img`? DOM getElementsByTagName would work fine too. I have no idea what the `stripslashes(html_entity_decode())` over the whole document is supposed to achieve in the original code; this will only mangle the document.
bobince
@bobince: Thanks for pointing out the '//img' error. I think the `stripslashes(...` part is for 'sanitizing' the value (which might be a good indication that the storage/retrieval of the document needs a re-design.)
soulmerge
-1 for ignoring the specific question. As a rule you don't want to use regexp, but he clearly stated that all the elements he wants to replace look exactly the same, so for this case regexp is a better solution.
amikazmi
Ok, altered the answer to address the specific question. But even if all img tags look the same, using RE is not a good idea for several reasons: 1.) The RE would even replace non-nodes (text inside script tags, for example). 2.) It is more error-prone and harder to debug than the PHP code, and thus 3.) it is harder to maintain than the PHP code.
soulmerge
Seeing as you led me on the right track I'll mark this as accepted. Also thanks to TrueWill, that post has a good library which I've used to do what I needed to do. Thanks all. :-)
different
+1  A: 

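In Perl or JavaScript you would add the global replace flag "g" to your regex:

/your_regex/ig

PHP's preg functions do not accept a "g" modifier, though: preg_replace() already replaces every match unless you pass a limit, so no extra flag is needed there.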
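If you do stay with a regex, the per-image substitution the question asks about can be done in one pass with preg_replace_callback(). This is only a sketch under the question's own assumption that every tag has exactly the form `<img src="http://x.y/z/N.png" />`; the sample `$body` string is invented for illustration.

```php
<?php
// Hypothetical sample input in the form the question describes.
$body = '<p>One</p><img src="http://x.y/z/1.png" /><p>Two</p><img src="http://x.y/z/2.png" />';

$rewritten = preg_replace_callback(
    // Capture the number in the file name; "~" is used as the delimiter
    // so the slashes in the URL need no escaping.
    '~<img src="http://x\.y/z/(\d+)\.png" />~i',
    function (array $m) {
        // $m[0] is the whole tag, $m[1] the captured number.
        return '<img src="http://x.y/a/' . $m[1] . '.png" />';
    },
    $body
);

echo $rewritten;
```

The callback runs once per match, so each image keeps its own number. For a substitution this simple, plain preg_replace() with `$1` in the replacement string would do the same job; the callback only pays off when the new URL needs real logic.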

As soulmerge suggested, since your HTML is not malformed (I assume you mean it is well-formed XML), an XSLT transformation would also be an effective way to alter anything in your document. You could match on the @src attribute and alter it as required.

You can also match on any other tags / attributes if you need to alter some other parts of the document at the same time.
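A rough sketch of that approach, assuming PHP's XSL extension (XSLTProcessor) is installed and the input really is well-formed XML. The sample `$xhtml` string and the fixed `http://x.y/z/` prefix are assumptions taken from the question's example, not a general solution.

```php
<?php
// Hypothetical well-formed sample input.
$xhtml = '<div><img src="http://x.y/z/1.png" /><p>text</p><img src="http://x.y/z/2.png" /></div>';

$xslSource = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <!-- Identity template: copy every node and attribute unchanged by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- Override for img/@src: swap the /z/ prefix for /a/. -->
  <xsl:template match="img/@src[starts-with(., 'http://x.y/z/')]">
    <xsl:attribute name="src">
      <xsl:value-of select="concat('http://x.y/a/', substring-after(., 'http://x.y/z/'))"/>
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>
XSL;

$doc = new DOMDocument();
$doc->loadXML($xhtml);

$xsl = new DOMDocument();
$xsl->loadXML($xslSource);

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);

$result = $proc->transformToXml($doc);
echo $result;
```

Further templates can be added next to the img/@src one to rewrite other tags or attributes in the same pass, while the identity template leaves everything else untouched.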

Thiyagaraj