Given the following code:

<body>
  <img src="source.jpg" />
  <p>
    <img src="source.jpg" id ="hello" alt="nothing" />
    <img src="source.jpg" id ="world"/>
  </p>
</body>

What's the best way, using a regular expression (or something better), to transform it so that it becomes this:

<body>
  <img src="source.jpg" id="img_0" />
  <p>
    <img src="source.jpg" id ="img_1"  alt="nothing" />
    <img src="source.jpg" id ="img_2"/>
  </p>
</body>

In other words:

  • Every <img /> tag gets an id attribute.

  • The id attribute should contain an incrementing value (this isn't really the problem, though, as it's just part of the replace procedure).

I guess two passes are needed: one to remove all the existing id attributes and another to populate the new ones?

+1  A: 

With appropriate escaping (that I can never remember without trial and error), and something to increment the img_number, you want to replace something like this:

(<img .*?)(?:id=".*")?(.*?/>)

with something like this:

\1 id="img_$i"\2
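
As a rough illustration of the "something to increment the img_number" part, here is a minimal two-pass sketch in PHP. The function name renumber_img_ids() is hypothetical, and the sketch assumes id values are double-quoted. It splits the work into two passes because an optional group such as (?:id=".*")? is allowed to match nothing, so a single pattern of that shape can leave the old id in place.

```php
<?php
// Hypothetical helper (not from the original answer): two passes over the markup.
function renumber_img_ids($html)
{
    // Pass 1: inside each <img ...> tag, drop any existing double-quoted id attribute.
    $stripped = preg_replace_callback(
        '/<img\b[^>]*>/i',
        function ($m) {
            return preg_replace('/\s+id\s*=\s*"[^"]*"/i', '', $m[0]);
        },
        $html
    );

    // Pass 2: insert a fresh, incrementing id right after each "<img".
    $i = 0;
    return preg_replace_callback(
        '/<img\b/i',
        function ($m) use (&$i) {
            return $m[0] . ' id="img_' . $i++ . '"';
        },
        $stripped
    );
}

echo renumber_img_ids('<p><img src="a.jpg" id ="hello" alt="x" /><img src="b.jpg"/></p>');
// prints: <p><img id="img_0" src="a.jpg" alt="x" /><img id="img_1" src="b.jpg"/></p>
```

The new id ends up as the first attribute rather than in its old position, which should not matter to a browser.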

Sparr
(<img .*?)(id=".*")?(.*?/>) would work better I think...
David Zaslavsky
Not sure if you wrote that before I fixed the syntax... the ?: makes the middle group non-capturing, which speeds regex execution on fast platforms.
Sparr
+1  A: 

I think the best approach is to use preg_replace_callback.

Also I would recommend a slightly more stringent regexp than those suggested so far - what if your page contains an <img /> tag that does not contain an id attribute?

$page = '
<body>
  <img src="source.jpg" />
  <p>
 <img src="source.jpg" id ="hello" alt="nothing" />
 <img src="source.jpg" id ="world"/>
  </p>
</body>';

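function my_callback($matches)
{
    // static keeps the counter alive between callback invocations
    static $i = 0;
    // $matches[1] is everything up to and including the opening quote of the id value
    return $matches[1] . "img_" . $i++;
}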

print preg_replace_callback('/(<img[^>]*id\s*=\s*")([^"]*)/', "my_callback", $page);

Which produces the following for me:

<body>
  <img src="source.jpg" />
  <p>
 <img src="source.jpg" id ="img_0" alt="nothing" />
 <img src="source.jpg" id ="img_1"/>
  </p>
</body>

The regexp has two capturing groups: the first we preserve, the second we replace. I've used lots of negated character classes (e.g. [^>]* = up to the closing >) to keep each match inside a single tag, so <img /> tags aren't required to have id attributes (those without one are left unchanged).
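
If the tags without an id should be numbered too, as in the question's expected output, the callback can match whole <img> tags and decide per tag. This is only a sketch under assumptions: number_all_imgs() is a hypothetical name, id values are assumed to be double-quoted, and the counter lives in a closure instead of a static variable so that each call starts again at 0.

```php
<?php
// Hypothetical helper: one pass, replacing an existing id or adding a missing one.
function number_all_imgs($html)
{
    $i = 0;
    return preg_replace_callback(
        '/<img\b[^>]*>/i',
        function ($m) use (&$i) {
            $id = 'id="img_' . $i++ . '"';
            $count = 0;
            // The lookbehind requires whitespace before "id", so data-id="..." is ignored.
            $tag = preg_replace('/(?<=\s)id\s*=\s*"[^"]*"/i', $id, $m[0], 1, $count);
            if ($count === 0) {
                // No id attribute was found: add one right after "<img".
                $tag = preg_replace('/^<img\b/i', '$0 ' . $id, $tag);
            }
            return $tag;
        },
        $html
    );
}

echo number_all_imgs('<img src="a.jpg" /><img src="b.jpg" id ="hello" alt="n" />');
```

Tags that already had an id keep it in its original position; tags that lacked one get it as their first attribute.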

RobM
+3  A: 
<?php
$data = <<<DATA
<body>
  <img src="source.jpg" />
  <p>
    <img src="source.jpg" id ="hello" alt="nothing" />
    <img src="source.jpg" id ="world"/>
  </p>
</body>
DATA;

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->strictErrorChecking = true;
$doc->xmlStandalone = true;
$doc->formatOutput = true;
// loadXML() expects well-formed XML; the libxml flags suppress parser warnings and errors
$doc->loadXML($data, LIBXML_NOWARNING | LIBXML_NOERROR);

$sNode = $doc->getElementsByTagName("img");

$id = 0;
foreach($sNode as $searchNode)
{
  // setAttribute() overwrites an existing id or creates a new one
  $searchNode->setAttribute('id', "img_$id");
  $id++;
}

// serialize the modified tree back to markup
$result = $doc->saveHTML();
echo $result;
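
loadXML() only accepts well-formed XML, so a page with unclosed <img> tags would fail to load. A hedged variant for ordinary HTML, assuming the dom extension is available, could use loadHTML() instead. The function name number_imgs_with_dom() is hypothetical, and the sketch returns the document so the caller can decide how to serialize it.

```php
<?php
// Hypothetical helper (not part of the original answer): same idea, but with the HTML parser.
function number_imgs_with_dom($html)
{
    $doc = new DOMDocument();
    // Collect parser complaints internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $id = 0;
    foreach ($doc->getElementsByTagName('img') as $img) {
        // Overwrites an existing id or adds a new one.
        $img->setAttribute('id', 'img_' . $id++);
    }
    return $doc;
}

$doc = number_imgs_with_dom('<body><img src="a.jpg"><p><img src="b.jpg" id="hello" alt="x"></p></body>');
foreach ($doc->getElementsByTagName('img') as $img) {
    echo $img->getAttribute('src'), ' => ', $img->getAttribute('id'), "\n";
}
```

Note that loadHTML() may wrap a fragment in html and body elements, so the serialized output can contain more markup than the input did.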
raspi
+1 for actually showing a non-regex solution
Daniel Vandersluis