I want to match both the src and title attributes of an image tag:

pattern:

<img [^>]*src=["|\']([^"|\']+["|\'])|title=["|\']([^"|\']+)

target:

<img src="http://someurl.jpg" class="quiz_caption" title="Caption goes here!">

This pattern gives me one unwanted match, title="content", along with the match I actually want, which is the value between the quotes after the word 'title', i.e. 'content'.

So, my matches are:

<img src="http://someurl.jpg
http://someurl.jpg
title="Caption goes here!"
Caption goes here!

Is there a way to avoid the third of these matches? I'm using PCRE in PHP 5.2.x

+3  A: 

You can't parse HTML with regular expressions unless you know you're dealing with a restricted subset of HTML. Your regex, even if correct, would fail if, for example, any attribute value contained a > character.
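To illustrate the point about the > character, here is a minimal sketch. The alt attribute and the simplified src-only pattern are assumptions made up for this demonstration; they are not from the question.

```php
<?php
// Hypothetical input: the alt value contains a literal ">" character.
$html = '<img alt="a > b" src="http://someurl.jpg" title="Caption goes here!">';

// Simplified src-only pattern in the spirit of the question's regex (an assumption).
$pattern = '/<img [^>]*src=["\']([^"\']+)["\']/';

// [^>]* cannot cross the ">" inside alt, so the regex never reaches src.
$regexFound = (preg_match($pattern, $html, $m) === 1);

// The DOM parser treats the quoted ">" as part of the attribute value.
$doc = new DOMDocument();
$doc->loadHTML($html);
$domSrc = $doc->getElementsByTagName('img')->item(0)->getAttribute('src');

var_dump($regexFound);
echo $domSrc . "\n";
```

Under these assumptions the regex finds nothing, while the DOM lookup still returns the src value.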

With the DOM extension:

<?php
$target = <<<EOD
<img src="http://someurl.jpg" class="quiz_caption" title="Caption goes here!">
EOD;

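// Parse the fragment; the DOM extension handles quoting and escaping.
$d = new DOMDocument();
$d->loadHTML($target);

// Collect all <img> elements (returned as a DOMNodeList).
$img = $d->getElementsByTagName("img");

// Read the attributes of the first image.
echo $img->item(0)->getAttribute("src") . "\n";
echo $img->item(0)->getAttribute("title") . "\n";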
Artefacto
+1, read this if you're not convinced: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
greg0ire
+1  A: 

If you know exactly what you are looking for you could try this:

src="(.+?)"|title="(.+?)"
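As a rough sketch of how the pattern above might be used in PHP: element 0 of each match is the full matched text (such as title="..."), and the values you want are in the numbered capture groups. Reading only the groups sidesteps the unwanted full match. The $attrs array is a name invented for this example.

```php
<?php
$target = '<img src="http://someurl.jpg" class="quiz_caption" title="Caption goes here!">';

// Group 1 captures the src value, group 2 captures the title value.
preg_match_all('/src="(.+?)"|title="(.+?)"/', $target, $matches, PREG_SET_ORDER);

$attrs = array(); // hypothetical container for the extracted values
foreach ($matches as $m) {
    // $m[0] is the whole match; skip it and read only the capture groups.
    if (isset($m[2]) && $m[2] !== '') {
        $attrs['title'] = $m[2];
    } elseif (isset($m[1]) && $m[1] !== '') {
        $attrs['src'] = $m[1];
    }
}

echo $attrs['src'] . "\n";
echo $attrs['title'] . "\n";
```

This only holds for double-quoted attributes on well-behaved input, which is the "you know exactly what you are looking for" condition stated above.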

I would also recommend experimenting at http://gskinner.com/RegExr/, an online regex tester written in Flash. It can help you improve your knowledge, and it also has many pre-built expressions contributed by the community.

Prix