I have the following regular expression, which is meant to extract the source of any img tag in HTML.

/(<img).*(src\s*=\s*"([a-zA-Z0-9\.;:\/\?&=\-_|\r|\n]{1,})")/isxmU

However, it doesn't appear to be matching the following:

<IMG SRC='http://www.mysite.com/pix/lens/mtf/CAEF8512L.gif'>

How can I build it to match this as well?

A: 

You're only matching double quotes. Try:

/(<img).*(src\s*=\s*("|')([a-zA-Z0-9\.;:\/\?&=\-_|\r|\n]{1,})\3)/isxmU

Note that the first " has been replaced by ("|') and that the ending check is using the backreference to that group, \3.
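
As a quick check of the pattern above, here is a minimal sketch that runs it with preg_match against a double-quoted tag and the single-quoted tag from the question. The example.com URL and the variable names are made up for illustration. Because of the extra quote group, the URL itself ends up in capture group 4.

```php
<?php
// The pattern from this answer, kept in a nowdoc so neither quote needs escaping.
$pattern = <<<'REGEX'
/(<img).*(src\s*=\s*("|')([a-zA-Z0-9\.;:\/\?&=\-_|\r|\n]{1,})\3)/isxmU
REGEX;

// One double-quoted tag (hypothetical URL) and the single-quoted tag from the question.
$samples = [
    '<img src="http://www.example.com/a.png">',
    "<IMG SRC='http://www.mysite.com/pix/lens/mtf/CAEF8512L.gif'>",
];

$found = [];
foreach ($samples as $tag) {
    if (preg_match($pattern, $tag, $m)) {
        // Group 3 is the opening quote, group 4 is the URL between the quotes.
        $found[] = $m[4];
    }
}

print_r($found);
```

Note that URLs containing characters outside the bracketed class (for example % or ~) would still not match, which is one more reason to consider a parser.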

Have you considered using an HTML parser to do this, instead?

jasonbar
A: 

You really should not use regex for this kind of thing: HTML is not regular enough for regular expressions.

What if you have, for instance, one of these:

<img src="..." />
<img src='...' />
<img src="...">
<img src="..." alt="..." />
<img alt="..." src="..." />
<img alt="..." src="..." style="..." />


Instead, you should use an HTML parser, such as DOMDocument::loadHTML.

With that, once your HTML document is loaded as a DOMDocument, you can use XPath queries, or walk through the DOM, to extract the specific information you need.
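
As a rough sketch of the XPath route, assuming a small made-up HTML string (the URLs and variable names are only illustrative), DOMXPath can select the src attributes directly:

```php
<?php
// Hypothetical input: two img tags, one with double quotes and one with single quotes.
$html = '<p>test</p><img src="http://www.example.com/image-1.png" />'
      . "<img alt=\"booh\" src='http://www.example.com/image-2.png' />";

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Select the src attribute node of every img element in the document.
$xpath = new DOMXPath($dom);
$sources = [];
foreach ($xpath->query('//img/@src') as $attr) {
    $sources[] = $attr->nodeValue;
}

print_r($sources);
```

The quoting style of each attribute does not matter here, since the parser has already normalized it by the time the query runs.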


For instance, suppose you have this portion of HTML in a variable:

$html = <<<HTML
<p>test</p>
<img src="http://www.example.com/image-1.png" />
plop glop
<img alt="booh" src="http://www.example.com/image-2.png" />
huhu ?
<img alt="booh again" src='http://www.example.com/image-3.jpg' />
HTML;

You could:

  • Instantiate DOMDocument
  • load the HTML from the variable
  • use the getElementsByTagName method to get all img tags
  • and get the src attribute of each one, with the getAttribute method

Which means code like this:

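// Parse the HTML string into a DOM tree
$dom = new DOMDocument();
$dom->loadHTML($html);
// Collect every img element, wherever it appears in the document
$nodes = $dom->getElementsByTagName('img');
foreach ($nodes as $img) {
  // Dump the value of the src attribute
  var_dump($img->getAttribute('src'));
}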

And the output you get will look like this:

string 'http://www.example.com/image-1.png' (length=34)
string 'http://www.example.com/image-2.png' (length=34)
string 'http://www.example.com/image-3.jpg' (length=34)


Not really hard to write, and it should work much better than regexes when it comes to extracting data from an HTML document!

Pascal MARTIN
A: 

I ended up using http://simplehtmldom.sourceforge.net/. It was quick and easy.

David