You really shouldn't use regex for this kind of thing: HTML is not regular enough for regular expressions.
What if you have, for instance, one of these:
<img src="..." />
<img src='...' />
<img src="...">
<img src="..." alt="..." />
<img alt="..." src="..." />
<img alt="..." src="..." style="..." />
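To make the problem concrete, here is a minimal sketch, assuming a hypothetical naive pattern of the kind one might write first. The pattern and the `a.png` / `x` / `y` placeholder values are illustrative only; they are not taken from the question.

```php
<?php
// Hypothetical naive pattern: expects src to be the first attribute,
// double-quoted, and immediately followed by a self-closing " />".
$pattern = '~<img src="([^"]*)" />~';

// One tag per variant listed above, with placeholder attribute values.
$variants = [
    '<img src="a.png" />',
    "<img src='a.png' />",
    '<img src="a.png">',
    '<img src="a.png" alt="x" />',
    '<img alt="x" src="a.png" />',
    '<img alt="x" src="a.png" style="y" />',
];

$matched = 0;
foreach ($variants as $tag) {
    $hit = preg_match($pattern, $tag) === 1;
    $matched += $hit ? 1 : 0;
    printf("%-40s %s\n", $tag, $hit ? 'matched' : 'missed');
}
```

Only the first variant matches this pattern. You can keep patching the regex for quotes, attribute order, and optional slashes, but each patch makes it harder to read and there is always another variant.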
Instead, you should use an HTML parser, such as DOMDocument::loadHTML.
With that, once your HTML document is loaded as a DOMDocument, you can use XPath queries, or walk through the DOM, to extract the specific information you need.
For instance, suppose you have this portion of HTML in a variable:
$html = <<<HTML
<p>test</p>
<img src="http://www.example.com/image-1.png" />
plop glop
<img alt="booh" src="http://www.example.com/image-2.png" />
huhu ?
<img alt="booh again" src='http://www.example.com/image-3.jpg' />
HTML;
You could:
- instantiate DOMDocument
- load the HTML from the variable
- use the getElementsByTagName method to get all img tags
- and get the src attribute of each one, with the getAttribute method
Which means code like this:
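$dom = new DOMDocument();
$dom->loadHTML($html);                       // parse the HTML string into a DOM tree
$nodes = $dom->getElementsByTagName('img');  // every <img> element in the document
foreach ($nodes as $img) {
    var_dump($img->getAttribute('src'));     // value of the src attribute
}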
And the output you get will look like this:
string 'http://www.example.com/image-1.png' (length=34)
string 'http://www.example.com/image-2.png' (length=34)
string 'http://www.example.com/image-3.jpg' (length=34)
Not really hard to write, and it should work much better than regexes when it comes to extracting data from an HTML document!
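The XPath route mentioned earlier can be sketched in much the same way. This is a minimal example, not the only way to do it: it reuses the sample HTML from above, and the `libxml_use_internal_errors` call is my own assumption, added because real-world HTML is often malformed enough to make `loadHTML` emit warnings.

```php
<?php
// Same sample HTML as in the answer above.
$html = <<<HTML
<p>test</p>
<img src="http://www.example.com/image-1.png" />
plop glop
<img alt="booh" src="http://www.example.com/image-2.png" />
huhu ?
<img alt="booh again" src='http://www.example.com/image-3.jpg' />
HTML;

// Assumption: collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select the src attribute node of every img element that has one.
$sources = [];
foreach ($xpath->query('//img/@src') as $attr) {
    $sources[] = $attr->nodeValue;
}

print_r($sources);
```

The query returns the attribute nodes directly, so there is no need to call getAttribute on each element. XPath becomes more useful once the selection gets more specific, for instance only the images inside a given container.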