ansaurus

Question

Answer 1

A:

Your pattern only matches double quotes. Try:

/(<img).*(src\s*=\s*("|')([a-zA-Z0-9\.;:\/\?&=\-_|\r|\n]{1,})\3)/isxmU

Note that the first " has been replaced by ("|') and that the closing quote is matched with the backreference to that group, \3, so it has to be the same character as the opening quote.
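As a rough sketch of how that pattern could be used with preg_match_all: the sample HTML and the variable names below are made up for illustration, and only the pattern itself comes from this answer. The URL ends up in capture group 4.

```php
<?php
// Pattern from the answer above. The single quote inside ("|') is escaped
// only because the PHP string itself is single-quoted.
$pattern = '/(<img).*(src\s*=\s*("|\')([a-zA-Z0-9\.;:\/\?&=\-_|\r|\n]{1,})\3)/isxmU';

// Hypothetical sample input: one tag per quoting style.
$html = '<img src="http://www.example.com/a.png" />'
      . "<img alt='x' src='http://www.example.com/b.jpg'>";

// Group 3 is the opening quote, group 4 is the URL between the quotes.
preg_match_all($pattern, $html, $matches);
$srcs = $matches[4];

print_r($srcs);
```

Because .* is combined with the s flag, an img tag that has no src attribute would let the match run on into a later tag, which is one reason the parser-based approach is usually safer.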

Have you considered using an HTML parser to do this, instead?

jasonbar 2010-03-11 15:59:36

Answer 2

A:

You really shouldn't use regexes for this kind of thing: HTML is not regular enough for regular expressions.

What if you have, for instance, any of these:

<img src="..." />
<img src='...' />
<img src="...">
<img src="..." alt="..." />
<img alt="..." src="..." />
<img alt="..." src="..." style="..." />

Instead, you should use an HTML parser, such as DOMDocument::loadHTML.

Once your HTML document is loaded as a DOMDocument, you can use XPath queries, or walk the DOM, to extract the specific information you need.

For instance, say you have this portion of HTML in a variable:

$html = <<<HTML
<p>test</p>
<img src="http://www.example.com/image-1.png" />
plop glop
<img alt="booh" src="http://www.example.com/image-2.png" />
huhu ?
<img alt="booh again" src='http://www.example.com/image-3.jpg' />
HTML;

You could:

Instantiate DOMDocument
Load the HTML from the variable
Use the getElementsByTagName method to get all img tags
Get the src attribute of each one with the getAttribute method

Which means code like this:

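// Parse the HTML string into a DOM tree
$dom = new DOMDocument();
$dom->loadHTML($html);
// Collect every <img> element, then read each one's src attribute
$nodes = $dom->getElementsByTagName('img');
foreach ($nodes as $img) {
  var_dump($img->getAttribute('src'));
}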

And the output you get will look like this:

string 'http://www.example.com/image-1.png' (length=34)
string 'http://www.example.com/image-2.png' (length=34)
string 'http://www.example.com/image-3.jpg' (length=34)

Not really hard to write, and it should work much better than regexes when it comes to extracting data from an HTML document!
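The XPath route mentioned earlier in this answer could look something like the sketch below. The function name extractImageSources is hypothetical, and the libxml_use_internal_errors call is an addition of mine to keep parser warnings about imperfect markup from being printed; neither is part of the code above.

```php
<?php
// Hypothetical helper: collect the src of every img element via XPath.
function extractImageSources(string $html): array
{
    $dom = new DOMDocument();
    // Keep parser warnings internal instead of printing them,
    // since real-world HTML is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $sources = [];
    // //img/@src selects the src attribute node of every img element.
    foreach ($xpath->query('//img/@src') as $attr) {
        $sources[] = $attr->nodeValue;
    }
    return $sources;
}

$html = <<<HTML
<p>test</p>
<img src="http://www.example.com/image-1.png" />
<img alt="booh again" src='http://www.example.com/image-3.jpg' />
HTML;

$sources = extractImageSources($html);
print_r($sources);
```

An XPath query is handy when the selection needs a condition, for example //img[@alt]/@src to keep only images that have an alt attribute.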

Pascal MARTIN 2010-03-11 17:06:53

Answer 3

A:

I ended up using http://simplehtmldom.sourceforge.net/. It was quick and easy.

David 2010-03-16 11:05:16

preg_match_all image source