I'm using PHP with cURL to fetch a web page (the URL is entered by the user; let's assume it's valid). Example: http://www.youtube.com/watch?v=Hovbx6rvBaA

I need to parse the HTML and extract all de-duplicated URLs that look like images: not just the ones in img src="" but any URL on that page ending in jpe?g|bmp|gif|png, etc. (In other words, I don't want to parse the DOM; I want to use a regex.)

I then plan to curl those URLs to get their width and height and to confirm that they are indeed images, so don't worry about security-related stuff.
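
A minimal sketch of that follow-up check, assuming the image bytes have already been downloaded. The helper name image_info_from_bytes is made up for illustration, and the base64 string is a 1x1 GIF standing in for a cURL response body:

```php
<?php
// Hypothetical helper: returns array(width, height, mime) when $bytes
// is recognised as an image, or null otherwise.
function image_info_from_bytes($bytes)
{
    // getimagesizefromstring() (PHP 5.4+) inspects the image header and
    // returns false when the data is not a recognised image format.
    $info = @getimagesizefromstring($bytes);
    if ($info === false) {
        return null;
    }
    return array($info[0], $info[1], $info['mime']);
}

// A 1x1 GIF, base64-encoded, standing in for a downloaded response body.
$gif = base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');

var_dump(image_info_from_bytes($gif));
var_dump(image_info_from_bytes('<html>not an image</html>'));
```

Checking the bytes rather than the file extension means a URL that merely ends in .jpg but serves HTML is rejected.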

+1  A: 

Collect all image urls into an array, then use array_unique() to remove duplicates.

$my_image_links = array_unique( $my_image_links );
// No more duplicates
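
Note that array_unique() keeps the key of the first occurrence of each value, so the result can have gaps in its numeric keys. A small sketch, using made-up sample URLs, that reindexes the result with array_values():

```php
<?php
// Hypothetical sample data: the same image URL appears twice.
$my_image_links = array(
    'http://example.com/a.jpg',
    'http://example.com/b.png',
    'http://example.com/a.jpg',
    'http://example.com/c.gif',
);

// array_unique() keeps the first occurrence of each value together
// with its original key, so the surviving keys are 0, 1 and 3.
$unique = array_unique($my_image_links);

// array_values() reindexes the array to consecutive keys 0, 1, 2.
$my_image_links = array_values($unique);

print_r($my_image_links);
```

The reindexing only matters if later code loops with a numeric index; foreach works either way.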

If you really want to do this with a regex, we can assume each image URL is delimited by ', ", whitespace (spaces, tabs, line breaks), the beginning of a line, >, <, or whatever else you can think of. So we can do:

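// ^ sits outside the character class so that it anchors at the start of a line (m flag).
// The trailing delimiter is a lookahead, so two adjacent URLs can share one delimiter.
$pattern = '/(?:^|[\'" >\t\n\r])([^\'" \n\r\t<>]+\.(?:jpe?g|bmp|gif|png))(?=[\'" <\n\r\t]|$)/im';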
preg_match_all($pattern, html_entity_decode($resultFromCurl), $matches);
$imgs = array_unique($matches[1]);

The above will capture the image link in stuff like:

<p>Hai guys look at this ==> http://blah.com/lolcats.JPEG&lt;/p&gt;
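
A self-contained sketch of this approach, run against a made-up snippet rather than a real cURL response. The pattern here is a variant that treats ^ as a line anchor outside the character class and uses a lookahead for the trailing delimiter; the sample markup and URLs are assumptions for illustration:

```php
<?php
// Hypothetical page content; the third line ends in an encoded "</p>",
// and the first image URL is repeated on the last line.
$resultFromCurl = '<img src="/img/cat.png"> <a href=\'dog.gif\'>dog</a>'
    . "\n" . 'http://blah.com/lolcats.JPEG&lt;/p&gt;'
    . "\n" . '<img src="/img/cat.png">';

// Leading delimiter: start of line, a quote, whitespace, or ">".
// The trailing delimiter is only asserted (lookahead), not consumed,
// so it can also act as the leading delimiter of the next match.
$pattern = '/(?:^|[\'"\s>])([^\'"\s<>]+\.(?:jpe?g|bmp|gif|png))(?=[\'"\s<]|$)/im';

// Decode entities first so that "&lt;" becomes a "<" delimiter.
preg_match_all($pattern, html_entity_decode($resultFromCurl), $matches);

// Group 1 holds the URLs; drop duplicates and reindex.
$imgs = array_values(array_unique($matches[1]));

print_r($imgs);
```

Because the extension must be followed directly by a delimiter, a URL with a query string such as cat.jpg?size=2 is not matched by this sketch.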

Peter Ajtai
It would be more logical to use a set structure and just not add the dupes.
Mark
+3  A: 

What's wrong with using the DOM? It gives you much better control over the context of the information and a much higher likelihood that the things you pull out are actually URLs.

<?php
$resultFromCurl = '
    <html>
    <body>
    <img src="hello.jpg" />
    <a href="yep.jpg">Yep</a>
    <table background="yep.jpg">
    </table>
    <p>
        Perhaps you should check out foo.jpg! I promise it 
        is safe for work.
    </p>
    </body>
    </html>
';

// These are all the attributes I could think of that
// can contain URLs.
$queries = array(
    '//table/@background',
    '//img/@src',
    '//input/@src',
    '//a/@href',
    '//area/@href',
    '//img/@longdesc',
);

// loadHTML() is an instance method; @ silences warnings about malformed markup.
$dom = new DOMDocument();
@$dom->loadHTML($resultFromCurl);
$xpath = new DOMXPath($dom);

$urls = array();
foreach ($queries as $query) {
    foreach ($xpath->query($query) as $link) {
        // Case-insensitive so that .JPG and .PNG are caught too.
        if (preg_match('@\.(gif|jpe?g|png|bmp)$@i', $link->textContent)) {
            $urls[$link->textContent] = true;
        }
    }
}

// $matches[0] holds every full match found in the document's text.
if (preg_match_all('@\b[^\s]+\.(?:gif|jpe?g|png|bmp)\b@i', $dom->textContent, $matches)) {
    foreach ($matches[0] as $m) {
        $urls[$m] = true;
    }
}

$urls = array_keys($urls);
var_dump($urls);
Shabbyrobe
What about URLs that appear in text outside of attributes?
Mark Trapp
I've added that to the answer.
Shabbyrobe