ansaurus

Question

Answer 1

+1 A:

Collect all image URLs into an array, then use array_unique() to remove duplicates.

$my_image_links = array_unique( $my_image_links );
// No more duplicates

If you really want to do this with a regex, assume each image name is delimited by a quote (' or "), whitespace (space, tab, or line break), the start of a line, > or <, or whatever else you can think of. Then you can do:

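// ^ and $ (with the m flag) match the start and end of a line; inside a character class, ^ would only match a literal caret.
$pattern = '/(?:^|[\'" >\t])([^\'" \n\r\t]+\.(?:jpe?g|bmp|gif|png))(?=$|[\'" <\n\r\t])/im';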
preg_match_all($pattern, html_entity_decode($resultFromCurl), $matches);
$imgs = array_unique($matches[1]);

The above will capture the image link in stuff like:

<p>Hai guys look at this ==> http://blah.com/lolcats.JPEG&lt;/p&gt;
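
To see the regex approach end to end, here is a minimal, self-contained sketch. The second input line and the start/end-of-line anchors (^ and $ with the m flag) are assumptions added for illustration, and $resultFromCurl is filled with a hypothetical string rather than a real cURL response.

```php
<?php
// Hypothetical input standing in for a real cURL response. The first line
// is the example above; the second line is an added assumption that
// repeats one image name to show the de-duplication step.
$resultFromCurl = '<p>Hai guys look at this ==> http://blah.com/lolcats.JPEG&lt;/p&gt;'
    . "\n" . '<img src="cat.png"> <img src=\'cat.png\'>';

// Leading delimiter: start of line, quote, space, > or tab.
// Trailing delimiter: checked with a lookahead so it is not consumed.
$pattern = '/(?:^|[\'" >\t])([^\'" \n\r\t]+\.(?:jpe?g|bmp|gif|png))(?=$|[\'" <\n\r\t])/im';

// Decode entities first so that &lt; becomes < and can act as a delimiter.
preg_match_all($pattern, html_entity_decode($resultFromCurl), $matches);

// $matches[1] holds the captured image names; drop duplicates and reindex.
$imgs = array_values(array_unique($matches[1]));
print_r($imgs);
```

Because the closing delimiter is checked with a lookahead rather than consumed, two image names separated by a single space can both match.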


Peter Ajtai 2010-08-19 06:07:00

It would be more logical to use a set structure and just not add the dupes in the first place.

Mark 2010-08-19 06:35:24
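
One way to read the "set" suggestion in PHP, as a rough sketch: use the URL as the array key, so a repeated URL never creates a second entry. The $found list and the $seen name are hypothetical, made up for illustration.

```php
<?php
// Hypothetical list of scraped links, including duplicates.
$found = array('a.jpg', 'b.png', 'a.jpg', 'c.gif', 'b.png');

// PHP arrays are ordered hash maps, so keying on the URL means a repeated
// URL just overwrites its own entry instead of adding a new one.
$seen = array();
foreach ($found as $url) {
    $seen[$url] = true;
}

// The keys are the unique URLs, in first-seen order.
$my_image_links = array_keys($seen);
print_r($my_image_links);
```

One caveat with this approach: PHP casts purely numeric string keys to integers, which should not matter for image URLs but is worth knowing.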

Answer 2

+3 A:

What's wrong with using the DOM? It gives you much better control over the context of the information and a much higher likelihood that the things you pull out are actually URLs.

<?php
$resultFromCurl = '
    <html>
    <body>
    <img src="hello.jpg" />
    <a href="yep.jpg">Yep</a>
    <table background="yep.jpg">
    </table>
    <p>
        Perhaps you should check out foo.jpg! I promise it 
        is safe for work.
    </p>
    </body>
    </html>
';

// These are all the attributes I could think of that
// can contain URLs.
$queries = array(
    '//table/@background',
    '//img/@src',
    '//input/@src',
    '//a/@href',
    '//area/@href',
    '//img/@longdesc',
);

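// loadHTML() is an instance method; calling it statically fails on PHP 8.
// Collect libxml parse warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($resultFromCurl);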
$xpath = new DOMXPath($dom);

$urls = array();
foreach ($queries as $query) {
    foreach ($xpath->query($query) as $link) {
        // Case-insensitive so that .JPG, .PNG, etc. also match.
        if (preg_match('@\.(gif|jpe?g|png)$@i', $link->textContent)) {
            $urls[$link->textContent] = true;
        }
    }
}

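// Also pick up image names mentioned in plain text. $matches[0] holds every
// full match, so iterate over that list rather than over $matches itself.
if (preg_match_all('@\b[^\s]+\.(?:gif|jpe?g|png)\b@i', $dom->textContent, $matches)) {
    foreach ($matches[0] as $m) {
        $urls[$m] = true;
    }
}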

$urls = array_keys($urls);
var_dump($urls);

Shabbyrobe 2010-08-19 06:09:37

What about URLs that appear in text outside of attributes?

Mark Trapp 2010-08-19 06:10:38

I've added that to the answer.

Shabbyrobe 2010-08-19 06:32:06


Scrape unique image URLs from HTML
