ansaurus

Question

Answer 1

A:

You can use a regular expression (regex) to go through the page source and parse all the IMG tags.

This regex will do the job quite nicely: <img[^>]+src="(.*?)"

How does this work?

// <img[^>]+src="(.*?)"
// 
// Match the characters "<img" literally «<img»
// Match any character that is not a ">" «[^>]+»
//    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the characters "src="" literally «src="»
// Match the regular expression below and capture its match into backreference number 1 «(.*?)»
//    Match any single character that is not a line break character «.*?»
//       Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the character """ literally «"»

Sample PHP code:

preg_match_all('/<img[^>]+src="(.*?)"/i', $subject, $result, PREG_PATTERN_ORDER);
// $result[0] holds the full matches; $result[1] holds capture group 1 (the src values).
for ($i = 0; $i < count($result[1]); $i++) {
    // image URL is in $result[1][$i];
}
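
The pattern above only matches double-quoted src values. As a rough sketch (the sample markup and variable names here are made up for illustration), the opening quote can be captured and required again at the end with a backreference, so single-quoted attributes are picked up too:

```php
<?php
// Hypothetical sample input; a real page would come from file_get_contents().
$subject = "<img src='a.jpg'><img alt=\"x\" src=\"b.png\">";

// Group 1 captures the opening quote (either kind), \1 requires the same
// quote to close the value, and group 2 captures the URL itself.
preg_match_all('~<img[^>]+src=(["\'])(.*?)\1~i', $subject, $matches);

// With the default PREG_PATTERN_ORDER, $matches[2] lists every captured URL.
$urls = $matches[2];
foreach ($urls as $url) {
    echo $url, "\n";
}
```

This still misses unquoted attribute values and other irregular markup, which is where a real parser tends to be more reliable.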

You'll have to do a bit more work to resolve things like relative URLs.
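
A minimal sketch of that extra work, assuming the page URL is known. The function name resolve_url is hypothetical, and this is a simplification rather than a full RFC 3986 resolver: it collapses empty path segments and does not treat "?" or "#" specially.

```php
<?php
// Hypothetical helper: resolve $url (as found in a src attribute) against
// $base (the URL of the page it was found on).
function resolve_url($base, $url) {
    // Already absolute: starts with a scheme such as http: or https:.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $url)) {
        return $url;
    }
    $parts  = parse_url($base);
    $scheme = $parts['scheme'];
    $origin = $scheme . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative, e.g. //cdn.example.com/a.png
    if (substr($url, 0, 2) === '//') {
        return $scheme . ':' . $url;
    }
    if (substr($url, 0, 1) === '/') {
        // Root-relative, e.g. /img/a.png
        $path = $url;
    } else {
        // Path-relative: resolve against the directory of the base path.
        $basePath = isset($parts['path']) ? $parts['path'] : '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $url;
    }
    // Collapse "." and ".." segments.
    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }
    return $origin . '/' . implode('/', $out);
}

echo resolve_url('http://example.com/gallery/images.php', '../img/a.jpg'), "\n";
```

Each URL pulled out by the regex could be passed through a helper like this before it is downloaded or stored.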

Ben Herila 2010-07-16 04:04:08

What if the src is surrounded with single quotes, for instance? Due to the possibility of inconsistencies that would be a pain to account for, generally, using an XML parser is a better solution.

Alex JL 2010-07-16 04:10:22

Thanks for the help. Do I need to use a regex if I know that there is only one possible way the image is being displayed? E.g., all images follow the format: <div title="" largeUrl="http://a.com/image.jpg">

Jeremy 2010-07-16 04:24:06

Answer 2

+1 A:

I would recommend using a DOM parser, such as PHP's very own DOMDocument. Example:

$page = file_get_contents('http://example.com/images.php');
$doc = new DOMDocument(); 
$doc->loadHTML($page);
$images = $doc->getElementsByTagName('img'); 
foreach($images as $image) {
    echo $image->getAttribute('src') . '<br />';
}
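
The same approach can be sketched against an inline string, which also shows two cases raised in the comments on the regex answer: single-quoted src values and URLs stored in a largeUrl attribute on a div. The sample markup is made up, and the case-insensitive attribute match is there because the HTML parser may normalize attribute-name case; treat that as an assumption.

```php
<?php
// Hypothetical sample markup; a real page would come from file_get_contents().
$page = <<<'HTML'
<html><body>
<img src='a.jpg'>
<img src="http://example.com/b.png">
<div title="" largeUrl="http://a.com/image.jpg"></div>
</body></html>
HTML;

// Collect parser warnings internally instead of printing them for messy HTML.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($page);
libxml_clear_errors();

// The quote style does not matter here; the parser handles both.
$srcs = array();
foreach ($doc->getElementsByTagName('img') as $image) {
    $srcs[] = $image->getAttribute('src');
}

// Select the largeUrl attribute on any div, ignoring the case of its name.
$xpath = new DOMXPath($doc);
$query = "//div/@*[translate(name(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
       . "'abcdefghijklmnopqrstuvwxyz') = 'largeurl']";
$largeUrls = array();
foreach ($xpath->query($query) as $attr) {
    $largeUrls[] = $attr->value;
}

echo implode("\n", array_merge($srcs, $largeUrls)), "\n";
```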

karim79 2010-07-16 04:05:52

I tried this but get this error about 15 times down the page: Notice: Undefined property: DOMElement::$src in C:\Users\User\Desktop\PortableWebAp4.0\PortableWebAp4.0.pro\Program\www\localhost\test.php on line 12. Do I need to include the DOMDocument or anything? Sorry, I am new to PHP.

Jeremy 2010-07-16 04:22:17

@Jeremy - Initially, my code had a mistake in it, which I've since corrected. I changed `$image->src` to `$image->getAttribute('src')`, which is the correct way to get an attribute. I should have commented when I corrected it, sorry for that.

karim79 2010-07-16 04:32:20

Thanks so much it works perfectly! I was looking at some other code today that was 5x longer than that and this works brilliantly! Cheers!

Jeremy 2010-07-16 04:38:29

Answer 3

A:

I really like PHP Simple HTML DOM Parser for things like this. An example of grabbing images is right there on the front page:

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';

Calvin L 2010-07-16 04:12:35

Answer 4

A:

You can use this to scrape pages.

http://simplehtmldom.sourceforge.net/

but it requires PHP 5+.

Chetan sharma 2010-07-16 04:24:10

Screen Scraping of Image Links in PHP
