I have a website with many product pages, and each page contains a number of images in the same format across all pages. I want to screen-scrape each page so I can retrieve the URL of every image on it. The idea is to build a gallery for each page made up of hotlinked images.

I know this can be done in PHP, but I am not sure how to scrape a page for multiple links. Any ideas?

A: 

You can use a regular expression (regex) to go through the page source and parse all the IMG tags.

This regex will do the job quite nicely: <img[^>]+src="(.*?)"

How does this work?

// <img[^>]+src="(.*?)"
// 
// Match the characters "<img" literally «<img»
// Match any character that is not a ">" «[^>]+»
//    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the characters "src="" literally «src="»
// Match the regular expression below and capture its match into backreference number 1 «(.*?)»
//    Match any single character that is not a line break character «.*?»
//       Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the character """ literally «"»

Sample PHP code:

preg_match_all('/<img[^>]+src="(.*?)"/i', $subject, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[1]); $i++) {
    // $result[0][$i] is the whole match; the image URL (capture group 1) is in $result[1][$i]
}
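Note that the pattern above only matches src values wrapped in double quotes. As a rough sketch (the sample markup and the $urls variable are made up for illustration), a variant that accepts either quote style could look like this:

```php
<?php
// Hypothetical sample markup; in practice $subject would hold the fetched page source.
$subject = '<img class="a" src="one.jpg"> <IMG src=\'two.png\' alt="x"> <img alt="no source">';

// (["\']) captures the opening quote; \1 requires the same quote character to close the value.
preg_match_all('/<img[^>]+src\s*=\s*(["\'])(.*?)\1/i', $subject, $result, PREG_PATTERN_ORDER);

// Group 1 holds only the quote characters; group 2 holds the URLs.
$urls = $result[2];
print_r($urls);
```

Unquoted attribute values and other irregular markup would still slip through, which is the usual argument for a real parser over a regex.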

You'll have to do a bit more work to resolve things like relative URLs.
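As an illustration of that extra work, here is a minimal sketch of a resolver. The function name resolveUrl and the example URLs are assumptions, not part of any library, and the logic is a simplification rather than a full RFC 3986 implementation:

```php
<?php
// Resolve $relative against the URL of the page it was found on ($base).
// Hypothetical helper: a simplified sketch, not a complete resolver.
function resolveUrl($base, $relative)
{
    // Already absolute (starts with a scheme such as http: or https:).
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $relative)) {
        return $relative;
    }

    $parts  = parse_url($base);
    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
    $host   = isset($parts['host']) ? $parts['host'] : '';
    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';
    $origin = $scheme . '://' . $host . $port;

    // Protocol-relative, e.g. //cdn.example.com/a.jpg
    if (substr($relative, 0, 2) === '//') {
        return $scheme . ':' . $relative;
    }

    if (substr($relative, 0, 1) === '/') {
        // Root-relative: replaces the whole path of the base URL.
        $path = $relative;
    } else {
        // Path-relative: keep the directory of the base page, drop its file name.
        $basePath = isset($parts['path']) ? $parts['path'] : '/';
        $dir      = substr($basePath, 0, strrpos($basePath, '/') + 1);
        $path     = $dir . $relative;
    }

    // Collapse "." and ".." segments (empty segments are dropped as well).
    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }

    return $origin . '/' . implode('/', $out);
}

echo resolveUrl('http://example.com/shop/page.php', 'img/a.jpg') . "\n";
echo resolveUrl('http://example.com/shop/page.php', '../img/a.jpg') . "\n";
```

The two calls print http://example.com/shop/img/a.jpg and http://example.com/img/a.jpg respectively. A base element in the page, if present, would change the base URL and is not handled here.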

Ben Herila
What if the src is surrounded with single quotes, for instance? Due to the possibility of inconsistencies that would be a pain to account for, generally, using an XML parser is a better solution.
Alex JL
Thanks for the help. Do I need to use a regex if I know that there is only one possible way the image is being displayed? E.g. all images follow the format: <div title="" largeUrl="http://a.com/image.jpg">
Jeremy
+1  A: 

I would recommend using a DOM parser, such as PHP's very own DOMDocument. Example:

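// Fetch the page source
$page = file_get_contents('http://example.com/images.php');
// Keep warnings about malformed real-world HTML from being printed
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($page);
// Collect every <img> element and print its src attribute
$images = $doc->getElementsByTagName('img');
foreach ($images as $image) {
    echo $image->getAttribute('src') . '<br />';
}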
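The same approach should work when the URL sits in a custom attribute rather than in an img tag, such as the largeUrl attribute on div elements described in the comments on this question. A sketch, assuming that markup (the sample fragment and the $urls variable are made up for illustration):

```php
<?php
// Hypothetical page fragment; in practice use file_get_contents() on the real page URL.
$page = '<html><body>'
      . '<div title="" largeUrl="http://a.com/image.jpg"></div>'
      . '<div title="" largeUrl="http://a.com/other.jpg"></div>'
      . '<div title="no image here"></div>'
      . '</body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($page);
libxml_clear_errors();

$urls = array();
foreach ($doc->getElementsByTagName('div') as $div) {
    // Compare names case-insensitively, since the HTML parser may lowercase attribute names.
    foreach ($div->attributes as $attr) {
        if (strcasecmp($attr->nodeName, 'largeUrl') === 0) {
            $urls[] = $attr->nodeValue;
        }
    }
}

// Build simple gallery markup from the collected URLs.
foreach ($urls as $url) {
    echo '<img src="' . htmlspecialchars($url) . '" alt="" />' . "\n";
}
```

Looping over the attributes instead of calling getAttribute('largeUrl') directly is a defensive choice, so the lookup does not depend on how the parser stores the attribute's case.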
karim79
I tried this but get this error about 15 times down the page: Notice: Undefined property: DOMElement::$src in C:\Users\User\Desktop\PortableWebAp4.0\PortableWebAp4.0.pro\Program\www\localhost\test.php on line 12. Do I need to include the DOMDocument or anything? Sorry, I am new to PHP.
Jeremy
@Jeremy - Initially, my code had a mistake in it, which I've since corrected. I changed `$image->src` to `$image->getAttribute('src')`, which is the correct way to get an attribute. I should have commented when I corrected it, sorry for that.
karim79
Thanks so much it works perfectly! I was looking at some other code today that was 5x longer than that and this works brilliantly! Cheers!
Jeremy
A: 

I really like PHP Simple HTML DOM Parser for things like this. An example of grabbing images is right there on the front page:

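// Load the library (adjust the path to wherever simple_html_dom.php is saved)
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';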
Calvin L
A: 

You can use this to scrape pages:

http://simplehtmldom.sourceforge.net/

but it requires PHP 5+.

Chetan sharma