I'm trying to build something akin to Facebook's "Share" functionality for my website.

I've gotten to the point where I can accept a URL, scrape it for meta keywords and suitably get titles/descriptions, but I'm a bit stuck as to the best way to determine 'likely' photos the user may want to share.

I currently use SimpleXMLElement to turn the page into a traversable DOM, find all the <img> tags, and turn their sources into absolute URLs. After that, I'm not sure how to go about finding a suitable thumbnail.

Do I download them all and go by file size? Do I use some sort of heuristic like, "the image was encountered in the middle of the page"?

Does anyone else have any recommendations, suggestions, or tips?

A: 

I don't have any direct experience with this, so I'm not sure there is a specific best practice, but a heuristic approach that weighs several factors makes sense given the variability in how websites are implemented.

I would look at two sets of items: image properties and the context of where/how the images are placed.

Image Properties:

  • Width and height meet minimum thresholds
  • Aspect ratio is reasonable (tiling background images often have extreme aspect ratios, which is a good indication that the image is not suitable)
  • More than one color exists in the image (harder to detect, but may help avoid solid-color background images)

Image Context:

  • Image does not repeat on page (this avoids using icons and other design elements that may repeat)
  • Occurs after h1, h2, etc. tags on the page; this gets at your point about images coming from the middle of the page, again avoiding design elements
  • Has an alt attribute (though alt is not used consistently, so it may not provide much useful information)

I would assign weights to the items above and then rank the images you find according to how well each image satisfies the rules.

Also, note that some pages may display images via CSS (or Flash, etc.). Those fall outside the scope of the algorithm you defined; perhaps not a big deal, but something to consider.
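As a sketch of the image-property checks above, here is a hypothetical Python filter over (width, height) pairs; the threshold values are assumptions for illustration, not numbers from the answer:

```python
def is_candidate(width, height, min_side=100, max_aspect=3.0):
    """Return True if an image's dimensions pass the property checks:
    both sides meet a minimum threshold, and the aspect ratio is not
    extreme (tiling background strips often have very skewed ratios)."""
    if width < min_side or height < min_side:
        return False
    aspect = max(width, height) / min(width, height)
    return aspect <= max_aspect

# A 600x400 photo passes; a 1000x10 background strip and a 50x50 icon do not.
candidates = [(600, 400), (1000, 10), (50, 50)]
passing = [dims for dims in candidates if is_candidate(*dims)]
```

The same predicate could be one term in a weighted score rather than a hard filter, depending on how aggressive you want the pruning to be.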

mcliedtk
+3  A: 

I wrote something similar a while ago to get images from scraped blog posts. My approach to choosing an image was roughly to get a list of all images on the page and then assign 'priority points':

  • Ignore images hosted on domains from a blacklist (taken from an ad blocker's filter list)
  • Ignore indirect images, e.g. those referenced from stylesheets or inside an IFRAME
  • Ignore images under 50 pixels wide or high
  • Ignore images which appear more than once on the page
  • Assign priority points to images hosted on a whitelist of image hosts (e.g. Photobucket, imageshack.us)
  • Assign priority points to the largest 3 images on the page
  • Assign priority points to images on the same host as the page
  • Assign priority points to images with an ALT attribute defined
  • Assign priority points to images appearing inside a P tag

Then pick the one with the most priority points. It certainly wasn't foolproof or particularly scientific, but it got something useful far more often than not.
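A minimal Python sketch of this priority-point scheme (the specific point values, dict keys, and sample data are assumptions for illustration, not the original code, and only a subset of the rules is shown):

```python
def score_image(img, whitelist=("photobucket.com", "imageshack.us")):
    """Assign priority points to a candidate image dict.
    The weights here are hypothetical placeholders."""
    points = 0
    if img.get("host") in whitelist:
        points += 2          # hosted on a known image host
    if img.get("same_host"):
        points += 1          # served from the page's own host
    if img.get("alt"):
        points += 1          # has an ALT attribute
    if img.get("in_paragraph"):
        points += 1          # appears inside a <p> tag
    return points

def pick_best(images):
    """Drop ignored images (too small, repeated), then pick the top scorer."""
    eligible = [
        img for img in images
        if img["width"] >= 50 and img["height"] >= 50
        and img.get("repeats", 1) == 1
    ]
    return max(eligible, key=score_image, default=None)

images = [
    {"src": "icon.png", "width": 16, "height": 16, "alt": "logo"},
    {"src": "photo.jpg", "width": 640, "height": 480,
     "host": "photobucket.com", "alt": "holiday photo", "in_paragraph": True},
]
best = pick_best(images)   # the large, whitelisted photo wins
```

Populating the dicts (dimensions, repeat counts, surrounding tags) is the scraping side of the problem; the scoring itself stays this simple.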

FerretallicA
This is where I think I was headed, but I definitely appreciate the in-depth list. I'll give it a try and see if I can add my own additions to it.
Eddie Parker
I found the code for the aforementioned project. Most of the filtering above is covered, but I left out ignoring images which appear in <li> and <h1>-<h6> tags. At one point I was also weighting images by where they appeared on the page, but that's commented out with "too skewed", so I'm assuming it was disabled for a good reason, even though I don't remember doing it...
FerretallicA
The heuristic stuff seems to work pretty well, I've found. The only issue I'm having currently is that treating every web page as a valid XML document (using SimpleXMLElement) seems to fail rather regularly. Did you implement yours using regexes, then? Or is there some better parser I could use for this?
Eddie Parker
To answer my own query, I found DOMDocument::loadHTMLFile [1] works wonders. [1] http://ca3.php.net/manual/en/domdocument.loadhtmlfile.php
Eddie Parker
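For reference, the lenient-parsing idea behind DOMDocument::loadHTMLFile applies outside PHP too. As a sketch, Python's standard-library HTMLParser tolerates the kind of malformed markup that makes a strict XML parser like SimpleXMLElement choke (the markup sample here is invented for illustration):

```python
from html.parser import HTMLParser

class ImgCollector(HTMLParser):
    """Collect src attributes of <img> tags, tolerating broken HTML
    (unclosed tags, unquoted attributes) that would fail XML parsing."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

# Deliberately malformed: unclosed <p>, unquoted attributes, truncated </body.
html = '<body><p>hello<img src="a.jpg"><img src=b.png alt=pic></body'
collector = ImgCollector()
collector.feed(html)
collector.sources   # ['a.jpg', 'b.png']
```

A forgiving HTML parser like this (or loadHTMLFile in PHP) is generally the safer choice over regexes for pulling tags out of real-world pages.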