How do you find the "main" picture of a website, given the URL?

Let's say you're given http://nytimes.com. How would you pull out the "main" image?

I'm asking because Flipboard is able to grab the main image from a website using just the URL.

You could parse out all the image tags. But then what?
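As a starting point, here is a minimal sketch of that first step using only Python's standard library. The names `ImageTagCollector` and `collect_images` are made up for illustration, and the sketch assumes the HTML has already been downloaded. It only sees `<img>` tags and their declared `width`/`height` attributes, so images without those attributes come back with `None` sizes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


def _to_int(value):
    """Return the attribute as an int, or None if missing or not numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


class ImageTagCollector(HTMLParser):
    """Collects the URL and declared size of every <img> tag (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        if not src:
            return  # skip <img> tags that have no usable source
        self.images.append({
            # Resolve relative paths against the page URL.
            "url": urljoin(self.base_url, src),
            "width": _to_int(attr.get("width")),
            "height": _to_int(attr.get("height")),
        })


def collect_images(html, base_url):
    """Return a list of dicts, one per <img> tag, in page order."""
    parser = ImageTagCollector(base_url)
    parser.feed(html)
    return parser.images
```

For example, `collect_images(page_html, "http://nytimes.com")` would return the candidates in the order they appear, which is the input the "but then what?" part needs.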

Facebook allows the user to pick one of several images that it has deemed to be a "main" image. As for automatically determining a "main" image, I would judge candidates by page position, size, relation to the surrounding text, and (if you wanted to be more sophisticated) visual content.

For example, you could use a simple face detection program, or look at color breakdowns to determine if the picture was "interesting" to you or not.

EDIT: In the case of www.nytimes.com, I would probably just look at the page structure, because a large carousel of images is located right underneath an H1 tag.
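To make the size and position part of this concrete, here is a rough sketch of a scoring function. The weights are arbitrary assumptions chosen for illustration, not tuned values, and `score_image` and `rank_images` are hypothetical names. Relation to text and visual content (face detection, color analysis) are not covered.

```python
def score_image(width, height, index, total):
    """Heuristic score; higher means more likely to be the main image.

    index is the image's 0-based position in page order, total the image count.
    """
    if width <= 0 or height <= 0 or total <= 0:
        return 0.0
    area = width * height
    # Aspect ratio >= 1; banner-like shapes have a large ratio.
    ratio = max(width, height) / min(width, height)
    # Assumed rule: no penalty up to 2:1, proportional penalty beyond that.
    shape = 1.0 if ratio <= 2.0 else 2.0 / ratio
    # Assumed rule: images earlier in the page get slightly more weight.
    position = 1.0 - 0.5 * (index / total)
    return area * shape * position


def rank_images(images):
    """images: list of (url, width, height) in page order. Returns URLs, best first."""
    total = len(images)
    scored = [
        (score_image(w, h, i, total), url)
        for i, (url, w, h) in enumerate(images)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored]
```

With candidates such as a 120x40 logo, a 600x400 photo, a 728x90 banner, and a 150x150 thumbnail, the photo ranks first because it is both large and reasonably proportioned.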

Tim 2010-10-30 03:26:29

There really isn't anything that is considered the "main" image in a web page; nothing in HTML or elsewhere distinguishes one. You'd probably also have to look at the images referenced from CSS (background images and so on). But if I had to do this, here is what I would do:

1. First I would decide on a suitable minimum image size, let's say 400x400. (I don't want to pick just any old image; something really small would likely scale horribly.)
2. I would then iterate through each image on the page.
3. For each image I encountered, I would check its size. If it was 400x400 (my predefined size) or larger, I would use this image. If it wasn't, I would check whether it's the largest image I've found so far and, if so, keep its information stored off to the side.
4. Once I had checked a predefined number of images (for argument's sake let's say 10, though you'd probably go much higher), I'd use the largest image I've found (stored off to the side), because I wouldn't want to scan the page indefinitely looking for images!
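The steps above can be sketched roughly as follows. This assumes the image sizes are already known and passed in as `(url, width, height)` tuples in page order; the function name `pick_main_image` and its parameter names are invented for the example, and "largest" is taken to mean largest pixel area.

```python
def pick_main_image(images, min_width=400, min_height=400, max_checked=10):
    """Return the URL of the chosen image, or None if there are no images.

    images: iterable of (url, width, height) tuples in page order.
    """
    largest_url = None
    largest_area = -1
    for checked, (url, width, height) in enumerate(images, start=1):
        # Use the first image that meets the minimum size.
        if width >= min_width and height >= min_height:
            return url
        # Otherwise remember the largest image seen so far.
        area = width * height
        if area > largest_area:
            largest_url, largest_area = url, area
        # Stop after a fixed number of images and fall back to the largest.
        if checked >= max_checked:
            break
    return largest_url
```

Calling `pick_main_image(candidates, max_checked=50)` would scan further down the page before giving up and settling for the largest undersized image.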

pinkfloydx33 2010-10-30 03:28:09

If you just look for the biggest image, you're likely to end up with a big ad, like a leaderboard (728x90) or skyscraper (120x600)

kijin 2010-10-30 04:24:40

That's very true. So you could restrict the maximum size as well.

pinkfloydx33 2010-10-30 04:37:23

Most ads are loaded through iframes so they wouldn't be part of the page. But if that's a concern you could just ignore any IAB standard sizes.
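A small sketch of that size filter, assuming candidates are `(url, width, height)` tuples. The `AD_SIZES` set is an illustrative, non-exhaustive subset of common IAB ad dimensions (it includes the leaderboard and skyscraper sizes mentioned above), and `drop_ad_sized` is a hypothetical helper name.

```python
# Illustrative subset of standard ad dimensions (width, height); not a complete list.
AD_SIZES = {
    (728, 90),   # leaderboard
    (120, 600),  # skyscraper
    (160, 600),  # wide skyscraper
    (300, 250),  # medium rectangle
    (468, 60),   # full banner
}


def drop_ad_sized(images, ad_sizes=AD_SIZES):
    """Return the candidates whose exact dimensions are not a known ad size."""
    return [
        (url, width, height)
        for url, width, height in images
        if (width, height) not in ad_sizes
    ]
```

This only catches exact matches, so an ad served at a non-standard size would still get through.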

gabrielk 2010-10-31 06:43:43

Do you mean the logo?

Krishna 2010-10-31 06:11:11
