Let's say you're given http://nytimes.com. How would you pull out the "main" image?
I'm asking because Flipboard is able to grab the main image from a website using just the URL.
You could parse out all the image tags. But then what?
Facebook allows the user to pick one of several images that it has deemed to be a "main" image. As for automatically determining a "main" image, I would judge it based on page position, size, relation to text, and (if you wanted to be more sophisticated) its visual content.
For example, you could use a simple face detection program, or look at color breakdowns to determine if the picture was "interesting" to you or not.
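To make the position/size/relation-to-text idea concrete, here is a minimal sketch of a scoring function. Everything in it is an assumption for illustration: the `Candidate` fields, the `inside_article` flag, and the weights (the banner penalty, the bonus for sitting in the article body) are arbitrary and would need tuning on real pages. It does not attempt the visual-content part (face detection, color analysis).

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one image found on the page."""
    src: str
    width: int
    height: int
    order: int            # 0 for the first image in document order
    inside_article: bool  # True if the image sits next to the main text


def score(c):
    """Combine size, position, and relation to text into one number."""
    area = c.width * c.height
    if area == 0:
        return 0.0
    s = float(area)
    # Very wide or very tall images are usually banners or spacers.
    aspect = max(c.width, c.height) / min(c.width, c.height)
    if aspect > 3:
        s *= 0.1
    # Favor images that appear earlier in the document.
    s /= 1 + c.order
    # Favor images that sit alongside the main text.
    if c.inside_article:
        s *= 2
    return s


def pick_main(candidates):
    """Return the highest-scoring candidate, or None if there are none."""
    return max(candidates, key=score, default=None)
```

With these made-up weights, a 600x400 image in the article body beats both a 728x90 banner at the top of the page and a 100x100 thumbnail further down.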
EDIT: In the case of www.nytimes.com, I would probably just look at the page structure, because a large carousel of images is located right underneath an H1 tag.
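As a rough sketch of the page-structure approach, the snippet below returns the first image that appears after the first H1 closes, using only Python's standard `html.parser`. The class and function names are made up for this example, and it assumes the page layout matches the description above (images following an H1); it makes no claim about the site's actual markup.

```python
from html.parser import HTMLParser


class FirstImageAfterH1(HTMLParser):
    """Remembers the src of the first <img> seen after an </h1>."""

    def __init__(self):
        super().__init__()
        self.h1_closed = False
        self.src = None

    def handle_endtag(self, tag):
        if tag == "h1":
            self.h1_closed = True

    def handle_starttag(self, tag, attrs):
        # Ignore images before the heading (logos, ads) and stop
        # updating once one image with a src has been found.
        if tag == "img" and self.h1_closed and self.src is None:
            self.src = dict(attrs).get("src")


def first_image_after_h1(html):
    parser = FirstImageAfterH1()
    parser.feed(html)
    parser.close()
    return parser.src
```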
Nothing in HTML or otherwise designates a "main" image on a web page. On top of that, you'd probably also have to read the images referenced from CSS (background images and so on). But if I had to do this, here is what I would do:
Once I had checked a predefined number of images (for argument's sake let's say 10, though you'd probably go much higher), I'd use the largest image I had found so far (stored off to the side), because I wouldn't want to scan the page indefinitely looking for images!
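A minimal sketch of that "largest of the first N images" idea follows, again with the standard library only. It assumes each `<img>` declares numeric `width` and `height` attributes; many real pages don't, in which case you would have to download each image and measure it. The names `LargestImageFinder` and `largest_image` are invented for the example.

```python
from html.parser import HTMLParser


class LargestImageFinder(HTMLParser):
    """Keeps the largest <img> (declared width * height) among the first max_images."""

    def __init__(self, max_images=10):
        super().__init__()
        self.max_images = max_images
        self.seen = 0
        self.best_src = None   # the image "stored off to the side"
        self.best_area = -1

    def handle_starttag(self, tag, attrs):
        # Stop considering images once the cap is reached.
        if tag != "img" or self.seen >= self.max_images:
            return
        self.seen += 1
        attr = dict(attrs)
        try:
            area = int(attr.get("width", 0)) * int(attr.get("height", 0))
        except (TypeError, ValueError):
            area = 0  # missing or non-numeric dimensions such as "100%"
        if attr.get("src") and area > self.best_area:
            self.best_area = area
            self.best_src = attr["src"]


def largest_image(html, max_images=10):
    parser = LargestImageFinder(max_images)
    parser.feed(html)
    parser.close()
    return parser.best_src
```

Note that the parser still reads the rest of the document after the cap; it just ignores later images. For very large pages you could feed the HTML in chunks and stop feeding once `seen` reaches the limit.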