ansaurus

Question

How would you pick the best image from a webpage in a crawler?

Answer 1

A:

This is best-guess stuff, but:

Ignoring anything hosted on another domain will eliminate most ads.
Once you've grabbed the images, you can get their sizes; the biggest is probably the one to use.
Images that are inside an <a> element pointing to the root of the domain are probably logos. Example: the SO logo on this page is inside <a href="/"></a>.

Edited to add:

It's true that large sites use auxiliary servers for their images. But you could probably make up a couple of simple parsing rules that will get 80% of cases, picking out g-ecx.images-amazon.com and static.ak.fbcdn.net as non-ad servers.
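A rough sketch of these rules, using only the Python standard library. The class and function names are made up for illustration, and the allowlist holds just the two hosts mentioned above; treat it as a starting point, not a tested filter.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hosts taken from the text above; a real crawler would grow this list over time.
TRUSTED_IMAGE_HOSTS = {"g-ecx.images-amazon.com", "static.ak.fbcdn.net"}


class ImageCollector(HTMLParser):
    """Collects absolute image URLs, skipping images wrapped in a link to the site root."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).netloc
        self.open_links = []  # one bool per open <a>: does it point to the site root?
        self.images = []

    def _is_root_link(self, href):
        if not href:
            return False
        target = urlparse(urljoin(self.page_url, href))
        return target.netloc == self.page_host and target.path in ("", "/")

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.open_links.append(self._is_root_link(attrs.get("href")))
        elif tag == "img" and attrs.get("src") and not any(self.open_links):
            # Images inside <a href="/"> are assumed to be logos and are skipped.
            self.images.append(urljoin(self.page_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "a" and self.open_links:
            self.open_links.pop()


def candidate_images(page_url, html, trusted_hosts=TRUSTED_IMAGE_HOSTS):
    """Keep images served from the page's own host or from an allowlisted host."""
    collector = ImageCollector(page_url)
    collector.feed(html)
    page_host = urlparse(page_url).netloc
    return [
        url for url in collector.images
        if urlparse(url).netloc == page_host or urlparse(url).netloc in trusted_hosts
    ]
```

The remaining candidates would still need to be downloaded and measured to apply the "biggest image" rule.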

egrunin 2010-07-17 02:55:26

Good start. One note: most big sites use CDNs that aren't on their own domain, so I wouldn't be able to ignore images from other domains.

2010-07-17 03:39:34

Answer 2

+1 A:

If the page has an og:image meta property, you can use it quite safely. It is part of the Open Graph specification, which Facebook uses to pick images for shared links.

Example of format:

<html xmlns:og="http://opengraphprotocol.org/schema/">
    <head>
        <title>The Rock (1996)</title>
        <meta property="og:image" content="http://ia.media-imdb.com/rock.jpg"/>
        ...
    </head>
    ...
</html>
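
Reading that property in a crawler could look roughly like this. It is a minimal sketch with the standard-library HTML parser; the class and function names are illustrative, and it simply returns the first og:image it sees without resolving relative URLs.

```python
from html.parser import HTMLParser


class OpenGraphImageParser(HTMLParser):
    """Remembers the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.og_image is not None:
            return
        attrs = dict(attrs)
        if attrs.get("property") == "og:image" and attrs.get("content"):
            self.og_image = attrs["content"]


def find_og_image(html):
    """Return the og:image URL declared in the page, or None if there is none."""
    parser = OpenGraphImageParser()
    parser.feed(html)
    return parser.og_image
```

If this returns None, the crawler would fall through to whatever heuristics it uses for pages without Open Graph tags.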

che 2010-07-17 06:55:04

Answer 3

A:

Well, I would look for div, span, or h1 elements with a class or id like "logo" or "top". Almost every page has its logo at the top of the page; just look at the Stack Overflow logo :)

I do it this way in my crawler and it works fine :)

Pirozek 2010-07-17 08:09:41

Answer 4

+1 A:

Try to analyze the structure of the page. Most web pages have, roughly, a header, a content area, and a footer. The content area is the most likely place to find images related to the subject of the page, so that's what you're looking for.

Find the content area

Most content areas are div elements with an ID or class named content, so that's always a good first guess. There may be alternative descriptors of the content element, so you'll need to do some research to find common patterns.

The content area will also contain multiple h1 or h2 headings in most cases, so that's another indicator to look for.
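
As a sketch of that first guess, the parser below looks for the first div whose id or class contains "content", then counts the h1/h2 headings and collects the images inside it. The names are illustrative, and it assumes reasonably well-nested markup; pages with unclosed tags would need a more forgiving parser.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContentAreaFinder(HTMLParser):
    """Finds the first div that looks like the content area and records what is in it."""

    def __init__(self, hints=("content",)):
        super().__init__()
        self.hints = hints
        self.depth = 0             # nesting depth of non-void elements
        self.content_depth = None  # depth at which the content div was opened
        self.done = False          # True once the content div has been closed
        self.heading_count = 0
        self.images = []

    def _looks_like_content(self, attrs):
        names = [attrs.get("id") or ""] + (attrs.get("class") or "").split()
        return any(hint in name.lower() for name in names for hint in self.hints)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag not in VOID_TAGS:
            self.depth += 1
            if (not self.done and self.content_depth is None
                    and tag == "div" and self._looks_like_content(attrs)):
                self.content_depth = self.depth
                return
        if self.content_depth is not None:
            if tag in ("h1", "h2"):
                self.heading_count += 1
            elif tag == "img" and attrs.get("src"):
                self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.content_depth is not None and self.depth == self.content_depth:
            self.content_depth = None
            self.done = True
        self.depth = max(0, self.depth - 1)
```

A heading count of two or more would add some confidence that the matched div really is the content area.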

Find the header and footer

Another approach is to identify the header and footer. Headers usually contain a hint to the logo of the site, such as an image, CSS class name or link to the root of the site. Footers are most likely to contain things like copyright statements.

You can also find the header and footer by analyzing the links on the page. Most internal links will be in the header and footer, while the content has relatively more outgoing links, if any.
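
One way to put a number on that link analysis is to compute, for each block of the page, the share of its links that stay on the same host. The function below is only a sketch: the name is made up, and what ratio counts as "header-like" is something you would have to tune on real pages.

```python
from urllib.parse import urljoin, urlparse


def internal_link_ratio(page_url, hrefs):
    """Return the fraction of links that point to the page's own host, or None if there are no links."""
    page_host = urlparse(page_url).netloc
    hosts = [urlparse(urljoin(page_url, href)).netloc for href in hrefs]
    if not hosts:
        return None
    internal = sum(1 for host in hosts if host == page_host)
    return internal / len(hosts)
```

Blocks with a ratio close to 1.0 would be header or footer candidates; blocks with a lower ratio, or with no links at all, would be content candidates.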

Once you have the header and footer, the content is usually in between :)

Find an image

Once you've identified the content area, the first image is usually your best pick. You should, however, ignore images with a small width and/or height, as these will likely be decorative images.
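
A minimal sketch of that selection step, assuming each image has already been measured and is represented as a dict with "src", "width" and "height" keys. That structure and the 100-pixel thresholds are assumptions for illustration, not values from any specification.

```python
MIN_WIDTH = 100   # assumed threshold, in pixels
MIN_HEIGHT = 100  # assumed threshold, in pixels


def first_substantial_image(images, min_width=MIN_WIDTH, min_height=MIN_HEIGHT):
    """Return the first image that is not small in either dimension, or None."""
    for image in images:
        if image["width"] >= min_width and image["height"] >= min_height:
            return image
    return None
```

An image is rejected when either dimension is below its threshold, which filters out spacers as well as long, thin divider graphics.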

You could also double-check the images against any included CSS files, to make sure you're not picking an image that's related to the design of the page.
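
The CSS check could be sketched as follows: pull every url(...) reference out of a stylesheet, resolve it against the stylesheet's own URL, and drop any candidate image that matches. The function names are illustrative, and the regular expression is a simplification that does not handle escaped characters inside url().

```python
import re
from urllib.parse import urljoin

# Matches url(x), url("x") and url('x'); a simplification of real CSS syntax.
CSS_URL_PATTERN = re.compile(r"""url\(\s*['"]?(.*?)['"]?\s*\)""", re.IGNORECASE)


def css_image_urls(css_text, css_url):
    """Return the absolute URLs referenced by url(...) in a stylesheet."""
    return {
        urljoin(css_url, ref)
        for ref in CSS_URL_PATTERN.findall(css_text)
        if ref and not ref.startswith("data:")
    }


def exclude_design_images(image_urls, css_text, css_url):
    """Drop candidate images that the stylesheet also uses, keeping the original order."""
    design_urls = css_image_urls(css_text, css_url)
    return [url for url in image_urls if url not in design_urls]
```

Note that relative references are resolved against the stylesheet URL, not the page URL, which is how browsers treat them.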

Fall back to an educated guess

If you cannot reliably guess the content area of the page, just use the biggest image on the page, as egrunin suggested. Again, you can check this image against the CSS files, to rule out any design-related images.

In the fall-back case, you could log the URL and review those pages to improve your image detection algorithms.
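
Put together, the fall-back might look like the sketch below. It assumes the same hypothetical image dicts with "src", "width" and "height" keys, compares images by pixel area, and uses the standard logging module to record which pages needed the fall-back; the logger name is arbitrary.

```python
import logging

logger = logging.getLogger("image_picker")


def largest_image(images):
    """Return the image with the largest pixel area, or None for an empty list."""
    return max(images, key=lambda image: image["width"] * image["height"], default=None)


def pick_with_fallback(page_url, content_images, all_images):
    """Prefer the first content-area image; otherwise fall back to the largest image on the page."""
    if content_images:
        return content_images[0]
    # Log the page so it can be reviewed later to improve the detection rules.
    logger.info("no content-area image found, falling back to largest image: %s", page_url)
    return largest_image(all_images)
```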

Niels van der Rest 2010-07-17 09:11:40
