Suppose you were given a random webpage from the internet and had only its HTML source. What method would you use to pick the single image that best describes that webpage? Assume that there are no meta tags or other hints.

Facebook does something similar when you post a link, but it gives you n images to choose from; it doesn't actually pick one unless the page has the relevant meta tags.

A: 

This is best-guess stuff, but:

  • ignoring anything hosted in another domain will eliminate most ads
  • once you've grabbed the images, you can get their size; the biggest is probably the one to use.
  • images that are inside <a> and point to the root of the domain are probably logos. Example: the SO logo on this page is inside <a href="/"></a>.

Edited to add:

It's true that large sites use auxiliary servers for their images. But you could probably make up a couple of simple parsing rules that will get 80% of cases, picking out g-ecx.images-amazon.com and static.ak.fbcdn.net as non-ad servers.
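
A rough sketch of the domain and logo rules in Python, standard library only. The allowlist just holds the two hosts named above, and the names ImageCollector and candidate_images are made up for illustration. Treat it as a starting point under those assumptions, not a tested crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed allowlist: image hosts treated as non-ad servers.
TRUSTED_IMAGE_HOSTS = {"g-ecx.images-amazon.com", "static.ak.fbcdn.net"}


class ImageCollector(HTMLParser):
    """Collects <img> URLs and flags those wrapped in a link to the site root."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.link_stack = []  # hrefs of the <a> elements currently open
        self.images = []      # (absolute_url, looks_like_logo) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.link_stack.append(attrs.get("href") or "")
        elif tag == "img" and attrs.get("src"):
            src = urljoin(self.page_url, attrs["src"])
            in_root_link = any(self._is_root(h) for h in self.link_stack)
            self.images.append((src, in_root_link))

    def handle_endtag(self, tag):
        if tag == "a" and self.link_stack:
            self.link_stack.pop()

    def _is_root(self, href):
        # True when href resolves to "/" on the page's own host.
        target = urlparse(urljoin(self.page_url, href))
        page = urlparse(self.page_url)
        return target.netloc == page.netloc and target.path in ("", "/")


def candidate_images(html, page_url):
    """Keep same-host or allowlisted images that are not probable logos."""
    collector = ImageCollector(page_url)
    collector.feed(html)
    page_host = urlparse(page_url).netloc
    keep = []
    for src, is_logo in collector.images:
        host = urlparse(src).netloc
        if not is_logo and (host == page_host or host in TRUSTED_IMAGE_HOSTS):
            keep.append(src)
    return keep
```

You would still need to download the surviving candidates to compare their real sizes.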

egrunin
Good start. One note: most big sites use CDNs that aren't on their own domain, so I wouldn't be able to ignore images from other domains.
+1  A: 

If you find an og:image meta property, you can use it quite safely: it is part of the Open Graph specification, which is used to provide images for Facebook links.

Example of format:

<html xmlns:og="http://opengraphprotocol.org/schema/">
    <head>
        <title>The Rock (1996)</title>
        <meta property="og:image" content="http://ia.media-imdb.com/rock.jpg"/>
        ...
    </head>
    ...
</html>
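
Pulling that property out could look roughly like this in Python, using only the standard library's html.parser. The names OpenGraphImageParser and find_og_image are made up for this sketch, and it simply returns the first og:image it sees.

```python
from html.parser import HTMLParser


class OpenGraphImageParser(HTMLParser):
    """Remembers the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.og_image is not None:
            return
        attrs = dict(attrs)
        if attrs.get("property") == "og:image" and attrs.get("content"):
            self.og_image = attrs["content"]


def find_og_image(html):
    """Return the og:image URL, or None when the page does not declare one."""
    parser = OpenGraphImageParser()
    parser.feed(html)
    return parser.og_image
```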
che
A: 

I would look for div, span, or h1 elements whose class or id is something like "logo" or "top". Almost every page has its logo at the top of the page; just look at the Stack Overflow logo :)

I do it this way in my crawler and it works fine :)
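
This is not the crawler's actual code, just a guess at how such a check might look in Python. The hint words, LogoFinder and find_logo_images are assumptions for illustration. Note that a plain substring match on "top" will also hit names like "laptop", so real rules would need to be stricter.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
LOGO_HINTS = ("logo", "top")  # assumed hint words, matched as substrings


class LogoFinder(HTMLParser):
    """Collects images whose own or ancestor id/class contains a hint word."""

    def __init__(self):
        super().__init__()
        self.stack = []  # (tag, hinted) for each open non-void element
        self.logos = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        label = ((attrs.get("id") or "") + " " + (attrs.get("class") or "")).lower()
        hinted = any(hint in label for hint in LOGO_HINTS)
        if tag == "img":
            inside_hint = any(flag for _, flag in self.stack)
            if attrs.get("src") and (hinted or inside_hint):
                self.logos.append(attrs["src"])
        elif tag not in VOID_TAGS:
            self.stack.append((tag, hinted))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def find_logo_images(html):
    finder = LogoFinder()
    finder.feed(html)
    return finder.logos
```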

Pirozek
+1  A: 

Try to analyze the structure of the page. Most web pages have, roughly, a header, a content area and a footer. The content area is most likely to contain images related to the subject of the page, so that's what you're looking for.

Find the content area

Most content areas are div elements with an ID or class named content, so that's always a good first guess. There may be alternative descriptors of the content element, so you'll need to do some research to find common patterns.

The content area will also contain multiple h1 or h2 headings in most cases, so that's another indicator to look for.
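
One way to combine the two indicators is to score each div: a bonus for a content-like name plus one point per h1 or h2 inside it. The Python sketch below does that with the standard library. The hint words, the bonus of 2, and the names ContentAreaScorer and guess_content_area are all assumptions you would tune against real pages.

```python
from html.parser import HTMLParser

CONTENT_HINTS = ("content", "main", "article")  # assumed naming patterns


class ContentAreaScorer(HTMLParser):
    """Scores every <div> by its name and by the h1/h2 headings it contains."""

    def __init__(self):
        super().__init__()
        self.stack = []  # (tag, index into self.areas or None) per open element
        self.areas = []  # one {"label", "score"} dict per <div> seen

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        index = None
        if tag == "div":
            names = filter(None, (attrs.get("id"), attrs.get("class")))
            label = " ".join(names).lower()
            bonus = 2 if any(hint in label for hint in CONTENT_HINTS) else 0
            self.areas.append({"label": label, "score": bonus})
            index = len(self.areas) - 1
        elif tag in ("h1", "h2"):
            # Credit the heading to every <div> that is currently open.
            for _, open_index in self.stack:
                if open_index is not None:
                    self.areas[open_index]["score"] += 1
        self.stack.append((tag, index))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def guess_content_area(html):
    """Return the id/class label of the best-scoring <div>, or None."""
    scorer = ContentAreaScorer()
    scorer.feed(html)
    if not scorer.areas:
        return None
    best = max(scorer.areas, key=lambda area: area["score"])
    return best["label"] if best["score"] > 0 else None
```

Because headings count for every enclosing div, a page-wide wrapper can outscore the real content area on heading-heavy pages; the name bonus only partly offsets that.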

Find the header and footer

Another approach is to identify the header and footer. Headers usually contain a hint to the logo of the site, such as an image, CSS class name or link to the root of the site. Footers are most likely to contain things like copyright statements.

You can also find the header and footer by analyzing the links on the page. Most internal links will be in the header and footer, while the content has relatively more outgoing links, if any.

Once you have the header and footer, the content is usually in between :)
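
The link analysis could be sketched in Python like this: count how many links in a fragment of the page stay on the same host. A high ratio suggests header or footer, a low one suggests content. LinkCounter and internal_link_ratio are hypothetical names, and where to put the cut-off is left open.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCounter(HTMLParser):
    """Counts links that stay on the page's host versus links that leave it."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).netloc
        self.internal = 0
        self.outgoing = 0

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag != "a" or not href:
            return
        target = urlparse(urljoin(self.page_url, href))
        if target.scheme not in ("http", "https"):
            return  # skip mailto:, javascript: and similar
        if target.netloc == self.page_host:
            self.internal += 1
        else:
            self.outgoing += 1


def internal_link_ratio(fragment, page_url):
    """Share of links in an HTML fragment that point back to the same host."""
    counter = LinkCounter(page_url)
    counter.feed(fragment)
    total = counter.internal + counter.outgoing
    return counter.internal / total if total else 0.0
```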

Find an image

Once you've identified the content area, the first image is usually your best pick. You should, however, ignore images with a small width and/or height, as these will likely be decorative images.
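
As a sketch, assuming you already have the HTML of the content area, this Python snippet returns the first image whose declared width and height are not small. The 100-pixel threshold and the names MIN_SIDE and first_content_image are assumptions. Images without declared dimensions are kept, since their size is unknown until you download them.

```python
from html.parser import HTMLParser

MIN_SIDE = 100  # assumed threshold in pixels; tune for your pages


def _to_int(value):
    """Parse a width/height attribute; None for missing or non-numeric values."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


class FirstLargeImage(HTMLParser):
    """Remembers the first <img> that is not declared to be small."""

    def __init__(self, min_side):
        super().__init__()
        self.min_side = min_side
        self.src = None

    def handle_starttag(self, tag, attrs):
        if tag != "img" or self.src is not None:
            return
        attrs = dict(attrs)
        sides = (_to_int(attrs.get("width")), _to_int(attrs.get("height")))
        # Skip images with a small declared side; unknown sizes pass through.
        if any(side is not None and side < self.min_side for side in sides):
            return
        if attrs.get("src"):
            self.src = attrs["src"]


def first_content_image(content_html, min_side=MIN_SIDE):
    parser = FirstLargeImage(min_side)
    parser.feed(content_html)
    return parser.src
```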

You could also double-check the images against any included CSS files, to make sure you're not picking an image that's related to the design of the page.
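
The CSS check might look like this in Python: collect every url(...) reference from a stylesheet, resolve it against the stylesheet's own URL, and drop matching candidates. The regular expression is a simplification that ignores escapes and data: URIs, and css_image_urls and drop_design_images are made-up names.

```python
import re
from urllib.parse import urljoin

# Matches url(foo.png), url('foo.png') and url("foo.png").
CSS_URL = re.compile(r"""url\(\s*['"]?([^'")]+?)['"]?\s*\)""", re.IGNORECASE)


def css_image_urls(css_text, css_url):
    """Absolute URLs referenced by url(...) in a stylesheet."""
    return {urljoin(css_url, ref.strip()) for ref in CSS_URL.findall(css_text)}


def drop_design_images(candidates, css_text, css_url):
    """Remove candidate image URLs that the stylesheet uses for page design."""
    design = css_image_urls(css_text, css_url)
    return [url for url in candidates if url not in design]
```

Relative references in a stylesheet resolve against the stylesheet's location, not the page's, which is why the function takes css_url.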

Fall back to an educated guess

If you cannot reliably guess the content area of the page, just use the biggest image on the page, as egrunin suggested. Again, you can check this image against the CSS files, to rule out any design-related images.

In the fall-back case, you could log the URL and review those pages to improve your image detection algorithms.
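
A minimal Python sketch of the fall-back, assuming the image dimensions have already been measured somewhere else and are passed in as a dict. The logger name and pick_fallback_image are hypothetical.

```python
import logging

logger = logging.getLogger("image_picker")  # hypothetical logger name


def pick_fallback_image(page_url, image_sizes):
    """Pick the image with the largest area and log the page for later review.

    image_sizes maps image URL -> (width, height) in pixels, measured elsewhere.
    """
    logger.info("content area not found, using largest image: %s", page_url)
    if not image_sizes:
        return None
    return max(image_sizes,
               key=lambda url: image_sizes[url][0] * image_sizes[url][1])
```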

Niels van der Rest