Try to analyze the structure of the page. Most web pages have, roughly, a header, a content area and a footer. The content area is the most likely place to find images related to the subject of the page, so that's what you're looking for.
Find the content area
Most content areas are div elements with an ID or class named content, so that's always a good first guess. The content element may go by other names, so you'll need to do some research to find common patterns.
The content area will also contain multiple h1 or h2 headings in most cases, so that's another indicator to look for.
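Both indicators can be combined in a few lines. This is only a rough sketch using Python's standard html.parser module; the names in CONTENT_NAMES and the helper find_content_area are my own assumptions, not a fixed list or an established API. It prefers a div whose ID or class looks content-like, and uses the number of h1/h2 headings inside it as a tie-breaker.

```python
from html.parser import HTMLParser

# Assumed names; extend this set as you discover other common patterns.
CONTENT_NAMES = {"content", "main", "article", "post"}


class ContentFinder(HTMLParser):
    """Records every div with its id/class names and the h1/h2 count inside it."""

    def __init__(self):
        super().__init__()
        self.open_divs = []   # divs that have been opened but not yet closed
        self.candidates = []  # divs that have been closed, innermost first

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            attrs = dict(attrs)
            names = set((attrs.get("class") or "").lower().split())
            if attrs.get("id"):
                names.add(attrs["id"].lower())
            self.open_divs.append({
                "names": names,
                "named": bool(names & CONTENT_NAMES),
                "headings": 0,
            })
        elif tag in ("h1", "h2"):
            # A heading counts for every div that currently encloses it.
            for div in self.open_divs:
                div["headings"] += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            self.candidates.append(self.open_divs.pop())


def find_content_area(html):
    """Return the most content-like div as a dict, or None if there are no divs."""
    finder = ContentFinder()
    finder.feed(html)
    finder.close()
    candidates = finder.candidates + finder.open_divs  # include unclosed divs
    if not candidates:
        return None
    # A matching name wins first; the heading count breaks ties.
    return max(candidates, key=lambda div: (div["named"], div["headings"]))
```

Real pages are messier than this (unclosed tags, content in section or article elements), so treat the result as a guess to be verified, not as an answer.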
Find the header and footer
Another approach is to identify the header and footer. Headers usually contain something that points to the site's logo, such as an image, a CSS class name or a link to the root of the site. Footers are most likely to contain things like copyright statements.
You can also find the header and footer by analyzing the links on the page. Most internal links will be in the header and footer, while the content has relatively more outgoing links, if any.
Once you have the header and footer, the content is usually in between :)
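The link analysis could look something like this. It's a minimal sketch that assumes the page is laid out as a handful of top-level div elements; link_profile and is_internal are hypothetical helper names. For each top-level div it counts how many links stay on the same host and how many leave it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


def is_internal(href, page_url):
    """True if href, resolved against page_url, points at the same host."""
    target = urlparse(urljoin(page_url, href))
    return target.netloc == urlparse(page_url).netloc


class LinkCounter(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.depth = 0    # current div nesting depth
        self.blocks = []  # one [label, internal, external] per top-level div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth == 0:
                label = attrs.get("id") or attrs.get("class") or "div"
                self.blocks.append([label, 0, 0])
            self.depth += 1
        elif tag == "a" and self.depth > 0 and attrs.get("href"):
            index = 1 if is_internal(attrs["href"], self.page_url) else 2
            self.blocks[-1][index] += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def link_profile(html, page_url):
    """Map each top-level div label to (internal_links, external_links)."""
    counter = LinkCounter(page_url)
    counter.feed(html)
    counter.close()
    # Note: divs that share a label overwrite each other in this sketch.
    return {label: (inside, outside) for label, inside, outside in counter.blocks}
```

Blocks with mostly internal links are header or footer candidates; a block with relatively more outgoing links is more likely to be the content.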
Find an image
Once you've identified the content area, the first image is usually your best pick. You should, however, ignore images with a small width and/or height, as these will likely be decorative images.
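As a sketch of that rule: the snippet below walks the images of the content area in document order and returns the first one that isn't obviously tiny. It only looks at the width and height attributes in the markup, which is an assumption; many pages omit them, in which case you'd have to download the image to know its size. MIN_SIDE is a made-up threshold, and first_content_image is a hypothetical helper name.

```python
from html.parser import HTMLParser

MIN_SIDE = 100  # assumed threshold in pixels; tune it for your pages


def parse_px(value):
    """Turn '640' or '640px' into 640; return None for anything else."""
    if value is None:
        return None
    text = value.strip().lower()
    if text.endswith("px"):
        text = text[:-2]
    return int(text) if text.isdigit() else None


class ImageCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images = []  # (src, width, height) in document order

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            if attrs.get("src"):
                self.images.append((
                    attrs["src"],
                    parse_px(attrs.get("width")),
                    parse_px(attrs.get("height")),
                ))


def first_content_image(content_html):
    """Return the src of the first image that is not declared as small."""
    collector = ImageCollector()
    collector.feed(content_html)
    collector.close()
    for src, width, height in collector.images:
        too_small = (
            (width is not None and width < MIN_SIDE)
            or (height is not None and height < MIN_SIDE)
        )
        # Images without a declared size are kept: we cannot judge them here.
        if not too_small:
            return src
    return None
```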
You could also double-check the images against any included CSS files, to make sure you're not picking an image that's related to the design of the page.
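A simple way to do that check is to pull every url(...) reference out of the stylesheet and compare it with the image you picked. The sketch below assumes you have already downloaded the CSS text yourself; the regular expression is a rough approximation of CSS url() syntax, not a full parser, and the function names are hypothetical.

```python
import re
from urllib.parse import urljoin

# Rough match for url(...), with or without quotes around the target.
CSS_URL = re.compile(r"url\(\s*['\"]?([^'\")]+?)['\"]?\s*\)", re.IGNORECASE)


def css_image_urls(css_text, css_url):
    """Return all url(...) targets in css_text as absolute URLs."""
    return {urljoin(css_url, match.strip()) for match in CSS_URL.findall(css_text)}


def is_design_image(img_src, page_url, css_text, css_url):
    """True if the image is also referenced from the stylesheet."""
    return urljoin(page_url, img_src) in css_image_urls(css_text, css_url)
```

Relative paths in a stylesheet are resolved against the stylesheet's own URL, not the page's, which is why both URLs are passed in.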
Fall back to an educated guess
If you cannot reliably guess the content area of the page, just use the biggest image on the page, as egrunin suggested. Again, you can check this image against the CSS files, to rule out any design-related images.
In the fall-back case, you could log the URL and review those pages to improve your image detection algorithms.
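The fall-back plus logging might be sketched like this. Again it relies on the width and height attributes being present in the markup (an assumption; images without both are skipped), and the logger name and biggest_image are placeholders of mine.

```python
import logging
from html.parser import HTMLParser

logger = logging.getLogger("image_picker")  # hypothetical logger name


class SizedImages(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images = []  # (area, src) for images with a declared size

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        width = attrs.get("width") or ""
        height = attrs.get("height") or ""
        if attrs.get("src") and width.isdigit() and height.isdigit():
            self.images.append((int(width) * int(height), attrs["src"]))


def biggest_image(html, page_url):
    """Fall back to the image with the largest declared area, logging the URL."""
    parser = SizedImages()
    parser.feed(html)
    parser.close()
    # Record the page so it can be reviewed later to improve the detection.
    logger.info("fell back to biggest image for %s", page_url)
    if not parser.images:
        return None
    return max(parser.images, key=lambda item: item[0])[1]
```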