Hello,
I have made a news aggregator, Newzupp, which I want to modify. Right now I simply display the titles of the news stories and link them to their URLs.
I am planning to make it more graphical by showing images + titles instead of plain titles. I want to know how I can get the main image of each article (somewhat similar to Google News).
One way I can think of is to extract all the images from the page and display the one that points to the same article, but I do not think that will be efficient. Is there another way of doing this?
I have found a solution:
- Fetch the contents of the URL (HTML/XML)
- Parse the content using hpricot
- Find all elements with the tag "img"
- Do some digging to work out which of them is the main display image (e.g. the 6th image in the case of Wired.com's RSS feed) — see the sketch after this list
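Here is a minimal sketch of those steps in Ruby, assuming hpricot and open-uri are available. The article URL and the "pick the Nth image" rule are only placeholders standing in for whatever per-site heuristic is used:

```ruby
require 'open-uri'
require 'hpricot'

# Rough sketch of the steps above. Which image is the "main" one
# is a per-site assumption, not a general solution.
def candidate_images(article_url)
  html = open(article_url).read                 # 1. fetch the article HTML
  doc  = Hpricot(html)                          # 2. parse it with hpricot
  (doc/"img").map { |img| img.attributes['src'] }  # 3. collect every <img> src
end

# 4. per-site rule, e.g. "take the 6th image" (placeholder URL and index)
main_image = candidate_images("http://www.wired.com/some-article")[5]
```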
I still think this is highly inefficient. I would like to know how services like Google News scrape sites and blogs and display the relevant images.