I run a couple of Twitter-powered news aggregation websites, and I have been planning to add images from the articles I find on Twitter.

If I download the page and extract images using the <img> tag, I get a bunch of images, and not all of them are relevant to the article. For example, images of buttons, icons, ads, etc. are captured. How do I extract the image accompanying the article? I know a solution exists, because the Facebook link sharer does this pretty well.

Mithun

A: 

I would guess that Facebook has a link extractor for the various sites it supports. Something like id="content" -> img (1st).

Guess I am wrong. Seems that Facebook uses the Open Graph Protocol to define which image (og:image) and which metadata to use.
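
For pages that do publish Open Graph tags, reading og:image is straightforward. Here is a rough sketch using only Python's standard library; the names OGImageParser and find_og_image are made up for illustration, and it assumes you have already downloaded the page HTML as a string.

```python
from html.parser import HTMLParser
from typing import Optional


class OGImageParser(HTMLParser):
    """Remembers the content of the first <meta property="og:image"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.og_image: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags matter, and only the first og:image is kept.
        if tag != "meta" or self.og_image is not None:
            return
        attr = dict(attrs)
        if attr.get("property") == "og:image" and attr.get("content"):
            self.og_image = attr["content"]


def find_og_image(html: str) -> Optional[str]:
    """Return the og:image URL declared in the page, or None if absent."""
    parser = OGImageParser()
    parser.feed(html)
    return parser.og_image
```

If find_og_image returns None, the page does not declare an image this way and you need some other fallback.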

Serkan
Well, OGP is something Facebook is pushing so that they can extract metadata accurately. Unfortunately, a large number of websites do not follow this standard.
mithun
+1  A: 

I kind of came up with a solution that is a bit hacky but works for me. Here is what I do to get thumbnails.

  1. Say the headline of the page I find is "this is a headline"
  2. I use this as a query to the Google Image API and then extract the first thumbnail I find.

It actually works quite well in the majority of cases. Check it out for yourself: http://cricketfresh.in

Mithun

PS: I think this is a good answer. I will give credit to anyone who comes up with a more elegant one.

mithun
A: 

Download all images from the page and blacklist any images coming from an ad server. Then find some heuristic that will get you the correct image...

I think something like:

  • Biggest resolution += 5pts
  • Biggest filesize += 10 pts
  • Jpeg += 2 pts

then take the image with the most points and throw the rest away

This probably works for the majority of sites.

(Would require some fiddling with the heuristics though)
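
To make the point scheme concrete, here is a minimal sketch in Python. It assumes each image has already been downloaded and measured; the Candidate record, the pick_article_image function, and the ads.example.com blacklist entry are all hypothetical placeholders, and the weights are just the ones listed above, not tuned values.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse

# Hypothetical blacklist; fill in the ad servers you actually see.
AD_HOSTS = {"ads.example.com"}


@dataclass
class Candidate:
    url: str
    width: int       # pixels, measured after download
    height: int      # pixels, measured after download
    size_bytes: int  # file size of the downloaded image


def pick_article_image(candidates: List[Candidate]) -> Optional[Candidate]:
    """Drop ad-server images, score the rest, return the top scorer."""
    pool = [c for c in candidates if urlparse(c.url).hostname not in AD_HOSTS]
    if not pool:
        return None
    max_area = max(c.width * c.height for c in pool)
    max_size = max(c.size_bytes for c in pool)

    def score(c: Candidate) -> int:
        pts = 0
        if c.width * c.height == max_area:
            pts += 5   # biggest resolution
        if c.size_bytes == max_size:
            pts += 10  # biggest file size
        if urlparse(c.url).path.lower().endswith((".jpg", ".jpeg")):
            pts += 2   # JPEG
        return pts

    # On a tie, max() keeps the earliest candidate in page order.
    return max(pool, key=score)
```

Judging JPEG by file extension is a shortcut; checking the Content-Type header of the download would be more reliable.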

Toad
This is the classic approach, and thank you for putting it down here. I was a bit hesitant to go down this path because I was not sure how long it would take. Like you said, it will probably work great after some tuning. A couple more factors that I found elsewhere are: 1] the path of the image, and 2] whether the image's width and height are specified.
mithun