views: 139
answers: 3
So I'm looking for ideas on how to best replicate the functionality seen on Digg. Essentially, you submit the URL of a page you're interested in; Digg then crawls the DOM to find all of the IMG tags (likely selecting only those above a certain height/width), creates a thumbnail from each, and asks you which one you would like to represent your submission.

While there's a lot going on there, I'm mainly interested in the best method to retrieve the images from the submitted page.

A: 

While you could try to fully parse the web page, HTML can be such a mess that you are best served by an approach that is close but imperfect:

  1. Extract everything that looks like an image tag reference.
  2. Try to fetch the URL.
  3. Check if you got an image back.

Just looking for and capturing the contents of src="..." attributes gets you most of the way; add some basic handling for relative vs. absolute image references and you're there.
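A minimal sketch of that approach in PHP (illustrative only; the function name, the 100x100 size cutoff, and the reliance on allow_url_fopen are my assumptions, not part of the answer):

    <?php
    // Pull src="..." values out with a regex, do rough relative-vs-absolute
    // handling, then fetch each candidate and keep it only if an image of a
    // useful size actually comes back.
    function extract_candidate_images($pageUrl)
    {
        $html = @file_get_contents($pageUrl);
        if ($html === false) {
            return array();
        }

        // Capture everything that looks like an image tag's src attribute.
        preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $matches);

        $parts  = parse_url($pageUrl);
        $origin = $parts['scheme'] . '://' . $parts['host'];
        $images = array();

        foreach (array_unique($matches[1]) as $src) {
            // Basic relative vs. absolute handling (a real implementation
            // would want a proper URL resolver).
            if (strpos($src, '//') === 0) {
                $src = $parts['scheme'] . ':' . $src;              // protocol-relative
            } elseif (!preg_match('#^https?://#i', $src)) {
                $src = ($src[0] === '/')
                    ? $origin . $src                               // root-relative
                    : rtrim(dirname($pageUrl), '/') . '/' . $src;  // path-relative
            }

            // Fetch the URL and check that an image came back; getimagesize()
            // accepts URLs when allow_url_fopen is enabled.
            $info = @getimagesize($src);
            if ($info !== false && $info[0] >= 100 && $info[1] >= 100) {
                $images[] = $src;
            }
        }
        return $images;
    }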

Obviously, any time you fetch a web asset on demand from a third party, you need to take care that you aren't being abused.

caskey
A: 

I suggest cURL + regexp.
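For instance, something along these lines (a hedged sketch; the function name and the cURL options chosen are illustrative):

    <?php
    // Fetch the page with cURL, then pull the src attributes out with a regex.
    function fetch_image_srcs($url)
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $html = curl_exec($ch);
        curl_close($ch);

        if ($html === false) {
            return array();
        }

        preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $matches);
        return array_unique($matches[1]);
    }

The regex route is quick, but it will miss images inserted by JavaScript and can trip over unusual markup, which is where the DOM-parser suggestion below tends to be more robust.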

Flavius Stef
A: 

You can also use the PHP Simple HTML DOM Parser, which makes it easy to find all of the image tags.
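For example (a small sketch assuming the library's simple_html_dom.php file is on your include path; the target URL is a placeholder):

    <?php
    require_once 'simple_html_dom.php';

    // file_get_html() downloads and parses the page; find('img') returns
    // every <img> element, jQuery-selector style.
    $html = file_get_html('http://www.example.com/');
    if ($html) {
        foreach ($html->find('img') as $img) {
            echo $img->src . "\n";
        }
    }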

Shoban
Nice. Reminds me of PHPQuery (Modeled after jQuery). This appears to be more precise though. Thanks for the suggestion.
Jonathan Sampson
May not be the best, but it's fast ;-)
Shoban