I need to scrape a remote HTML page for images and links. I need to find the image that is "most likely" the product image on the page, and the links that are "near" that image. I currently do this with a JavaScript bookmarklet, which gives me the rendered x/y coordinates of images and links and helps me determine whether they are the ones I want.

What I want is to get this information from just a URL, without the bookmarklet. The issue is that if I fetch the HTML on the server with something like HttpWebRequest, I will not have location values, since the page was never rendered in a browser. I need the locations of images and links to help me pick the ones I want.

So how can I get HTML from a remote site on the server AND use the rendered location values of the DOM elements to help me locate images and links?
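
A minimal sketch of the kind of heuristic described above, assuming that "most likely the product image" means the image with the largest rendered area and that "near" means within some pixel distance of that image's centre. The function names, the `box` field, and the distance threshold are hypothetical and are not taken from the actual bookmarklet.

```javascript
// Each box is assumed to have the shape of a getBoundingClientRect() result:
// { left, top, width, height }. All names here are illustrative.

// Centre point of a rectangle.
function center(box) {
  return { x: box.left + box.width / 2, y: box.top + box.height / 2 };
}

// Pick the image with the largest rendered area (a stand-in for "most likely").
function pickProductImage(images) {
  let best = null;
  for (const img of images) {
    const area = img.box.width * img.box.height;
    if (best === null || area > best.area) {
      best = { image: img, area };
    }
  }
  return best ? best.image : null;
}

// Return the links whose centre lies within maxDistance pixels of the
// image centre, nearest first.
function linksNear(image, links, maxDistance) {
  const c = center(image.box);
  return links
    .map((link) => {
      const p = center(link.box);
      return { link, distance: Math.hypot(p.x - c.x, p.y - c.y) };
    })
    .filter((entry) => entry.distance <= maxDistance)
    .sort((a, b) => a.distance - b.distance)
    .map((entry) => entry.link);
}
```

Both functions need only rectangles, so they would work the same whether the coordinates come from a bookmarklet or from a browser rendered on the server.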

+1  A: 

As you indicate, doing this purely through inspection of the HTML is a royal pain (especially when CSS gets involved). You could try using the WebBrowser control (which hosts IE), but I wonder if looking for an appropriate, supported API might be better (and less likely to get you blocked). If there isn't an API or similar, you probably shouldn't be doing this. So don't.

Marc Gravell
I'm talking about the same functionality that Facebook has for adding a URL to an update. Funny how on this board everyone assumes everyone else is doing something bad.
mike
A: 

You can download the page with HttpWebRequest and then use the HtmlAgilityPack to parse out the data that you need.

You can download it from http://htmlagilitypack.codeplex.com/

Chris Almond
Is it possible for the HtmlAgilityPack to get the on-screen location of each DOM element? Or, for that matter, the rendered image sizes? I'm assuming not. These properties really help to make my current bookmarklet pretty accurate.
mike
A: 

I recommend that you either code it yourself with the WebBrowser control or use one of the available toolkits that work in a web browser, like Watir or iMacros. There you can define that you want something near another element.
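
Whichever browser-hosting tool is used, the rendered positions have to be read from inside the loaded page. A rough sketch of a script that could be run there, assuming a reasonably modern browser engine; `collectBoxes` and the shape of its result are hypothetical names for illustration, not part of any of the tools mentioned.

```javascript
// Collect rendered rectangles for every image and link in a document.
// `doc` is assumed to be a live, rendered DOM document, for example
// `document` inside a bookmarklet or a script run by a hosted browser.
function collectBoxes(doc) {
  const toEntry = (el, url) => {
    // Coordinates are relative to the viewport; add the scroll offsets
    // if page-relative positions are needed.
    const r = el.getBoundingClientRect();
    return {
      url,
      box: { left: r.left, top: r.top, width: r.width, height: r.height },
    };
  };
  const images = Array.from(doc.querySelectorAll('img'), (el) => toEntry(el, el.src));
  const links = Array.from(doc.querySelectorAll('a[href]'), (el) => toEntry(el, el.href));
  return { images, links };
}
```

The result is plain data, so it could be serialized with `JSON.stringify(collectBoxes(document))` and handed back to the server-side code for the actual image and link selection.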

SamMeiers