I'm struggling to write a Windows Service that accesses a website, logs in using stored credentials, and downloads the HTML to parse it. What do you think is the best way to go about this?

A: 

You could use the WebClient class.

Here are some examples (they're written for ASP.NET, but the code applies equally well inside a service): Screen Scraping, ViewState, and Authentication using ASP.Net

Mitch Wheat
A: 

If you really have to do that (i.e. the web server doesn't provide a web service), use HttpWebRequest (http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.aspx) and parse the HTML either with Regex or with an HTML parsing library.

Or WebClient, of course.

Sheeo
-1: HTML is not a regular language and cannot, in general, be parsed using regular expressions.
John Saunders
I think you mean *should* not. Really I believe it depends on what he wants to scrape from the page, and how much--if the data's simple a regex will suffice and is heaps faster than parsing the HTML and making a tree out of it.
Sheeo
+1: Hah, strange, isn't it? I've had success using regex detecting/extracting/replacing text in HTML streams.
Blessed Geek
A: 

If it's a specific website, you may be able to send the required POST data immediately and skip parsing the login page. HttpWebRequest or WebClient is what you need: open a connection, send the POST data, and then retrieve the response. It's a bit more complicated than I feel like going into here :)
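As a rough sketch of that flow with WebClient: the form field names ("username", "password"), the two URLs, and the class and method names below are all hypothetical, and this assumes the site uses a plain form POST with a session cookie. WebClient doesn't keep cookies between requests on its own, so the sketch attaches a CookieContainer to each request.

```csharp
using System;
using System.Collections.Specialized;
using System.Net;

// WebClient does not retain cookies between requests by default, so this
// subclass attaches one shared CookieContainer to every HTTP request.
class CookieAwareWebClient : WebClient
{
    private readonly CookieContainer cookies = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest http = request as HttpWebRequest;
        if (http != null)
        {
            http.CookieContainer = cookies;
        }
        return request;
    }
}

static class Scraper
{
    // loginUrl, pageUrl and the form field names are assumptions; check the
    // real login form (or a browser's network trace) for the actual values.
    public static string DownloadAfterLogin(
        string loginUrl, string pageUrl, string user, string password)
    {
        using (var client = new CookieAwareWebClient())
        {
            var form = new NameValueCollection
            {
                { "username", user },
                { "password", password }
            };

            // POST the credentials; the session cookie lands in the container.
            client.UploadValues(loginUrl, "POST", form);

            // The same client (and cookies) then fetches the protected page.
            return client.DownloadString(pageUrl);
        }
    }
}
```

Sites that use ViewState or anti-forgery tokens would still need the login page fetched and parsed first, so this shortcut only works when the POST body is fixed.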

For parsing HTML pages, I've had success with HtmlAgilityPack.
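For instance, pulling every link out of a downloaded page with HtmlAgilityPack might look roughly like this. It's only a sketch: the LinkExtractor class and the XPath query are made up for illustration, and it assumes the HtmlAgilityPack package is referenced.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class LinkExtractor
{
    // Returns the href value of every <a> element that has one.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            return result;
        }

        foreach (HtmlNode link in links)
        {
            result.Add(link.GetAttributeValue("href", ""));
        }
        return result;
    }
}
```

The string passed in would be whatever the download step returned, e.g. the result of WebClient.DownloadString.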

Mark
A: 

You can host an IRobotX ActiveX control and run a web robot to retrieve the page.

seagulf