I need to write a script that goes to a web site, logs in, navigates to a page, and downloads (and then parses) the HTML of that page.

What I want is a standalone script, not a script that controls Firefox. I don't need any JavaScript support, just simple HTML navigation.

If nothing exists that makes this easy, then something that acts through a web browser (Firefox or Safari; I'm on a Mac) would also work.

thanks

A: 

If you wanted to use PHP, you could use the cURL functions to build your own simple web page scraper.

For an idea of how to get started, see: http://us2.php.net/manual/en/curl.examples-basic.php

Ben Gribaudo
A: 

This is probably a dumb question, since I have no knowledge of Macs, but what language are we talking about here? Also, is this a website that you have control over, or something like the spider bot Google might use when checking page content? I know that in C# you can load pages from other sites using an HttpWebRequest and a StreamReader. In JavaScript (this would only really work if you know what is supposed to be there), you could open the web page as the source of an iframe and use JavaScript to traverse the contents of all the elements on the page, or better yet, use jQuery.

Patrick
A: 

I need to write a script that goes to a web site, logs in, navigates to a page, and downloads (and then parses) the HTML of that page.

To me this just sounds like a POST or GET request to the URL of the login page could do the job. With the proper username and password parameters (depending on the form input names used on the page) set in the request, the result will be the HTML of the page, which you can then parse as you please.

This can be done with virtually any language. What language do you want to use?
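For example, with Python's standard library; the URL and the "username"/"password" field names below are placeholders, and you'd substitute whatever input names the site's login form actually uses:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder credentials and URL; copy the real form field names
# from the login page's HTML.
form = urlencode({"username": "luca", "password": "secret"}).encode("ascii")

req = Request("http://example.com/login", data=form)
# A Request carrying a data payload is sent as a POST by urlopen.
assert req.get_method() == "POST"

# html = urlopen(req).read().decode("utf-8")  # the server's response HTML
```

The commented-out `urlopen` call is where the request would actually go out; everything above it just builds the form body and request object.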

jd
Yes, you're right, I can do this. I was hoping for something richer and designed for the task, but I'll try this approach in Ruby.
luca
A: 

I have no knowledge of pre-built, general-purpose scrapers, but you may be able to find one via Google.

Writing a web scraper is definitely doable. In my very limited experience (I've written only a couple), I did not need to deal with login/security issues, but while Googling around I saw some examples that did; I'm afraid I don't remember the URLs for those pages. I did need to know some specifics about the pages I was scraping; having that made the scraper easier to write, but, of course, it limited the scraper to use on those pages. However, if you're just grabbing the entire page, you may only need the URL(s) of the page(s) in question.

Without knowing what language(s) would be acceptable to you, it is difficult to help much more. FWIW, I've done scrapers in PHP and Python. As Ben G. said, PHP has cURL to help with this; maybe there are more, but I don't know PHP very well. Python has several modules you might choose from, including lxml, BeautifulSoup, and HTMLParser.
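As a small sketch of the stdlib HTMLParser route (the HTML snippet here is made up), a subclass that collects link targets might look like this:

```python
from html.parser import HTMLParser

# Minimal link extractor built on the stdlib HTMLParser.
# BeautifulSoup or lxml would make this shorter.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<p><a href="/login">Log in</a> <a href="/help">Help</a></p>')
print(parser.links)  # ['/login', '/help']
```

You'd feed it the HTML string you downloaded and then inspect whatever state your handlers accumulated.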

Edit: If you're on Unix/Linux (or, I presume, Cygwin), you may be able to achieve what you want with wget.
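A rough sketch of the wget route; the URLs and form field names are placeholders, and the real field names come from the site's login form:

```shell
# Log in by POSTing the form fields, saving the session cookies.
wget --save-cookies cookies.txt --keep-session-cookies \
     --post-data 'username=luca&password=secret' \
     http://example.com/login

# Reuse the saved cookies to fetch the page you actually want.
wget --load-cookies cookies.txt -O page.html http://example.com/members/page
```

After that, page.html holds the HTML for your script to parse.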

PTBNL
A: 

I recently did exactly what you’re asking for in a C# project. If login is required, your first request is likely to be a POST that includes credentials. The response will usually include cookies, which persist the identity across subsequent requests. Use Fiddler to look at what form data (field names and values) is being posted to the server when you log on normally with your browser. Once you have this, you can construct an HttpWebRequest with the form data and store the cookies from the response in a CookieContainer.

The next step is to make the request for the content you actually want. This will be another HttpWebRequest with the CookieContainer attached. The response can be read by a StreamReader, which you can then convert to a string.

Each time I’ve done this it has usually been a pretty laborious process to identify all the relevant form data and recreate the requests manually. Use Fiddler extensively and compare the requests your browser makes when using the site normally with the requests coming from your script. You may also need to manipulate the request headers; again, use Fiddler to construct these by hand, get them submitting correctly with the response you expect, and then code it. Good luck!
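The same flow can be sketched with Python's standard library, where a CookieJar plays the role of the CookieContainer; the URLs and form field names below are hypothetical, and you'd replace them with whatever Fiddler shows your browser actually sending:

```python
import http.cookiejar
import urllib.request
from urllib.parse import urlencode

# The cookie jar stands in for C#'s CookieContainer: cookies set by the
# login response are sent automatically on later requests made through
# the same opener.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar)
)

# Hypothetical login form fields; copy the real names from Fiddler.
creds = urlencode({"user": "luca", "pass": "secret"}).encode("ascii")

# opener.open("http://example.com/login", data=creds)   # POST; stores cookies
# html = opener.open("http://example.com/private").read().decode("utf-8")
```

The two commented-out calls are the actual round trips: the first logs in and captures the cookies, the second reuses them to fetch the protected page.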

Troy Hunt