I want to pull the entire HTML source of a web page in Java (or Python or PHP, if it is easier in those languages). I only want to view the HTML and scan through it with a few methods, not edit or manipulate it in any way, and I would rather not write it to a new file unless there is no other way. Are there any library classes or methods that do this? If not, how else could I go about it?

+2  A: 

In Python:

import urllib.request
# Get a file-like object for the Python Web site's home page.
# (In Python 3, urlopen lives in urllib.request.)
f = urllib.request.urlopen("http://www.python.org")
# Read from the object, storing the page's contents (as bytes) in 's'.
s = f.read()
f.close()

Please see Python and HTML Processing for more details.
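
Since the question asks about scanning the page without writing it to a file, here is a minimal sketch of keeping the source in memory as a string and searching it line by line. The names `fetch_html` and `lines_containing` are made up for illustration, and the fallback to UTF-8 when the server declares no charset is an assumption.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10.0):
    """Return the document at `url` as a str held only in memory."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes
        # Assumption: fall back to UTF-8 if no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


def lines_containing(html, needle):
    """Return (line_number, stripped_line) pairs for lines containing `needle`."""
    return [
        (number, line.strip())
        for number, line in enumerate(html.splitlines(), start=1)
        if needle in line
    ]
```

For example, `lines_containing(fetch_html("http://www.python.org"), "<title")` would list the lines of the home page that contain a title tag, without anything being written to disk.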

Vijay Mathew
+5  A: 

In Java:

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

URL url = new URL("http://stackoverflow.com");
URLConnection connection = url.openConnection();
InputStream stream = connection.getInputStream();
// ... read stream like any file stream

This code is fine for scripting and internal use, but I would argue against using it in production: it doesn't handle timeouts or failed connections.

For production use I would recommend the HttpClient library. It supports authentication, redirect handling, threading, pooling, etc.
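
The concern about timeouts and failed connections applies in any language. Since the question also allows Python, here is a rough sketch of what handling them can look like with only the standard library. The function name `try_fetch`, the 5-second default, and the shape of the return value are all illustrative assumptions, not part of any library.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def try_fetch(url, timeout=5.0):
    """Return (html, None) on success or (None, reason) on failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            # Assumption: fall back to UTF-8 if no charset is declared.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace"), None
    except HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        return None, f"HTTP {exc.code}"
    except URLError as exc:
        # DNS failure, refused connection, or a timeout while connecting.
        return None, f"connection failed: {exc.reason}"
    except (socket.timeout, TimeoutError):
        # The connection was made but reading the body took too long.
        return None, "timed out"
```

A caller can then decide what to do with the reason string (retry, log, give up) instead of letting the exception end the program.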

notnoop
I think I'm doing something wrong. The compiler tells me that URLConnection cannot be instantiated (it's an abstract class). How do I instantiate it correctly, or is there a subclass of URLConnection that can be instantiated?
Brian
I think it should be: URL hp = new URL("http://stackoverflow.com"); URLConnection hpCon = hp.openConnection();
GustlyWind
@GustlyWind, thanks. Should've actually checked the code.
notnoop
A: 

Maybe you should also consider an alternative: run a standard utility like wget or curl from the command line to fetch the site tree into a local directory tree, then do your scanning (in Java, Python, whatever) on the local copy. That should be simpler than implementing all of the boring stuff like error handling, argument parsing, etc. yourself.

If you want to fetch all pages in a site, note that curl does not harvest links from HTML pages, so it cannot do this on its own; wget can follow links when run in recursive mode (-r). An alternative is to use an open source web crawler.
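
To illustrate what harvesting links from a page involves, here is a minimal sketch using Python's standard `html.parser` module. It collects the `href` of every `<a>` tag and resolves it against the page's URL. The names `LinkCollector` and `harvest_links` are hypothetical, and a real crawler would also need to handle duplicates, robots.txt, and politeness delays.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def harvest_links(html, base_url):
    """Return the links found in `html`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

Feeding each harvested link back into a fetch step is the core loop of a crawler, which is why an existing crawler is usually the easier route for a whole site.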

Stephen C