ansaurus

Question

Answer 1

+2 A:

In Python:

import urllib.request
# Get a file-like object for the Python Web site's home page.
# (Python 3; in Python 2 the equivalent call was urllib.urlopen.)
with urllib.request.urlopen("http://www.python.org") as f:
    # Read from the object, storing the page's contents (as bytes) in 's'.
    s = f.read()

Please see Python and HTML Processing for more details.

Vijay Mathew 2009-12-03 03:41:13

Answer 2

+5 A:

In Java:

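import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
URL url = new URL("http://stackoverflow.com");
// URLConnection is abstract, so obtain an instance from the URL itself.
URLConnection connection = url.openConnection();
InputStream stream = connection.getInputStream();
// ... read stream like any file stream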

This code is fine for scripting purposes and internal use. I would argue against using it in production, though: it doesn't handle timeouts or failed connections.
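For what it's worth, here is a minimal sketch of how timeouts and failed connections might be handled while staying with plain URLConnection. The class name PageFetcher, the method names, the timeout parameters, and the choice to decode the body as UTF-8 are all assumptions made for illustration, not part of the answer above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper class, named here only for the example.
public class PageFetcher {

    // Returns the page body, or Optional.empty() if anything goes wrong.
    public static Optional<String> fetch(String address, int connectTimeoutMs, int readTimeoutMs) {
        try {
            URLConnection connection = new URL(address).openConnection();
            // Without these calls the default timeout is 0, i.e. wait forever.
            connection.setConnectTimeout(connectTimeoutMs);
            connection.setReadTimeout(readTimeoutMs);
            try (InputStream in = connection.getInputStream()) {
                return Optional.of(readAll(in));
            }
        } catch (IOException e) {
            // Covers malformed URLs, refused connections, and timeouts
            // (SocketTimeoutException is a subclass of IOException).
            System.err.println("Fetch failed for " + address + ": " + e.getMessage());
            return Optional.empty();
        }
    }

    // Drains the stream and decodes it as UTF-8 (an assumption; real pages
    // may declare a different charset).
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

A call such as PageFetcher.fetch("http://stackoverflow.com", 5000, 10000) would then give up after 5 seconds of connecting or 10 seconds of waiting for data, instead of blocking indefinitely.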

I would recommend using the HttpClient library for production use. It supports authentication, redirect handling, threading, pooling, etc.

notnoop 2009-12-03 03:44:10

I think I'm doing something wrong. The compiler tells me that URLConnection cannot be instantiated (it's an abstract class). How do I instantiate it correctly, or is there a subclass of URLConnection that can be instantiated?

Brian 2009-12-03 03:56:32

I think it should be:

URL hp = new URL("http://stackoverflow.com");
URLConnection hpCon = hp.openConnection();

GustlyWind 2009-12-03 04:10:30

@GustlyWind, thanks. Should've actually checked the code.

notnoop 2009-12-03 05:27:53

Answer 3

A:

Maybe you should also consider an alternative: run a standard utility such as wget or curl from the command line to fetch the site tree into a local directory tree, then do your scanning (in Java, Python, whatever) on the local copy. That should be simpler than implementing all of the boring stuff like error handling, argument parsing, etc. yourself.
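As a rough illustration of the "scan the local copy" half of that idea, the following sketch lists the HTML files under a directory that some download tool has already populated. The class name LocalMirrorScanner, the method name, and the .html/.htm extension filter are assumptions for the example; the download step itself is not shown.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class, named here only for the example.
public class LocalMirrorScanner {

    // Walks the directory tree under 'root' and returns every regular file
    // whose name ends in .html or .htm (case-insensitive), in sorted order.
    public static List<Path> findHtmlFiles(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(LocalMirrorScanner::looksLikeHtml)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            // Rethrow unchecked so callers are not forced to handle it.
            throw new UncheckedIOException(e);
        }
    }

    private static boolean looksLikeHtml(Path file) {
        String name = file.getFileName().toString().toLowerCase();
        return name.endsWith(".html") || name.endsWith(".htm");
    }
}
```

Each returned path can then be read with Files.readAllBytes and scanned however you like, with no networking code involved.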

If you want to fetch all pages in a site, note that curl does not harvest links from HTML pages; wget can follow them, but only in its recursive mode. An alternative is to use an open source web crawler.
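To give a feel for what "harvesting links" involves, here is a deliberately naive sketch that pulls href values out of an HTML string with a regular expression. The class name LinkHarvester and the pattern are assumptions for illustration; a regex will miss unquoted attributes and will also match links inside comments or scripts, so a real crawler or a proper HTML parser is the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for the example.
public class LinkHarvester {

    // Matches href="..." or href='...' in any letter case; group 1 is the target.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the href targets in the order they appear in the markup.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }
}
```

The returned values are whatever the page wrote, so relative links such as "a.html" still need to be resolved against the page's own URL before they can be fetched.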

Stephen C 2009-12-03 06:25:36


Pulling HTML from a Webpage in Java
