I would like to fetch a web page's HTML and save it to a String so I can do some processing on it.

How would I go about doing that using Java?

+3  A: 

Well, you could go with the built-in libraries such as URL and URLConnection, but they don't give very much control.

Personally I'd go with the Apache HTTPClient library.

Jon Skeet
Is there no Java version of System.Net.WebRequest?
FlySwat
Sort of, that would be URL. :-) For example: new URL("http://www.google.com").openStream() // => InputStream
Daniel Spiewak
@Jonathan: What Daniel said, for the most part - although WebRequest gives you more control than URL. HTTPClient is closer in functionality, IMO.
Jon Skeet
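
To make Daniel's one-liner concrete, here is a minimal sketch of reading a page into a String using only the built-in classes. It assumes Java 9 or later (for InputStream.readAllBytes) and that the page is encoded as UTF-8. The class name SimpleFetch, the method name fetch, and the timeout values are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class SimpleFetch {
    // Hypothetical helper: downloads the resource at urlStr and returns it as text.
    // Assumption: the body is UTF-8; a real page may declare a different charset.
    static String fetch(String urlStr) throws IOException {
        URLConnection conn = new URL(urlStr).openConnection();
        conn.setConnectTimeout(10_000); // milliseconds; arbitrary example value
        conn.setReadTimeout(10_000);    // milliseconds; arbitrary example value
        // try-with-resources closes the stream even if reading fails
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java SimpleFetch <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

The two timeouts are roughly the extent of the control URLConnection offers at this level, which is why a fuller HTTP client may be worth it for anything more involved.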
A: 

On a Unix/Linux box you could just run 'wget', but this is not really an option if you're writing a cross-platform client. Of course, this assumes that you don't need to do much with the data between downloading it and writing it to disk.

Timo Geusch
I would also start with this approach and refactor it later if it proves insufficient.
Dustin Getz
+7  A: 

Here's some tested code using Java's URL class. I'd recommend doing a better job than I do here of handling the exceptions or passing them up the call stack, though.

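import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;

URL url;
InputStream is = null;
DataInputStream dis;
String line;

try {
    url = new URL("http://stackoverflow.com/");
    is = url.openStream();  // throws an IOException
    dis = new DataInputStream(new BufferedInputStream(is));

    // print the page one line at a time
    while ((line = dis.readLine()) != null) {
        System.out.println(line);
    }
} catch (MalformedURLException mue) {
    mue.printStackTrace();
} catch (IOException ioe) {
    ioe.printStackTrace();
} finally {
    try {
        // is is still null if openStream() failed, so guard against a NullPointerException
        if (is != null) {
            is.close();
        }
    } catch (IOException ioe) {
        // nothing to see here
    }
}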
Bill the Lizard
DataInputStream.readLine() is deprecated, but other than that very good example. I used an InputStreamReader() wrapped in a BufferedReader() to get the readLine() function.
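
The BufferedReader variant described in this comment might look roughly like the sketch below, which also collects the lines into a String as the question asks. The class name PageReader and method name readPage are hypothetical, and UTF-8 is an assumption; a real page may use a different charset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PageReader {
    // Hypothetical helper: reads the page line by line and returns it as one String.
    // Assumption: the content is UTF-8. Line terminators are normalized to '\n'.
    static String readPage(URL url) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream) even on error
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PageReader <url>");
            return;
        }
        System.out.print(readPage(new URL(args[0])));
    }
}
```

Because the reader is declared in a try-with-resources statement, no finally block or null check is needed to close the stream.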
mjh2007
A: 

Already answered here.

Scott Bennett-McLeish
+2  A: 

Bill's answer is very good, but you may want to do some things with the request, like enabling compression or setting a user-agent. The following code shows how you can add support for various types of compression to your requests.

URL url = new URL(urlStr);
HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // Cast shouldn't fail
HttpURLConnection.setFollowRedirects(true);
// allow both GZip and Deflate (ZLib) encodings
conn.setRequestProperty("Accept-Encoding", "gzip, deflate");
String encoding = conn.getContentEncoding();
InputStream inStr = null;

// create the appropriate stream wrapper based on
// the encoding type
if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
    inStr = new GZIPInputStream(conn.getInputStream());
} else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
    inStr = new InflaterInputStream(conn.getInputStream(),
      new Inflater(true));
} else {
    inStr = conn.getInputStream();
}

To also set the user-agent, add the following line before the connection is opened (that is, before the call to getContentEncoding()):

conn.setRequestProperty("User-Agent", "my agent name");
jjnguy
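
Putting the pieces of this answer together, a self-contained version might look like the sketch below. The class name CompressedFetch and the method names decode and fetch are hypothetical. It assumes Java 9 or later (for readAllBytes) and a UTF-8 body, and it keeps the answer's choice of Inflater(true), which expects raw deflate data without zlib headers.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

class CompressedFetch {
    // Wraps the raw response stream according to the Content-Encoding header value.
    // A null or unrecognized encoding returns the stream unchanged.
    static InputStream decode(String encoding, InputStream raw) throws IOException {
        if ("gzip".equalsIgnoreCase(encoding)) {
            return new GZIPInputStream(raw);
        } else if ("deflate".equalsIgnoreCase(encoding)) {
            // nowrap = true: raw deflate without zlib headers, as in the answer above
            return new InflaterInputStream(raw, new Inflater(true));
        }
        return raw;
    }

    // Hypothetical helper: fetches urlStr with the given user-agent and returns the body.
    // Assumption: the decompressed body is UTF-8.
    static String fetch(String urlStr, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlStr).openConnection();
        // Request headers must be set before the connection is opened
        conn.setRequestProperty("Accept-Encoding", "gzip, deflate");
        conn.setRequestProperty("User-Agent", userAgent);
        // getContentEncoding() implicitly connects and reads the response headers
        try (InputStream in = decode(conn.getContentEncoding(), conn.getInputStream())) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```

Keeping decode separate from the network call makes the decompression logic easy to check with in-memory data.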
A: 

Nice post.

I did it as follows:

http://no-suelo.blogspot.com/2010/09/java-obtener-codigo-fuente-de-una.html

Regards

Albert Asensio