I have the following Java code to fetch the entire contents of an HTML page at a given URL. Can this be done in a more efficient way? Any improvements are welcome.
public static String getHTML(final String url) throws IOException {
    if (url == null || url.length() == 0) {
        throw new IllegalArgumentException("url cannot be null or empty");
    }

    final HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    final BufferedReader buf = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    final StringBuilder page = new StringBuilder();
    final String lineEnd = System.getProperty("line.separator");
    String line;
    try {
        while (true) {
            line = buf.readLine();
            if (line == null) {
                break;
            }
            page.append(line).append(lineEnd);
        }
    } finally {
        buf.close();
    }
    return page.toString();
}
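For context, a typical call looks like this (the URL is just a placeholder):

// Placeholder URL; any reachable HTTP page would do.
final String html = getHTML("http://www.example.com/");
System.out.println("Fetched " + html.length() + " characters");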
I can't help but feel that the line reading is less than optimal. I know that I'm possibly masking a MalformedURLException thrown by the new URL(url) call, and I'm okay with that.
My function also has the side effect of normalizing the HTML's line terminators to the current system's. This isn't a requirement.
I realize that network I/O will probably dwarf the time it takes to read the HTML, but I'd still like to know whether this is optimal.
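For comparison, here's a rough sketch of the chunk-based read I had in mind instead of readLine (getHTMLChunked is just a name for the sketch; the 8192 buffer size is an arbitrary choice, and it keeps whatever line terminators the server sends, using the platform default charset as in my original version):

public static String getHTMLChunked(final String url) throws IOException {
    final HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    final Reader in = new InputStreamReader(conn.getInputStream());
    final StringBuilder page = new StringBuilder();
    final char[] chunk = new char[8192]; // arbitrary buffer size
    try {
        int read;
        // Append each chunk as it arrives instead of splitting on lines.
        while ((read = in.read(chunk)) != -1) {
            page.append(chunk, 0, read);
        }
    } finally {
        in.close();
    }
    return page.toString();
}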
On a side note: it would be awesome if StringBuilder had a constructor for an open InputStream that would simply take the entire contents of the InputStream and read it into the StringBuilder.
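To make that concrete, a helper along these lines is roughly what I imagine such a constructor doing internally (toStringBuilder is just a name I made up; it leans on the Scanner "\\A" delimiter trick and the platform default charset, and needs java.util.Scanner):

private static StringBuilder toStringBuilder(final InputStream in) {
    // "\\A" matches the beginning of input, so next() returns the whole stream.
    final Scanner scanner = new Scanner(in).useDelimiter("\\A");
    try {
        return new StringBuilder(scanner.hasNext() ? scanner.next() : "");
    } finally {
        scanner.close();
    }
}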