I noticed strange behavior when using the Apache HttpClient 3.x library, and I want to know why it occurs. I created some sample code to demonstrate it. Consider the following code:
//Example URL
String url = "http://rads.stackoverflow.com/amzn/click/05961580";
GetMethod get = new GetMethod(url);
HttpMethodRetryHandler httpHandler = new DefaultHttpMethodRetryHandler(1, false);
get.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, httpHandler );
get.getParams().setCookiePolicy(CookiePolicy.IGNORE_COOKIES);
HttpConnectionManager connectionManager = new SimpleHttpConnectionManager();
HttpClient client = new HttpClient( connectionManager );
client.getParams().setParameter("http.useragent", FIREFOX );
String line;
StringBuilder stringBuilder = new StringBuilder();
String toStreamBody = null;
String toStringBody = null;
try {
    int statusCode = client.executeMethod(get);
    if (statusCode != HttpStatus.SC_OK) {
        System.err.println("Internet Status: " + HttpStatus.getStatusText(statusCode));
        System.err.println("While getting page: " + url);
    }
    // toString: read the whole body as a String
    toStringBody = get.getResponseBodyAsString();
    // toStream: read the body line by line from the stream
    InputStreamReader isr = new InputStreamReader(get.getResponseBodyAsStream());
    BufferedReader rd = new BufferedReader(isr);
    while ((line = rd.readLine()) != null) {
        stringBuilder.append(line);
    }
} catch (java.io.IOException ex) {
    System.out.println("Failed to get page: " + url);
} finally {
    get.releaseConnection();
}
toStreamBody = stringBuilder.toString();
This code prints nothing:
System.out.println(toStringBody); // ""
This code prints the web page:
System.out.println(toStreamBody); // "Whole Page"
But it gets even stranger... Replace:
get.getResponseBodyAsString();
With:
get.getResponseBodyAsString(150000);
Now we get the error:
Failed to get page: http://www.amazon.com/gp/offer-listing/0596158068/ref=dp_olp_used?ie=UTF8
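The catch block above prints only a fixed message, so the actual exception is hidden. A minimal sketch of how the exception type and message could be surfaced is shown here, using only the JDK. The `fetch` method is a hypothetical stand-in for the HttpClient call and simply throws, and `ExceptionReport` and `describe` are names made up for illustration:

```java
import java.io.IOException;

public class ExceptionReport {
    // Hypothetical stand-in for the HttpClient call; it always fails so the
    // catch block below has something to report.
    static String fetch(String url) throws IOException {
        throw new IOException("simulated failure for " + url);
    }

    // Builds a message that keeps the exception type and its detail text.
    static String describe(IOException ex) {
        return ex.getClass().getName() + ": " + ex.getMessage();
    }

    public static void main(String[] args) {
        String url = "http://example.com/";
        try {
            fetch(url);
        } catch (IOException ex) {
            System.out.println("Failed to get page: " + url);
            // Print the exception class and message instead of discarding them.
            System.out.println(describe(ex));
        }
    }
}
```

Applying the same idea to the catch block in my code should show which subclass of IOException is thrown when a maximum length is passed.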
I was unable to find another website besides Amazon that reproduces this behavior, but I assume there are others.
I am aware that the documentation at http://hc.apache.org/httpclient-3.x/performance.html discourages the use of getResponseBodyAsString(). However, it does not say that the page will not load, only that you may be at risk of an out-of-memory error. Is it possible that getResponseBodyAsString() is returning before the page has finished loading? Why does this only happen with Amazon?