I noticed a strange phenomenon when using the Apache HttpClient library and I want to know why it occurs. I created some sample code to demonstrate it. Consider the following code:

//Example URL
 String url = "http://rads.stackoverflow.com/amzn/click/05961580";
 GetMethod get = new GetMethod(url);
 HttpMethodRetryHandler httpHandler = new DefaultHttpMethodRetryHandler(1, false);
 get.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, httpHandler );
 get.getParams().setCookiePolicy(CookiePolicy.IGNORE_COOKIES);
 HttpConnectionManager connectionManager = new SimpleHttpConnectionManager();
 HttpClient client = new HttpClient( connectionManager );
 client.getParams().setParameter("http.useragent", FIREFOX );
 String line;
 StringBuilder stringBuilder = new StringBuilder();
 String toStreamBody = null;
 String toStringBody = null;
 try {
  int statusCode = client.executeMethod(get);
  if( statusCode != HttpStatus.SC_OK ){
   System.err.println("Internet Status: " + HttpStatus.getStatusText(statusCode) );
   System.err.println("While getting page: " + url );
  }
 //toString
  toStringBody = get.getResponseBodyAsString();
 //toStream
  InputStreamReader isr = new InputStreamReader(get.getResponseBodyAsStream());
  BufferedReader rd = new BufferedReader(isr);
  while ((line = rd.readLine()) != null) {
   stringBuilder.append(line);
  }
 } catch (java.io.IOException ex) {
  System.out.println( "Failed to get page: " + url);
 } finally {
  get.releaseConnection();
 }       
 toStreamBody = stringBuilder.toString();

This code prints nothing:

 System.out.println(toStringBody); // ""

This code prints the web page:

 System.out.println(toStreamBody); // "Whole Page"

But it gets even stranger... Replace:

get.getResponseBodyAsString();

With:

 get.getResponseBodyAsString(150000);

Now we get the error: Failed to get page: http://www.amazon.com/gp/offer-listing/0596158068/ref=dp_olp_used?ie=UTF8

I was unable to find another website besides Amazon that reproduces this behavior, but I assume there are others.

I am aware that the documentation at http://hc.apache.org/httpclient-3.x/performance.html discourages the use of getResponseBodyAsString(), but it does not say that the page will not load, only that you risk an out-of-memory exception. Is it possible that getResponseBodyAsString() is returning before the page has loaded? Why does this only happen with Amazon?

A: 

Did you test with any other URL?

The URL in the code you provided redirects with a 302 to http://www.amazon.com/dp/05961580/?tag=stackoverfl08-20, which then returns 404 (Not Found).

HttpClient does not handle redirects: http://hc.apache.org/httpclient-3.x/redirects.html
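
To see whether a response like the 302 above is a redirect and where it points, a small check along these lines may help. This is only a sketch using the standard library: the class and method names (RedirectCheck, isRedirect, resolveLocation) are made up for illustration, and the status code and Location value are hard-coded from the example above rather than read from a real response.

```java
import java.net.URI;

public class RedirectCheck {
    // Status codes commonly treated as redirects that carry a Location header.
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
            || status == 307 || status == 308;
    }

    // Resolve the Location header value against the URL that was requested,
    // so that relative redirect targets also yield an absolute URL.
    static String resolveLocation(String requestUrl, String location) {
        return URI.create(requestUrl).resolve(location).toString();
    }

    public static void main(String[] args) {
        // Values taken from the example above (assumed, not fetched live).
        String requestUrl = "http://rads.stackoverflow.com/amzn/click/05961580";
        String location = "http://www.amazon.com/dp/05961580/?tag=stackoverfl08-20";
        int status = 302;

        if (isRedirect(status)) {
            System.out.println("Redirected to: " + resolveLocation(requestUrl, location));
        } else {
            System.out.println("Not a redirect, status: " + status);
        }
    }
}
```

With the real client you would take the status from executeMethod and the target from the response's Location header, then decide whether to issue a second request to the resolved URL.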

Peter Knego
Oh that's not the link. I'll try to change it back.
Bob
The site is http://www.amazon.com/gp/offer-listing/0596158068/ref=dp_olp_used?ie=UTF8
Bob
OK, for some reason the link gets changed by the website. There's nothing I can do about that.
Bob