Hi, I am using the Apache HttpClient for Java and I'm facing a really strange issue. Sometimes when I request a dynamically generated page it returns the actual content, but other times (with a different parameter) all I get is a short sequence of \t, \r and \n characters.

How can I track what is happening in the different cases in order to find where the bug is?

My usage of the library is pretty straightforward. All I do is make these few calls on an initialized HttpClient object:

String content = "/pageIwant.jsp?parameter=10101010";
HttpGet request = new HttpGet(content);
HttpResponse response = client.execute(targetHost, request);
HttpEntity entity = response.getEntity();
String page = EntityUtils.toString(entity);
A: 

The way I would approach this is to start by attempting to fetch the same page using a web browser. If you cannot get that to work, it is probably safe to conclude that the real problem is with the server, and you'll need to talk to the server's support staff.

If a browser works, try and repeat the process using the wget utility. If wget gives you problems, go back to your browser and find out exactly what headers the browser is sending in the HTTP request and try to get wget to use the same headers. Once you've got wget to work, make a note of the headers.

Finally, return to your Java code and modify it so that the HTTP request headers it sends are the same as those that work for wget.
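As a rough sketch of that last step: the headers you noted can be kept as plain "Name: value" text, parsed into pairs, and then applied to the request one by one. The class name `HeaderNotes` and the sample header values below are hypothetical; with the `HttpGet` from the question, each pair would be applied with `request.setHeader(name, value)`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HeaderNotes {
    /** Parses "Name: value" lines, e.g. copied from a browser or a wget debug dump. */
    public static Map<String, String> parse(String rawHeaders) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String rawLine : rawHeaders.split("\\r?\\n")) {
            String line = rawLine.trim();
            int colon = line.indexOf(':');
            // Skip blank lines and anything that is not a header,
            // such as the request line "GET /page HTTP/1.1".
            if (colon <= 0 || line.substring(0, colon).contains(" ")) {
                continue;
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            headers.put(name, value);
        }
        return headers;
    }

    public static void main(String[] args) {
        // Hypothetical notes taken from a working browser/wget request.
        String noted = "GET /pageIwant.jsp?parameter=10101010 HTTP/1.1\n"
                + "User-Agent: Mozilla/5.0\n"
                + "Accept-Language: en-US,en;q=0.8\n";
        for (Map.Entry<String, String> h : parse(noted).entrySet()) {
            // With Apache HttpClient this is where you would call:
            // request.setHeader(h.getKey(), h.getValue());
            System.out.println(h.getKey() + " = " + h.getValue());
        }
    }
}
```

Adding the headers one at a time and re-testing after each makes it easier to see which one the server actually depends on.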

Yes, I have to authenticate using the proxy of my university and then I am able to access all the data. The proxy authentication is working flawlessly for the 'journal page' and even for other sites, so I'd exclude that the problem is related to that.

I think you may have excluded the real problem. @BalasC is not talking about proxy authentication. Rather, he is talking about authentication at the IEEE site. Just because one part of the site appears to work without authentication does not mean that all of it will. (However, I'd have thought that the site would respond with a "FORBIDDEN" or "AUTHORIZATION REQUIRED" error rather than delivering strange content.)

Another possibility is that the site is trying to prevent "screen scraping" of its content by automated tools. Check the site's "Terms of Service" to see if what you are trying to do is allowed. (You may choose to ignore the ToS and circumvent the technical measures, but then you might find your own or your organization's IP address blocked, or you might be on the receiving end of cease-and-desist letters about copyright violation.)

Stephen C
Thanks for your reply, I'll try it tomorrow and I'll let you know
mariosangiorgio
Unfortunately, even with wget I get the same output. Does anybody have any idea how I could get the data? With the browser I am able to see all the pages.
mariosangiorgio
*"If wget gives you problems, go back to your browser and find out exactly what headers the browser is sending in the HTTP request and try to get wget to use the same headers."* Did you try doing that? It is the key to solving your problem.
Stephen C
Thank you so much, really appreciated. I did what you suggested and I fixed the bug. Thanks again!
mariosangiorgio
You are welcome :-)
Stephen C
A: 

I found the solution to my problem: I was missing some header information that is apparently required only by part of the dynamic page.

To solve my issue, I first used Wireshark to inspect the communication between the browser and the server, and then I added all the headers I was missing.

I found out that in my case I needed to specify the 'Accept-Language' header.
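For reference, here is a minimal sketch of sending that header. It uses the JDK's built-in `java.net.http` API (Java 11+) as a self-contained stand-in rather than Apache HttpClient, and the host `example.org`, the class name and the language value are placeholders, not the real ones. With the `HttpGet` from the question, the equivalent call should be `request.setHeader("Accept-Language", ...)`.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class AcceptLanguageExample {
    /** Builds a GET request that carries an explicit Accept-Language header. */
    public static HttpRequest build(String url, String languages) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept-Language", languages)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Placeholder host and language preference; substitute the values
        // observed in the browser's own request.
        HttpRequest request = build(
                "http://example.org/pageIwant.jsp?parameter=10101010",
                "en-US,en;q=0.8");
        // Print the headers that would be sent, without contacting the server.
        System.out.println(request.headers().map());
    }
}
```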

mariosangiorgio