ansaurus

Question

Answer 1

+3 A:

The way I would approach this is to start by fetching the same page in a web browser. If you cannot get that to work, it is probably safe to conclude that the real problem is with the server, and you'll need to talk to the server's support staff.

If the browser works, try to repeat the process using the wget utility. If wget gives you problems, go back to your browser, find out exactly which headers it sends in the HTTP request, and get wget to send the same headers (wget's --header option adds one header per use). Once you've got wget to work, make a note of the headers.

Finally, return to your Java code and modify it so that the HTTP request headers it sends are the same as those that work for wget.
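For illustration only, here is a minimal sketch of that last step. It uses the JDK's built-in `java.net.http` API (Java 11+) rather than Apache HttpClient, so that it has no external dependencies. The class name `BrowserLikeRequest`, the URL, and the header values are all made up for the example; substitute the headers you actually captured. With Apache HttpClient 4.x the equivalent should be a call to `setHeader(name, value)` on the `HttpGet` object for each header.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;

class BrowserLikeRequest {

    // Builds a GET request carrying the given headers. Nothing is sent here;
    // pass the result to java.net.http.HttpClient#send to perform the request.
    public static HttpRequest build(String url, Map<String, String> headers) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        for (Map.Entry<String, String> h : headers.entrySet()) {
            builder.header(h.getKey(), h.getValue());
        }
        return builder.build();
    }

    public static void main(String[] args) {
        // Hypothetical values: replace with the headers that worked for wget.
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)");
        headers.put("Accept", "text/html,application/xhtml+xml");
        headers.put("Accept-Language", "en-US,en;q=0.5");

        HttpRequest request = build("http://example.com/", headers);

        // Print what would be sent, to compare against the browser's request.
        request.headers().map().forEach(
                (name, values) -> System.out.println(name + ": " + values));
    }
}
```

Keeping the headers in a map makes it easy to add or remove them one at a time until you find which one the server actually cares about.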

*"Yes, I have to authenticate using the proxy of my university and then I am able to access all the data. The proxy authentication is working flawlessly for the 'journal page' and even for other sites, so I'd exclude that the problem is related to that."*

I think you may have excluded the real problem. @BalasC is not talking about proxy authentication. Rather, he is talking about authentication at the IEEE site. And just because one part of the site appears to work without authentication does not mean that all of it will. (However, I'd have thought that the site would respond with a "FORBIDDEN" or "AUTHORIZATION REQUIRED" error rather than delivering strange content.)

Another possibility is that the site is trying to prevent "screen scraping" of its content by automated tools. Check the site's "Terms of Service" to see whether what you are trying to do is allowed. (You may choose to ignore the ToS and circumvent the technical measures, but then you might find your own or your organization's IP address blocked, or you might be on the receiving end of cease-and-desist letters talking about copyright violation.)

Stephen C 2010-05-30 05:09:31

Thanks for your reply. I'll try it tomorrow and let you know.

mariosangiorgio 2010-05-30 19:58:58

Unfortunately, I get the same output even with wget. Does anybody have any idea how I could get the data? With the browser I am able to see all the pages.

mariosangiorgio 2010-05-31 22:42:23

*"If wget gives you problems, go back to your browser and find out exactly what headers the browser is sending in the HTTP request and try to get wget to use the same headers."* Did you try doing that? It is the key to solving your problem.

Stephen C 2010-05-31 22:52:19

Thank you so much, really appreciated. I did what you suggested and I fixed the bug. Thanks again!

mariosangiorgio 2010-05-31 22:55:51

You are welcome :-)

Stephen C 2010-05-31 23:53:55

Answer 2

A:

I found the solution to my problem: I was missing some header information that is apparently required only by part of the dynamic page.

To solve the issue, I first used Wireshark to inspect the traffic between the browser and the server, and then I added all the headers I was missing.

I found out that in my case I needed to specify the 'Accept-Language' header.
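The comparison step can also be done mechanically once both sets of headers have been copied out of the capture. The following is only a sketch under that assumption: `HeaderDiff` and `missing` are hypothetical names, and the header values are invented for the example. It reports headers that the browser sent and the Java client did not.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class HeaderDiff {

    // Returns the headers present in `browser` but absent from `client`.
    // Header names are compared case-insensitively, as in HTTP.
    public static Map<String, String> missing(Map<String, String> browser,
                                              Map<String, String> client) {
        Map<String, String> sent = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        sent.putAll(client);

        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> h : browser.entrySet()) {
            if (!sent.containsKey(h.getKey())) {
                result.put(h.getKey(), h.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical headers as copied from a capture of the browser request.
        Map<String, String> browser = new LinkedHashMap<>();
        browser.put("User-Agent", "Mozilla/5.0");
        browser.put("Accept", "text/html");
        browser.put("Accept-Language", "en-US,en;q=0.5");

        // Hypothetical headers as sent by the Java client.
        Map<String, String> client = new LinkedHashMap<>();
        client.put("user-agent", "Apache-HttpClient");
        client.put("Accept", "*/*");

        System.out.println(missing(browser, client));
        // prints {Accept-Language=en-US,en;q=0.5}
    }
}
```

Note that this only detects headers that are absent altogether. A header that is present with a different value (such as `User-Agent` above) is not reported, so those still need to be compared by eye.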

mariosangiorgio 2010-05-31 22:53:16


Apache HTTPClient returns an empty page
