ansaurus

Question

Need help in getting HTML of a website in Java

Answer 1

+7 A:

That site incorrectly gzips the response regardless of what the client supports. Normally a server should gzip the response only when the client advertises support for it (via the Accept-Encoding: gzip request header). You need to decompress it using GZIPInputStream.

reader = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream()), "UTF-8"));

Note that I also added the right charset to the InputStreamReader constructor. Normally you'd like to extract it from the Content-Type header of the response.
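To make both points concrete, here is a minimal sketch rather than a definitive implementation. It wraps the stream in GZIPInputStream only when the body really starts with the gzip magic bytes (0x1f 0x8b), and it takes the charset from the Content-Type header, falling back to UTF-8. The class name HtmlFetcher and the helper names maybeGunzip, charsetFrom, readAll and fetch are made up for illustration; the UTF-8 fallback is my assumption, not something the site guarantees.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class, for illustration only.
class HtmlFetcher {

    // Peeks at the first two bytes and decompresses only if they are the gzip magic number.
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        byte[] magic = new byte[2];
        int n = 0;
        while (n < 2) {
            int r = in.read(magic, n, 2 - n);
            if (r < 0) break;          // stream shorter than two bytes
            n += r;
        }
        if (n > 0) in.unread(magic, 0, n); // push the peeked bytes back
        boolean gzipped = n == 2
                && (magic[0] & 0xff) == 0x1f
                && (magic[1] & 0xff) == 0x8b;
        return gzipped ? new GZIPInputStream(in) : in;
    }

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption).
    static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed charset name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Reads the whole body into a String, decompressing and decoding as needed.
    static String readAll(InputStream body, String contentType) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(maybeGunzip(body), charsetFrom(contentType)))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Opens the URL and returns the page source as a String.
    static String fetch(String url) throws IOException {
        URLConnection connection = new URL(url).openConnection();
        return readAll(connection.getInputStream(), connection.getContentType());
    }
}
```

Calling HtmlFetcher.fetch(url) with the page address should then return the HTML as a String whether or not the server compressed it. Sniffing the magic bytes instead of trusting the Content-Encoding header is a deliberate choice here, since the server in question is already misbehaving with respect to the client's headers.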

For more hints, see also How to use URLConnection to fire and handle HTTP requests? If all you ultimately want is to parse or extract information from the HTML, then I strongly recommend using an HTML parser like Jsoup instead.

BalusC 2010-08-04 14:06:46

Wow, it worked. Thanks for the explanation, and a big thanks for the snippet as well. I initially tried using HTMLCleaner as my parser, but I was getting the same issue. Now I am going to feed this HTML string into HTMLCleaner.

bits 2010-08-04 14:20:06

You're welcome.

BalusC 2010-08-04 14:20:35

BTW, jsoup (1.3.1) now deals with that gzipped output correctly when using `Jsoup.connect(url).get();`

Jonathan Hedley 2010-08-23 10:20:50
