I use the following Python code to download web pages from servers that use gzip compression:

url = "http://www.v-gn.de/wbb/"
import urllib2
request = urllib2.Request(url)
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
content = response.read()
response.close()

import gzip
from StringIO import StringIO
html = gzip.GzipFile(fileobj=StringIO(content)).read()

This works generally, but for the specified URL fails with a struct.error exception. I get a similar result if I use wget with an "Accept-encoding" header. However, browsers seem to be able to decompress the response.

So my question is: can I get my Python code to decompress this HTTP response without disabling compression by removing the "Accept-encoding" header?

For completeness, here's the line I use for wget:

wget --user-agent="Mozilla" --header="Accept-Encoding: gzip,deflate" http://www.v-gn.de/wbb/
+2  A: 

I ran the command you specified. It downloaded gzip-compressed data into index.html. I renamed index.html to index.html.gz, then tried gzip -d index.html.gz, which led to an error: gzip: index.html.gz: unexpected end of file.

My second try was zcat index.html.gz, which worked fine, except that after the </html> tag it printed the same error as above.

$ zcat index.html.gz
...
  </td>
 </tr>
</table>


</body>
</html>
gzip: index.html.gz: unexpected end of file

The server is faulty: the gzip stream it sends ends before the end-of-file marker that gzip expects.

Notinlist
+3  A: 

It appears you can call readline() on the gzip.GzipFile object, but read() raises a struct.error because the file ends abruptly.

Since readline works (except at the very end), you could do something like this:

import urllib2
import StringIO
import gzip
import struct

url = "http://www.v-gn.de/wbb/"
request = urllib2.Request(url)
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
content = response.read()
response.close()
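fh = StringIO.StringIO(content)
html = gzip.GzipFile(fileobj=fh)
try:
    # Iterating reads line by line, so every complete line is printed
    # before the truncated end of the stream raises struct.error.
    for line in html:
        line = line.rstrip()
        print(line)
except struct.error:
    # The gzip stream ends abruptly; keep what was read so far.
    pass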
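
A possible alternative, sketched in Python 3 rather than the Python 2 used above, is to feed the bytes to a streaming zlib decompressor, which returns whatever it could decode instead of raising when the end of the stream is missing. The function name decompress_truncated_gzip is hypothetical, and the sample payload below is made up to stand in for the server's response; this is an illustration under those assumptions, not a tested fix for that particular server.

```python
import gzip
import zlib


def decompress_truncated_gzip(data):
    """Return as much decompressed data as a possibly truncated gzip stream yields.

    Hypothetical helper for illustration only.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # A streaming decompressor hands back the bytes it managed to decode;
    # running out of input before the trailer is not treated as an error.
    return decompressor.decompress(data)


# Made-up payload standing in for the downloaded response body.
original = b"<html><body>hello</body></html>\n" * 50
complete = gzip.compress(original)

# Simulate the faulty server by dropping the 8-byte gzip trailer
# (CRC32 and uncompressed length).
truncated = complete[:-8]
recovered = decompress_truncated_gzip(truncated)
print(recovered == original)
```

If the stream is cut off earlier, inside the compressed data itself, the same call returns only the leading part of the page that could be decoded.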
unutbu