I have a bunch of HTML files that I downloaded using the httplib2 package in Python. The '&nbsp;' entities are showing up as 'Â '.

<font color="#ff0000">02/12/2004Â </font> is showing while <font color="#ff0000">02/12/2004&nbsp;</font> is the desired format.

How do I replace the 'Â ' with '&nbsp;' in Python? Thanks a lot!

A: 
s = s.replace('Â ', '&nbsp;')

However, while I haven't used httplib2, I'm pretty sure something is wrong if the source of the HTML files changes when you download them. It may be a decoding problem. What version of Python are you using? If it's Python 3, the contents will be byte sequences, not strings, so you'll have to specify the right encoding to decode the bytes with.

http://code.google.com/p/httplib2/wiki/ExamplesPython3
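
A minimal sketch of the decoding idea, assuming the downloaded pages are UTF-8 encoded (the question doesn't say, but 'Â ' is what the UTF-8 bytes 0xC2 0xA0 of a non-breaking space look like when read as Latin-1). The sample bytes and the name `restore_nbsp` are made up for illustration:

```python
# Hypothetical sample: the raw bytes of a downloaded page, assumed to be UTF-8.
raw = b'<font color="#ff0000">02/12/2004\xc2\xa0</font>'


def restore_nbsp(data, encoding='utf-8'):
    """Decode raw bytes, then turn U+00A0 back into the &nbsp; entity."""
    text = data.decode(encoding)  # bytes -> unicode text
    return text.replace(u'\u00a0', u'&nbsp;')


fixed = restore_nbsp(raw)
print(fixed)  # <font color="#ff0000">02/12/2004&nbsp;</font>
```

If the pages turn out to use a different encoding, pass that as `encoding` instead.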

EDIT: If you aren't limited to using just httplib2, perhaps you could try looking into using the urllib, urllib2, or httplib modules that are part of the Python 2.6 standard library?

JAB
I am using Python 2.6.
ThinkCode
No go. I get the following error: SyntaxError: Non-ASCII character '\xc3' in file. I used content.replace('Â ', ' ') in my Python program. Thanks.
ThinkCode
Since you're working with a version of Python 2, you may have to use a unicode string to hold 'Â '. I got into Python several months after 3 came out, so I've mainly had experience with that.
JAB
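
One way to sidestep the SyntaxError above is to keep the source file pure ASCII and write the offending bytes as escapes. This is a sketch, assuming `content` is the byte string returned by the download and that the page is UTF-8. The sample value is invented:

```python
# Hypothetical stand-in for the downloaded page body (a byte string).
content = b'<font color="#ff0000">02/12/2004\xc2\xa0</font>'

# b'\xc2\xa0' is the UTF-8 encoding of a non-breaking space. Spelling it with
# escapes means the source file itself contains no non-ASCII characters.
fixed = content.replace(b'\xc2\xa0', b'&nbsp;')
print(fixed.decode('ascii'))  # <font color="#ff0000">02/12/2004&nbsp;</font>
```

The other route is to declare the file's encoding at the top (`# -*- coding: utf-8 -*-`, per PEP 263), but a literal 'Â ' typed into the source may still not match the bytes that are actually in the page.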
A: 
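import string
filtered_content = ''.join(c for c in content if c in string.printable)  # keeps printable ASCII only, so the stray characters are dropped rather than replaced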

This solved my problem. Thank you!

ThinkCode
This worked for me with the same problem. Nice.
AP257