I am using urlfetch to fetch a URL. When I pass the result to the html2text function (which strips off all HTML tags), I get the following message:

UnicodeEncodeError: 'charmap' codec can't encode characters in position  ... character maps to <undefined>

I've tried calling encode('UTF-8', 'ignore') on the string, but I keep getting this error.

Any ideas?

Thanks,

Joel


Some code:

result = urlfetch.fetch(url="http://www.google.com")
html2text(result.content.encode('utf-8', 'ignore'))

And the error message:

File "C:\Python26\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 159-165: character maps to <undefined>
A:

You need to decode the data you fetched first. Which codec to use depends on the website you fetch.

Once you have unicode and encode it with some_unicode.encode('utf-8', 'ignore'), I can't imagine how it could throw an error.

OK, here is what you need to do:

result = urlfetch.fetch('http://google.com')
content_type = result.headers['Content-Type']  # e.g. 'text/html; charset=ISO-8859-1'
ctype, charset = content_type.split(';')  # assumes exactly one parameter, the charset
encoding = charset.split('=', 1)[1].strip()  # get the encoding name
print(encoding)  # e.g. ISO-8859-1
utext = result.content.decode(encoding)  # now you have unicode
text = utext.encode('utf-8', 'ignore')  # encode to UTF-8

This is not really robust but it should show you the way.
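A somewhat more forgiving version of the header parsing above could look like the sketch below. The helper names charset_from_content_type and decode_body are made up for illustration, and the 'utf-8' default is an assumption, not something the server guarantees. It works on the raw bytes and the Content-Type value only, so it does not depend on how the page was fetched.

```python
def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or default."""
    if not content_type:
        return default
    # Parameters follow the media type, separated by semicolons.
    for param in content_type.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            # The value may be quoted, e.g. charset="utf-8".
            return value.strip().strip("\"'")
    return default


def decode_body(body, content_type, default="utf-8"):
    """Decode raw response bytes to text using the header's charset."""
    encoding = charset_from_content_type(content_type, default)
    try:
        return body.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name, or bytes that do not match it: fall back to
        # the default and replace undecodable bytes instead of raising.
        return body.decode(default, "replace")
```

With this, something like decode_body(result.content, result.headers.get('Content-Type')) would replace the split-and-slice lines, assuming the headers object supports get().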

THC4k
Sorry, I meant decode. My mistake!
Joel
How do I know which codec I need to use? Say, for google.com.
Joel
@Joel: The codec you need to decode with is either in the HTTP headers or in the HTML meta tag (or it is unspecified, in which case you have to guess). Google is a bad example for this, because the website you get depends on where you live :p
THC4k
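For the meta tag case mentioned in the comment above, a rough sketch might be the following. The name charset_from_meta and the 2048-byte scan window are assumptions for illustration; a regex like this is only a heuristic and not a real HTML parser.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET_RE = re.compile(
    br"""<meta[^>]*?charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def charset_from_meta(body, scan_bytes=2048):
    """Return the charset declared in an HTML meta tag, or None."""
    # Only the start of the document is scanned, since the declaration
    # is expected near the top of <head>.
    match = META_CHARSET_RE.search(body[:scan_bytes])
    if match is None:
        return None
    return match.group(1).decode("ascii")
```

A None result means neither form of the tag was found near the top of the page, which is the "you have to guess" case.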
Edited post with some code
Joel
Thanks, it's working!
Joel