views:

1855

answers:

5
html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)

Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)
+2  A: 
>>> u'aあä'.encode('ascii', 'ignore')
'a'

EDIT:

Decode the string you get back, using either the charset in the the appropriate meta tag in the response or in the Content-Type header, then encode.

Ignacio Vazquez-Abrams
No, you didn't, because you don't have a unicode like my example shows.
Ignacio Vazquez-Abrams
A bytestring (`str`). http://farmdev.com/talks/unicode/
Ignacio Vazquez-Abrams
OK, my problem changed and I simplified the explanation.
marienbad
A: 

If you have a string line, you can use the .encode([encoding], [errors='strict']) method for strings to convert encoding types.

line = 'my big string'

line.encode('ascii', 'ignore')

For more information about handling ASCII and unicode in Python, this is a really useful site: http://www.amk.ca/python/howto/unicode

Jama22
+2  A: 

I use this helper function throughout all of my projects. If it can't convert the unicode, it ignores it. This ties into a django library, but with a little research you could bypass it.

from django.utils import encoding

def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string(u'ni\xf1era')
    'niera'
    """
    return encoding.smart_str(x, encoding='ascii', errors='ignore')

I no longer get any unicode errors after using this.

Gattster
That is SUPPRESSING the problem, not diagnosing and fixing. It's like saying "After I cut my feet off, I no longer have problems with corns and bunions".
John Machin
I agree it's suppressing the problem. It seems like that is what the question is after though. Look at his note: "Can I just drop whatever code bytes are causing the problem instead of getting an error?"
Gattster
And the answer to that kind of question should be a resounding **NO!!** He already has an error, ignoring it is worse!!
John Machin
+3  A: 

You wrote """I assume that means the HTML contains some wrongly-formed attempt at unicode somewhere."""

The HTML is NOT expected to contain any kind of "attempt at unicode", well-formed or not. It must of necessity contain Unicode characters encoded in some encoding, which is usually supplied up front ... look for "charset".

You appear to be assuming that the charset is UTF-8 ... on what grounds? The "\xA0" byte that is shown in your error message indicates that you may have a single-byte charset e.g. cp1252.

If you can't get any sense out of the declaration at the start of the HTML, try using chardet to find out what the likely encoding is.

Why have you tagged your question with "regex"?

Update after you replaced your whole question with a non-question:

html = urllib.urlopen(link).read()
# html refers to a str object. To get unicode, you need to find out
# how it is encoded, and decode it.

html.encode("utf8","ignore")
# problem 1: will fail because html is a str object;
# encode works on unicode objects so Python tries to decode it using 
# 'ascii' and fails
# problem 2: even if it worked, the result will be ignored; it doesn't 
# update html in situ, it returns a function result.
# problem 3: "ignore" with UTF-n: any valid unicode object 
# should be encodable in UTF-n; error implies end of the world,
# don't try to ignore it. Don't just whack in "ignore" willy-nilly,
# put it in only with a comment explaining your very cogent reasons for doing so.
# "ignore" with most other encodings: error implies that you are mistaken
# in your choice of encoding -- same advice as for UTF-n :-)
# "ignore" with decode latin1 aka iso-8859-1: error implies end of the world.
# Irrespective of error or not, you are probably mistaken
# (needing e.g. cp1252 or even cp850 instead) ;-)
John Machin
+2  A: 

Can we get the actual value used for link?

In addition, we usually encounter this problem here when we are trying to .encode() an already encoded byte string. So you might try to decode it first as in

html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")

As an example:

html = '\xa0'
encoded_str = html.encode("utf8")

Fails with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

While:

html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = html.encode("utf8")

Succeeds without error. Do note that "windows-1252" is something I used as an example. I got this from chardet and it had 0.5 confidence that it is right! (well, as given with a 1-character-length string, what do you expect) You should change that to the encoding of the byte string returned from .urlopen().read() to what applies to the content you retrieved.

Another problem I see there is that the .encode() string method returns the modified string and does not modify the source in place. So it's kind of useless to have self.response.out.write(html) as html is not the encoded string from html.encode (if that is what you were originally aiming for).

As Ignacio suggested, check the source webpage for the actual encoding of the returned string from read(). It's either in one of the Meta tags or in the ContentType header in the response. Use that then as the parameter for .decode().

Do note however that it should not be assumed that other developers are responsible enough to make sure the header and/or meta character set declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before).

Vin-G