It fails with this error when I run my script in Eclipse or in IPython:

'ascii' codec can't decode byte 0xe2 in position 32: ordinal not in range(128) 

I don't know why, but when I simply execute the feedparser.parse(url) statement with the same URL, no error is thrown. This is stumping me big time.

The code is as simple as:

    try:
        d = feedparser.parse(url)
    except Exception, e:
        logging.error('Error while retrieving feed.')
        logging.error(e)
        logging.error(formatExceptionInfo(None))
        logging.error(formatExceptionInfo1())

Here is the stack trace:

d = feedparser.parse(url)

  File "C:\Python26\lib\site-packages\feedparser.py", line 2623, in parse
    feedparser.feed(data)
  File "C:\Python26\lib\site-packages\feedparser.py", line 1441, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "C:\Python26\lib\sgmllib.py", line 104, in feed
    self.goahead(0)
  File "C:\Python26\lib\sgmllib.py", line 143, in goahead
    k = self.parse_endtag(i)
  File "C:\Python26\lib\sgmllib.py", line 320, in parse_endtag
    self.finish_endtag(tag)
  File "C:\Python26\lib\sgmllib.py", line 360, in finish_endtag
    self.unknown_endtag(tag)
  File "C:\Python26\lib\site-packages\feedparser.py", line 476, in unknown_endtag
    method()
  File "C:\Python26\lib\site-packages\feedparser.py", line 1318, in _end_content
    value = self.popContent('content')
  File "C:\Python26\lib\site-packages\feedparser.py", line 700, in popContent
    value = self.pop(tag)
  File "C:\Python26\lib\site-packages\feedparser.py", line 641, in pop
    output = _resolveRelativeURIs(output, self.baseuri, self.encoding)
  File "C:\Python26\lib\site-packages\feedparser.py", line 1594, in _resolveRelativeURIs
    p.feed(htmlSource)
  File "C:\Python26\lib\site-packages\feedparser.py", line 1441, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "C:\Python26\lib\sgmllib.py", line 104, in feed
    self.goahead(0)
  File "C:\Python26\lib\sgmllib.py", line 138, in goahead
    k = self.parse_starttag(i)
  File "C:\Python26\lib\sgmllib.py", line 296, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "C:\Python26\lib\sgmllib.py", line 338, in finish_starttag
    self.unknown_starttag(tag, attrs)
  File "C:\Python26\lib\site-packages\feedparser.py", line 1588, in unknown_starttag
    attrs = [(key, ((tag, key) in self.relative_uris) and self.resolveURI(value) or value) for key, value in attrs]
  File "C:\Python26\lib\site-packages\feedparser.py", line 1584, in resolveURI
    return _urljoin(self.baseuri, uri)
  File "C:\Python26\lib\site-packages\feedparser.py", line 286, in _urljoin
    return urlparse.urljoin(base, uri)
  File "C:\Python26\lib\urlparse.py", line 215, in urljoin
    params, query, fragment))
  File "C:\Python26\lib\urlparse.py", line 184, in urlunparse
    return urlunsplit((scheme, netloc, url, query, fragment))
  File "C:\Python26\lib\urlparse.py", line 192, in urlunsplit
    url = scheme + ':' + url
  File "C:\Python26\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)

PARTIALLY SOLVED:

This is reproducible when the URL passed to feedparser.parse() is a unicode string. It does not reproduce when the URL is a plain ASCII byte string. And for the record, you also need a feed that contains some non-ASCII (high-codepoint) characters. I am not sure why this is.
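Since the failure only shows up when the URL object is unicode, one workaround is to down-convert the URL to a byte string before handing it to feedparser. A minimal sketch (the helper name is my own, and it assumes a plain-ASCII URL; written in Python 3-compatible syntax, though the thread targets Python 2.6):

```python
def as_byte_url(url):
    # feedparser (via sgmllib/urlparse) chokes when given a unicode URL
    # and a feed body containing non-ASCII bytes, so hand it a byte
    # string instead. Hypothetical helper, not part of feedparser.
    if isinstance(url, bytes):
        return url
    return url.encode("ascii")  # safe for plain-ASCII URLs only
```

Then `d = feedparser.parse(as_byte_url(u'http://myfeed.blah/xml'))` passes a byte string (a Python 2 `str`) to the parser.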

+1  A: 

Looks like the URL that is giving you problems points to text in some single-byte encoding (such as latin-1, where 0xe2 would be "â", a lowercase a with a circumflex) served without a proper content-type header (it should have a charset= parameter in Content-Type: but doesn't).

If that is the case, feedparser cannot guess the encoding, tries the default (ascii), and fails.

This part of feedparser's docs (the "Character Encoding Detection" section) explains the issue in more detail.

Unfortunately there is no "magic bullet" for this general issue (due to bozos that break the XML rules). You could catch this exception, and in the handler fetch the URL's contents separately (with urllib2), try decoding them with various plausible encodings, and once you finally have a usable unicode object, feed that to feedparser.parse (whose first argument can be a URL, a file stream, or a string containing the feed data).
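The try-several-encodings fallback described above can be sketched like this (the helper name and the candidate encoding list are my own choices, in Python 3-compatible syntax):

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "iso-8859-1")):
    # Try each candidate encoding in turn and return the first clean decode.
    # iso-8859-1 maps every possible byte, so the loop normally cannot
    # fall through to the last resort below.
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", "replace")  # last resort: replacement chars
```

Fetch the raw bytes yourself (e.g. `urllib2.urlopen(url).read()` on Python 2.6), decode them with such a helper, then pass the resulting unicode string to feedparser.parse().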

Alex Martelli
`E2` could also be the first byte of a UTF-8 encoded character. For example, the left curly-quote (U+2018) is `E2 80 98` in UTF-8.
Alan Moore
The error is occurring in the `cp1252` codec -- looks like it's attempting to decode a unicode object.
John Machin
Why is it only happening when the url string itself is unicode? Why should that matter? When the url string is ascii, the problem is gone. I don't see what encoding the URL string itself has to do with the document parsing and encoding.
Rhubarb
@Rhubarb: we don't know why it is only happening when the url string itself is whatever you are calling "unicode", and we'll never know if you don't tell us what the ferschlugginer url string is (type and repr).
John Machin
@Rhubarb, URLs must be byte strings. So, `feedparser.parse(url.encode('latin-1'))` or the like might solve your problem.
Alex Martelli
@John Machin: Try any url literal, such as u'http://myfeed.blah/xml'. It should reproduce.
Rhubarb
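If the URL itself may contain non-ASCII characters, a bare `url.encode('ascii')` would fail. One way to still get a pure-ASCII byte string, sketched here with an invented helper around Python 3's `urllib.parse.quote` (`urllib.quote` on Python 2), is to percent-encode the non-ASCII parts first:

```python
from urllib.parse import quote  # urllib.quote on Python 2


def iri_to_ascii_url(iri):
    # Percent-encode anything outside ASCII (UTF-8 based), leaving the
    # usual URL delimiters intact, then down-convert to a byte string.
    # Hypothetical helper, not part of feedparser or the stdlib.
    return quote(iri, safe=":/?#[]@!$&'()*+,;=%").encode("ascii")
```

An all-ASCII URL passes through unchanged, while e.g. an `é` in the path becomes `%C3%A9`.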
+1  A: 

With reference to the OP's comment, "Try any url literal, such as u'myfeed.blah/xml'. It should reproduce":

>>> from pprint import pprint as pp
>>> import feedparser

>>> d = feedparser.parse(u'myfeed.blah/xml')
>>> pp(d)
{'bozo': 1,
 'bozo_exception': SAXParseException('not well-formed (invalid token)',),
 'encoding': 'utf-8',
 'entries': [],
 'feed': {},
 'namespaces': {},
 'version': ''}

>>> d = feedparser.parse(u'http://myfeed.blah/xml')
>>> pp(d)
{'bozo': 1,
 'bozo_exception': URLError(gaierror(11001, 'getaddrinfo failed'),),
 'encoding': 'utf-8',
 'entries': [],
 'feed': {},
 'version': None}

>>> d = feedparser.parse("http://feedparser.org/docs/examples/atom10.xml")
>>> d['bozo']
0
>>> d['feed']['title']
u'Sample Feed'

>>> d = feedparser.parse(u"http://feedparser.org/docs/examples/atom10.xml")
>>> d['bozo']
0
>>> d['feed']['title']
u'Sample Feed'
>>>

Please stop thrashing about; provide a URL that actually causes the problem.

John Machin