Dear Python Experts,

I have written the following trial code to retrieve the titles of legislative acts from the European Parliament.

import urllib2
from BeautifulSoup import BeautifulSoup

search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-%.4d&language=EN"

for number in xrange(1,10):   
    url = search_url % number
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page)
    title = soup.findAll("title")
    print title

However, whenever I run it I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 20, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 70: ordinal not in range(128)

I have narrowed it down to BeautifulSoup not being able to read the fourth document in the loop. Can anyone explain to me what I am doing wrong?

With kind regards

Thomas

+1  A: 

Replacing

print title

with

for t in title:
    print(t)

or

print('\n'.join(t.string for t in title))

works. I'm not entirely sure, however, why print <somelist> sometimes works and sometimes doesn't.

unutbu
Dear Unutbu, thanks for the tips, both work for me. Weird...
Thomas Jensen
+3  A: 

BeautifulSoup works in Unicode, so it's not responsible for that decoding error. More likely, your problem comes with the print statement -- your standard output seems to be in ascii (i.e., sys.stdout.encoding = 'ascii' or absent) and therefore you would indeed get such errors if trying to print a string containing non-ascii characters.
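To illustrate the point about the output encoding, here is a minimal sketch that leaves BeautifulSoup and the network out entirely. It takes the character from the traceback (u'\u2013', an en dash) and checks whether it can be encoded as ASCII and as UTF-8. The helper name `can_encode` and the shortened title string are made up for this example.

```python
# Shortened stand-in for the title of the fourth document, containing U+2013.
title = u"REPORT on equality between women and men in the European Union \u2013 2009"


def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(title, "ascii"))  # False: the en dash is outside ASCII
print(can_encode(title, "utf-8"))  # True: UTF-8 can represent it
```

Printing a Unicode string to an ASCII-only stdout performs the same encode step, which is where the UnicodeEncodeError would surface.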

What's your OS? How is your console AKA terminal set (e.g. if on Windows what "codepage")? Did you set in the environment PYTHONIOENCODING to control sys.stdout.encoding or are you just hoping the encoding will be picked up automatically?
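A quick way to answer these questions is to ask Python directly. The sketch below only reads `sys.stdout.encoding`, whether stdout is attached to a terminal, and the `PYTHONIOENCODING` environment variable; the function name `describe_output_encoding` is hypothetical, and the values it reports will differ from one machine and terminal to the next.

```python
import os
import sys


def describe_output_encoding():
    """Collect the settings that decide how print encodes Unicode text."""
    return {
        # May be None when stdout is a pipe or an editor buffer.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # False when output is redirected or captured by another program.
        "stdout_is_tty": sys.stdout.isatty(),
        # None when the variable is not set in the environment.
        "pythonioencoding": os.environ.get("PYTHONIOENCODING"),
    }


for name, value in sorted(describe_output_encoding().items()):
    print("%s = %r" % (name, value))
```

Running it once from the terminal and once from inside the editor should show whether the two environments report different encodings.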

On my Mac, where the encoding is correctly detected, running your code (save for also printing the number together with each title, for clarity) works fine and shows:

$ python ebs.py 
1 [<title>REPORT Report on the proposal for a Council regulation temporarily suspending autonomous Common Customs Tariff duties on imports of certain industrial products into the autonomous regions of Madeira and the Azores - A7-0001/2010</title>]
2 [<title>REPORT Report on the proposal for a Council directive concerning mutual assistance for the recovery of claims relating to taxes, duties and other measures - A7-0002/2010</title>]
3 [<title>REPORT Report on the proposal for a regulation of the European Parliament and of the Council amending Council Regulation (EC) No 1085/2006 of 17 July 2006 establishing an Instrument for Pre-Accession Assistance (IPA) - A7-0003/2010</title>]
4 [<title>REPORT on equality between women and men in the European Union – 2009 - A7-0004/2010</title>]
5 [<title>REPORT Report on the proposal for a Council decision on the conclusion by the European Community of the Convention on the International Recovery of Child Support and Other Forms of Family Maintenance - A7-0005/2010</title>]
6 [<title>REPORT on the proposal for a Council directive on administrative cooperation in the field of taxation - A7-0006/2010</title>]
7 [<title>REPORT Report on promoting good governance in tax matters - A7-0007/2010</title>]
8 [<title>REPORT Report on the proposal for a Council Directive amending Directive 2006/112/EC as regards an optional and temporary application of the reverse charge mechanism in relation to supplies of certain goods and services susceptible to fraud - A7-0008/2010</title>]
9 [<title>REPORT Recommendation on the proposal for a Council decision concerning the conclusion, on behalf of the European Community, of the Additional Protocol to the Cooperation Agreement for the Protection of the Coasts and Waters of the North-East Atlantic against Pollution - A7-0009/2010</title>]
$ 
Alex Martelli
Hi Alex, I do indeed use a Mac, how have you set up yours? Right now I am just hoping the encoding will be picked up automatically (I am still learning about this whole confusing encoding business:))
Thomas Jensen
@Thomas, I haven't done any setup -- works out of the box (utf8 is the default for Terminal.App, I believe -- if not, then that's the only thing I have set in Terminal's preferences). What's `sys.stdout.encoding` in your Python (indeed, what's your Python and MacOSX? I have OSX 10.5 and it works with Apple-distributed Python 2.5, and python.org-distributed 2.4, 2.6 and 3.1 -- all out of the box and no environment variable setting).
Alex Martelli
Hi Alex, I am using Mac OS X 10.5.8 and Python 2.6.
Thomas Jensen
Just to add to the above: I use aquamacs as my editor
Thomas Jensen
And what does sys.stdout.encoding show in a Py2.6 session? (I don't see how the editor changes things -- or are you running your python code from _inside_ your editor, perchance, rather than on a normal Terminal.App?).
Alex Martelli
If I run sys.std.encoding I get "us-ascii". Thanks for spending your time helping me, Alex!
Thomas Jensen
FWIW, I can reproduce the problem Thomas describes under Ubuntu 9.10, python 2.6.4, and with `sys.stdout.encoding` UTF-8.
unutbu
@Thomas, `us-ascii` (presumably in `sys.stdout.encoding` -- not `sys.std.encoding` as you write) would surely cause horrible problems -- how's the Terminal.App set (assuming that's the output destination)? If utf8 as it should be, not sure why it's not being picked up -- have you tried setting `PYTHONIOENCODING` in the environment?
Alex Martelli
@~unutbu, what terminal emulator, and how set? And is PYTHONIOENCODING set, and, if so, how?
Alex Martelli
@Alex: I get the UnicodeEncodeError with `PYTHONIOENCODING` set to `UTF-8` and also if unset (the default). My terminal emulator is GNOME Terminal 2.28.1 (the default with Ubuntu 9.10). The terminal menu item Terminal>"Set Character Encoding" is set to "Unicode (UTF-8)". Here is the terminal log: http://paste.ubuntu.com/458752/ .
unutbu
A: 

If you want to print the titles to a file, you need to specify an encoding that can represent the non-ASCII character; utf8 should work fine. To do this, add:

import codecs
out = codecs.open('titles.txt', 'w', 'utf8')

at the top of the script

and print to the file:

print >> out, title
Hi Maltjuv, thanks for the help, but it still gives me the same error.
Thomas Jensen
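For the file-based approach, a minimal sketch that writes each title's text individually instead of printing the whole list, similar in spirit to the loop from the first answer. The function name `write_titles` and the sample `titles` list are hypothetical; in the original script the strings would presumably come from `t.string` for each tag returned by `soup.findAll("title")`.

```python
import codecs


def write_titles(titles, path):
    """Write each Unicode title on its own line, encoded as UTF-8."""
    out = codecs.open(path, "w", "utf8")
    try:
        for text in titles:
            out.write(text + u"\n")
    finally:
        out.close()


# Stand-in data containing the en dash (U+2013) from the fourth document.
titles = [
    u"REPORT on equality between women and men in the European Union "
    u"\u2013 2009 - A7-0004/2010",
]
write_titles(titles, "titles.txt")
```

Because the strings are handed to the UTF-8 writer directly, the console's encoding is not involved in this path.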