Hi, I have made some adaptations to the script from this answer, and I am having problems with Unicode: some of the questions end up being written incorrectly.

Some answers and responses end up looking like:

Yeah.. I know.. I’m a simpleton.. So what’s a Singleton? (2)

How can I get ’ translated to the right character?

Note: In case it matters, I'm using Python 2.6 on a French version of Windows.

>>> sys.getdefaultencoding()
'ascii'
>>> sys.getfilesystemencoding()
'mbcs'


EDIT1: Based on Ryan Ginstrom's post, I have been able to correct part of the output, but I am still having problems with Python's Unicode handling.

In IDLE / the Python shell:

Yeah.. I know.. I’m a simpleton.. So what’s a Singleton?

In a text file, when redirecting stdout:

Yeah.. I know.. I’m a simpleton.. So what’s a Singleton?

How can I correct that?
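One plausible reading of the two outputs above, offered as an assumption rather than a diagnosis: the redirected file holds UTF-8 bytes, and the program used to view it decodes them as cp1252 (the usual ANSI code page on Western European Windows). The sketch below reproduces that effect for the right single quotation mark (U+2019) and shows the reverse round trip. The helper names `garble` and `repair` are hypothetical, introduced only for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch: how U+2019 (right single quotation mark) can turn into three
# unrelated characters when UTF-8 bytes are decoded as cp1252.

RIGHT_QUOTE = u"\u2019"


def garble(text):
    """Encode text as UTF-8, then misread those bytes as cp1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text):
    """Undo garble(): recover the original bytes, then decode them as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


garbled = garble(u"I" + RIGHT_QUOTE + u"m")   # the three-character mojibake form
restored = repair(garbled)                    # back to the original text
```

If this assumption holds, the file itself is fine and only needs to be opened as UTF-8; `repair` is only useful when the garbled text has already been stored as characters.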


Edit2: I have tried Jarret Hardie's solution, but it didn't do anything. I am on Windows, using Python 2.6, so my site-packages folder is at:

C:\Python26\Lib\site-packages

There was no siteconfig.py file, so I created one, pasted in the code provided by Jarret Hardie, and started a Python interpreter, but it seems the file was not loaded.

>>> sys.getdefaultencoding()
'ascii'

I noticed there is a site.py file at:

C:\Python26\Lib\site.py

I tried changing the encoding in the function

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !

to set the encoding to utf-8. It worked (after restarting Python, of course).

>>> sys.getdefaultencoding()
'utf-8'

The sad thing is that it didn't correct the characters in my program. :(

+1  A: 

You should be able to convert HTML/XML entities into Unicode characters. Check out this answer on SO:

http://stackoverflow.com/questions/628332/decoding-html-entities-with-python

Basically you want something like this:

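import urllib2

from BeautifulSoup import BeautifulStoneSoup

# URL is a placeholder for the address of the page you want to fetch.
# convertEntities asks the parser to turn HTML/XML entities into Unicode characters.
soup = BeautifulStoneSoup(urllib2.urlopen(URL),
                          convertEntities=BeautifulStoneSoup.ALL_ENTITIES)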
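
If the text has already been extracted and only the entities remain, the standard library can decode them without a parser. This is a minimal sketch, not part of the original script: `decode_entities` is a hypothetical helper name, and the Python 2 fallback assumes that `HTMLParser().unescape` is available in your version.

```python
# -*- coding: utf-8 -*-
# Sketch: decode HTML character references using only the standard library.
try:
    from html import unescape          # Python 3.4 and later
except ImportError:
    # Python 2 fallback (assumption: HTMLParser.unescape exists on your version)
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape


def decode_entities(text):
    """Replace numeric and named character references with real characters."""
    return unescape(text)


sample = u"I&#8217;m a simpleton.. So what&rsquo;s a Singleton?"
decoded = decode_entities(sample)      # both references become U+2019
```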
Ryan Ginstrom
A: 

Does changing your default encoding in sitecustomize.py work?

In your site-packages directory (on my OS X system it's /Library/Python/2.5/site-packages/), create a file called sitecustomize.py. In this file put:

import sys
sys.setdefaultencoding('utf-8')

The setdefaultencoding function is removed from the sys module once site.py has finished running at startup, so the call must go in sitecustomize.py, which Python imports automatically when the interpreter starts up.
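
An alternative that does not depend on the interpreter-wide default is to choose the encoding explicitly wherever text leaves the program. The sketch below is an assumption about what might suit the redirected-output case, not something taken from the original script. It uses `io.open`, which exists in Python 2.6 and later, and the helper names `write_utf8` and `read_utf8` are hypothetical.

```python
# -*- coding: utf-8 -*-
# Sketch: write and read text with an explicit encoding instead of relying
# on sys.getdefaultencoding().
import io


def write_utf8(path, text):
    """Write a Unicode string to path, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def read_utf8(path):
    """Read the file at path back as a Unicode string, decoding UTF-8."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()
```

A file written this way has to be opened as UTF-8 by whatever reads it; a viewer that assumes the Windows ANSI code page will still display garbled quotes.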

Jarret Hardie