Hi,

I'm using the Python Shell in this way:

>>> s = 'Ã'
>>> s
'\xc3'

How can I print the variable s so that it shows the character Ã? That is the first and easiest question. What I'm really doing is fetching the content of a web page that has non-ASCII characters like the one above, and others with accents or tildes such as á, é, í, ñ, etc. I'm also trying to run a regex whose pattern contains these characters against the content of the web page.

How can I solve this problem?

This is an example of one regex:

u'<td[^>]*>\s*Definición\s*</td><td class="value"[^>]*>\s*(?P<data>[\w ,-:\.\(\)]+)\s*</td>'

If I use the Expresso application, it works fine.

EDIT [05/26/2009 16:38]: Sorry about my explanation. I'll try to explain better.

I have to get some text from a page. I have the URL of that page and I have the regex to get that text. My first thought was that the regex was wrong. I checked it with Expresso and it works fine; I got the text I wanted. My second thought was to print the content of the page, and that was when I saw that the content was not what I see in the source code of the web page. The differences are the non-ASCII characters like á, é, í, etc. Now I don't know what to do, or whether the problem is in the encoding of the page content or in the pattern text of the regex. One of the regexes I've defined is the one above.

The question would be: is there any problem with using a regex whose pattern text contains non-ASCII characters?

+2  A: 

How can I print s variable to show the character Ã???
Use print:

>>> s = 'Ã'
>>> s
'\xc3'
>>> print s
Ã
jcoon
Apparently he or she can't know in advance the encoding, so I think it should be converted to Unicode first (see my answer).
Bastien Léonard
It works, but how can I do it if I get the content of a web page in this way?

def getUrlContent(url):
    """ Gets the html content of an url """
    socket = urllib2.urlopen(url).fp
    html = urllib.unquote(socket.read())
    socket.close()
    return html
jaloplo
+1  A: 

I would use ord() to find out if a character is ASCII/special:

if ord(c) > 127:
    # special character

This probably won't work with multibyte encodings such as UTF-8. In this case, I would convert to Unicode before testing.
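A minimal sketch of that check on decoded text, assuming the bytes are UTF-8 (an assumption; the real encoding of the page is not known here). The helper name `has_non_ascii` and the sample bytes are hypothetical:

```python
def has_non_ascii(text):
    """Return True if any character of an already-decoded string is outside ASCII."""
    return any(ord(c) > 127 for c in text)

# Hypothetical sample: in UTF-8 the character 'ó' takes two bytes (0xC3 0xB3).
raw = b'Definici\xc3\xb3n'
text = raw.decode('utf-8')   # decoded: 'ó' is now a single character

raw_length = len(raw)        # counts bytes
text_length = len(text)      # counts characters
found = has_non_ascii(text)
```

Testing the raw bytes one by one would flag both bytes of 'ó' separately, which is why decoding first is the safer order.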

If you get special characters from a web page, you need to know its encoding. Then decode the content; see the Unicode HOWTO.
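As a rough sketch of the decoding step, assuming the server declares a charset in its Content-Type header: the function name `decode_page`, the `fallback` parameter, and the sample bytes are all made up for illustration, and an unknown charset name would still raise LookupError.

```python
import re

def decode_page(raw_bytes, content_type, fallback='utf-8'):
    """Decode page bytes using the charset named in a Content-Type header value.

    Falls back to `fallback` (an assumed default) when no charset is declared.
    """
    match = re.search(r'charset="?([\w-]+)', content_type or '', re.IGNORECASE)
    encoding = match.group(1) if match else fallback
    # 'replace' substitutes U+FFFD for undecodable bytes instead of raising.
    return raw_bytes.decode(encoding, 'replace')

# Hypothetical response body and header value.
html = decode_page(b'<td>Definici\xf3n</td>', 'text/html; charset=ISO-8859-1')
no_header = decode_page(b'\xc3\xb1', None)
```

With urllib2, the header value would come from the response's headers; that part is left out here so the sketch does not depend on a network call.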

Edit: I'm definitely not sure what this question is about... It may be a good idea to clarify it.

Bastien Léonard
How can I know the encoding of a web page?
jaloplo
That's not so trivial when the HTML does not explicitly state its encoding. However, there are tools to guess the encoding, e.g. jchardet: http://jchardet.sourceforge.net/. Another brute-force method is to iterate over all the encodings provided by the ``iconv`` utility.
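A brute-force attempt could be sketched in Python roughly like this. The function name `guess_decode` and the candidate list are assumptions, not a standard API, and a successful decode does not prove the guess is right:

```python
def guess_decode(raw_bytes, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Try each candidate encoding in order and return (text, encoding).

    latin-1 accepts every byte sequence, so it is placed last as a catch-all.
    """
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit')

# Hypothetical samples: the same word encoded as UTF-8 and as a single-byte encoding.
utf8_result = guess_decode(b'Definici\xc3\xb3n')
single_byte_result = guess_decode(b'Definici\xf3n')
```

Strict encodings such as UTF-8 go first because they reject most byte sequences that were produced by another encoding.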
The MYYN
+2  A: 

Suppose you want to print it as UTF-8. Before Python 3, the best approach is to encode it explicitly:

print u'Ã'.encode('utf-8')

If you get the text from an external source, you have to decode it explicitly with decode('utf-8'), for example:

f = open(my_file)  # my_file: path to a UTF-8 encoded text file
a = f.next().decode('utf-8') # you have a unicode line in a
print a.encode('utf-8')
f.close()
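The same decode-first idea should carry over to the regex in the question. Here is a sketch, assuming the page is UTF-8; the sample HTML is invented for illustration. The pattern is the one from the question, written as a unicode literal with doubled backslashes and `\xf3` for 'ó', and compiled with re.UNICODE so that `\w` also matches accented letters:

```python
import re

# Pattern from the question; u'\xf3' is 'ó'.
pattern = re.compile(
    u'<td[^>]*>\\s*Definici\xf3n\\s*</td>'
    u'<td class="value"[^>]*>\\s*(?P<data>[\\w ,-:\\.\\(\\)]+)\\s*</td>',
    re.UNICODE)

# Hypothetical page content as raw bytes, assumed to be UTF-8.
raw = (b'<td>Definici\xc3\xb3n</td>'
       b'<td class="value">Texto de ejemplo, con \xc3\xb1</td>')

html = raw.decode('utf-8')   # match unicode against unicode, not against bytes
match = pattern.search(html)
data = match.group('data') if match else None
```

Searching the undecoded bytes with this pattern would not find 'ó', because in UTF-8 it is stored as two bytes rather than one character.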
odwl