I have a string that looks like so:

6Â 918Â 417Â 712

The clear-cut way to trim this string (as I understand Python) is simply to call replace. Say the string is in a variable called s; we get:

s.replace('Â ', '')

That should do the trick. But of course Python complains that there is a non-ASCII character '\xc2' in file blabla.py and no encoding is declared.

I never quite could understand how to switch between different encodings.

Appreciate some help.

EDIT:

Here's the code, it really is just the same as above, but now it's in context.

The file is saved as UTF-8 in notepad. The file has the following header:

#!/usr/bin/python2.4
# -*- coding: utf-8 -*-

The code:

f = urllib.urlopen(url)
soup = BeautifulSoup(f)
s = soup.find('div', {'id':'main_count'})
# Printing s here works fine; it shows 6Â 918Â 417Â 712
s.replace('Â ','')
save_main_count(s)


It gets no further than s.replace...

+1  A: 
s.replace(u'Â ', '')              # u before string is important

and save your .py file in a Unicode encoding such as UTF-8.

SilentGhost
+6  A: 
>>> unicode_string = u"hello aåbäcö"
>>> unicode_string.encode("ascii", "ignore")
'hello abc'
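
The UnicodeDecodeError reported in the comment below is what Python 2 raises when .encode is called on a byte string rather than a unicode string, because it first tries an implicit ASCII decode. A minimal sketch of decoding explicitly first, assuming the scraped value is UTF-8 bytes with non-breaking spaces (0xC2 0xA0) between the digit groups; the thread does not confirm this, though it fits "byte 0xc2 in position 1". The names raw, text and ascii_only are made up for illustration, and on Python 2.4 the b prefix would have to be dropped.

```python
# Assumption: the raw value is UTF-8 bytes with non-breaking spaces (0xC2 0xA0).
raw = b"6\xc2\xa0918\xc2\xa0417\xc2\xa0712"

# Decode explicitly first, so that .encode() runs on a unicode string.
text = raw.decode("utf-8")

# Now "ignore" can silently drop everything that is not ASCII.
ascii_only = text.encode("ascii", "ignore")

print(repr(ascii_only))
```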
truppo
I see the votes you get, but when I try it, it says: Nope. UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128). Could it be that my original string is not in unicode? Well, in any case, it needs
adergaard
+3  A: 
#!/usr/bin/env python
# -*- coding: utf-8 -*-

s = u"6Â 918Â 417Â 712"
s = s.replace(u"Â", "") 
print s

This will print out 6 918 417 712

DoR
Nope. UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128). Could it be that my original string is not in unicode? Well, in any case, I'm probably doing something wrong.
adergaard
@adergaard, did you add # -*- coding: utf-8 -*- at the top of the source file?
Nadia Alramli
Yes, see the top of this page again. I've edited the question and put in the code and the header comments. Thanks for your assistance.
adergaard
I think you will have to figure out how to get the strings from the html or xml document in unicode. More info on that here: http://diveintopython.org/xml_processing/unicode.html
DoR
+8  A: 
  • You need a declaration at the top of each source file that uses Unicode literals.

See: http://docs.python.org/tutorial/interpreter.html#source-code-encoding

Assuming utf8, this would go at the top:

# -*- coding: utf-8 -*-
  • The source file must be saved using the correct encoding in your text editor as well.

  • The literal must have a u before it, as in s.replace(u"Â ", "")

  • The string s must be a unicode string as well. BeautifulSoup might not be returning unicode here. Try s = s.decode('utf-8')

  • string.replace returns a new string and does not edit in place, so make sure you're using the return value as well
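
Put together, the checklist above might look roughly like this. It is only a sketch: the byte string stands in for whatever BeautifulSoup hands back, and it assumes that value is UTF-8 with non-breaking spaces (U+00A0) as separators. The name raw is hypothetical. On Python 2.4 the b prefix would have to be dropped.

```python
# -*- coding: utf-8 -*-

# Hypothetical stand-in for the scraped value (assumed UTF-8 bytes).
raw = b"6\xc2\xa0918\xc2\xa0417\xc2\xa0712"

# Make s a unicode string before touching it.
s = raw.decode("utf-8")

# Unicode literals on both sides, and keep the return value:
# replace() builds a new string and leaves the old one unchanged.
s = s.replace(u"\xa0", u"")

print(s)
```

Writing the separator as the escape u"\xa0" avoids depending on how the editor saved the source file.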

Jason S
Thank you very much. This finally did the trick. Thanks ALL for this very helpful forum. Fantastic.
adergaard
+7  A: 
def removeNonAscii(s): return "".join(filter(lambda x: ord(x)<128, s))

edit: my first impulse is always to use a filter, but the generator expression is more memory efficient (and shorter)...

def removeNonAscii(s): return "".join(i for i in s if ord(i)<128)
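
A short usage sketch of the generator version, assuming the input is already a unicode string with non-breaking spaces (U+00A0) between the groups. The sample value and the snake_case name are made up for illustration.

```python
def remove_non_ascii(text):
    # Keep only characters whose code point is in the ASCII range.
    return u"".join(ch for ch in text if ord(ch) < 128)

# Hypothetical sample: digits separated by non-breaking spaces.
sample = u"6\xa0918\xa0417\xa0712"
cleaned = remove_non_ascii(sample)

print(cleaned)  # 6918417712
```

Note that this drops the separators as well, which is fine when the goal is a bare number.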
fortran
Let me just say that this is an excellent answer as well! This will save me a lot of time in the future. Thanks very much. I wish there was some way of accepting TWO answers.
adergaard
you're welcome :-)
fortran
A: 

This is a dirty hack, but may work.

s2 = ""
for i in s:
    if ord(i) < 128:
        s2 += i
Corey D
A: 

Using Regex:

import re

strip_unicode = re.compile("([^-_a-zA-Z0-9!@#%&=,/'\";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\\]+|[^\s]+)")
print strip_unicode.sub('', u'6Â 918Â 417Â 712')
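
The pattern above lists the characters to keep. A narrower alternative, offered as a sketch and not as the answerer's method, is to match only what lies outside the ASCII range. The names non_ascii, sample and cleaned are hypothetical, and the sample assumes non-breaking spaces (U+00A0) as separators.

```python
import re

# Match one or more characters outside the ASCII range 0x00-0x7F.
non_ascii = re.compile(r"[^\x00-\x7f]+")

# Hypothetical sample: digits separated by non-breaking spaces.
sample = u"6\xa0918\xa0417\xa0712"
cleaned = non_ascii.sub(u"", sample)

print(cleaned)  # 6918417712
```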
Akoi Meexx
Did I seriously get voted down for this? Feel the love...
Akoi Meexx