Hello:

I am having some trouble with a very basic string issue in Python (that I can't figure out). Basically, I am trying to do the following:

# Read file into a string
myString = file.read()

# Attempt to remove non-breaking spaces
myString = myString.replace("\u00A0", " ")

# However, when I print my string to the console, I get:
Foo **<C2><A0>** Bar

I thought that "\u00A0" was the escape code for the Unicode non-breaking space, but apparently I am not doing this properly. Any ideas on what I am doing wrong?

A: 

Nothing in what you write indicates that you're necessarily doing anything wrong: if the original string had a non-breaking space between 'Foo' and 'Bar', you now have a normal space there instead. This assumes that at some point you've decoded your input string into a Unicode string (I imagine the input is a bytestring, unless you're on Python 3 or the file was opened with codecs.open). Otherwise the replace is unlikely to find a Unicode character in a non-Unicode string of bytes. But still, there are no clear indications of problems in what you write.

Can you clarify what's the input (print repr(myString) just before the replace) and what's the output (print repr(myString) again just after the replace) and why you think that's a problem? Without the repr, strings that are actually different might look the same, but repr helps there.
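To illustrate why repr helps here, a minimal sketch in Python 3 syntax (the original question appears to be on Python 2, where you would write print repr(myString) instead). The variable names are made up for the example.

```python
# Two strings that print almost identically but differ in content.
with_nbsp = "Foo\u00a0Bar"   # U+00A0 NO-BREAK SPACE between the words
with_space = "Foo Bar"       # ordinary U+0020 SPACE

print(with_nbsp)    # looks like "Foo Bar" on most terminals
print(with_space)

# repr exposes the difference by escaping the non-breaking space.
print(repr(with_nbsp))    # 'Foo\xa0Bar'
print(repr(with_space))   # 'Foo Bar'

# The same text as UTF-8 bytes, which is what an undecoded read gives you.
raw = with_nbsp.encode("utf-8")
print(repr(raw))          # b'Foo\xc2\xa0Bar'
```

The C2 A0 pair in the last line is the same pair of bytes that shows up as <C2><A0> in the question's console output.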

Alex Martelli
+2  A: 

No, u"\u00A0" is the escape code for non-breaking spaces. "\u00A0" is 6 characters that are not any sort of escape code. Read this.
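A rough way to see the difference on Python 3, where every str literal already interprets \u escapes: here a raw string stands in for the Python 2 plain "..." literal, which is an assumption made only so the sketch runs on a current interpreter.

```python
one_char = "\u00A0"     # escape is interpreted: a single non-breaking space
six_chars = r"\u00A0"   # raw literal: backslash, 'u', '0', '0', 'A', '0'

print(len(one_char))    # 1
print(len(six_chars))   # 6

text = "Foo\u00a0Bar"

# Searching for the six literal characters finds nothing, so nothing changes.
print(text.replace(six_chars, " ") == text)   # True

# Searching for the real character replaces it with an ordinary space.
print(text.replace(one_char, " "))            # Foo Bar
```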

Ignacio Vazquez-Abrams
Thanks for that link Ignacio!
dontsaythekidsname
The link you provided might be good for a beginner but it is misleading. It completely neglects Unicode normalization e.g., `'ć'` is `u'\u0107'` and it could be represented as `u'c\u0301'` http://unicode.org/reports/tr15/
J.F. Sebastian
+2  A: 

You don't have a unicode string, but a UTF-8 encoded sequence of bytes (which is what strings are in Python 2.x).

Try

myString = myString.replace("\xc2\xa0", " ")

Better would be to switch to unicode; see this article for ideas. Thus you could say

uniString = unicode(myString, "UTF-8")
uniString = uniString.replace(u"\u00A0", " ")

and it should also work (caveat: I don't have Python 2.x available right now), although you will need to translate it back to bytes (binary) when sending it to a file or printing it to a screen.
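For anyone on Python 3, the same decode, replace, encode round trip might look roughly like this. It is a sketch, not tested against the original file: replace_nbsp is a made-up helper name, and the UTF-8 default is an assumption about the input.

```python
def replace_nbsp(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Decode bytes, swap non-breaking spaces for plain spaces, re-encode."""
    text = raw.decode(encoding)          # bytes -> str (Unicode)
    text = text.replace("\u00a0", " ")   # replace the actual character
    return text.encode(encoding)         # str -> bytes for writing out


raw = b"Foo \xc2\xa0 Bar"   # UTF-8 bytes containing one non-breaking space

print(replace_nbsp(raw))                 # b'Foo   Bar'

# The byte-level replacement from earlier in this answer gives the same
# result for UTF-8 input, but only works for that one encoding.
print(raw.replace(b"\xc2\xa0", b" "))    # b'Foo   Bar'
```

Working on the decoded text keeps the replacement independent of the encoding, whereas the byte-level version hard-codes the UTF-8 representation of U+00A0.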

Kathy Van Stone