ansaurus

Question

Convert Unicode to String in Python (containing extra symbols)

Answer 1

+7 A:

>>> import unicodedata
>>> title = u"Klüft skräms inför på fédéral électoral große"
>>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'
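For readers on Python 3, the same idea might look roughly like the sketch below. It assumes Python 3, where encode returns bytes, so a final decode is added to get text back. The helper name strip_accents is hypothetical. Characters with no ASCII decomposition (such as ß) are dropped, not transliterated.

```python
import unicodedata

def strip_accents(text):
    """Decompose accented characters (NFKD), then drop anything non-ASCII.

    Hypothetical helper for illustration; it loses information on purpose.
    """
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

title = "Klüft skräms inför på fédéral électoral große"
print(strip_accents(title))
```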

Sorantis 2009-07-30 15:44:32

Thanks for the tip, I didn't know it was possible to do this with the standard library.

Bastien Léonard 2009-07-30 15:54:32

He said he wanted to SAVE it, not mangle it.

John Machin 2009-07-30 15:58:47

+1: this answers the question as worded. @williamtroup's problem of not being able to save unicode to a file sounds like an entirely different issue, worthy of a separate question.

Mark Roddy 2009-07-30 16:03:49

@John - that answer predates the OP's clarification.

Dominic Rodger 2009-07-30 16:16:34

@Mark Roddy: His question as written is how to convert a "Unicode string" (whatever he means by that) containing some currency symbols to a "Python string" (whatever ...) and you think that a remove-some-diacritics delete-other-non-ascii characters kludge answers his question???

John Machin 2009-07-30 16:25:08

@Dominic: I'm very sorry; I'll rephrase that: The OP's unclarified question said he wanted to CONVERT it TO A PYTHON STRING, not mangle it.

John Machin 2009-07-30 17:19:04

Not sure if this answers the question, but +1 for introducing me to the unicodedata module's normalize method.

monkut 2009-07-31 01:37:18

Answer 2

A:

Here is an example:

>>> u = u'€€€'
>>> s = u.encode('utf8')
>>> s
'\xe2\x82\xac\xe2\x82\xac\xe2\x82\xac'

Bastien Léonard 2009-07-30 15:46:26

Answer 3

A:

Well, if you're willing/ready to switch to Python 3 (which you may not be due to the backwards incompatibility with some Python 2 code), you don't have to do any converting; all text in Python 3 is represented with Unicode strings, which also means that there's no more usage of the u'<text>' syntax. You also have what are, in effect, strings of bytes, which are used to represent data (which may be an encoded string).

http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

(Of course, if you're currently using Python 3, then the problem is likely something to do with how you're attempting to save the text to a file.)
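As a rough illustration of the text/data split described above, here is a minimal sketch assuming Python 3. The variable names are arbitrary. It only shows that str holds code points while bytes holds one particular encoding of them.

```python
text = "\u20ac\u20ac\u20ac"      # three euro signs, as a str (Unicode code points)
data = text.encode("utf-8")      # the same text as bytes, suitable for storage

print(type(text).__name__, len(text))   # length counts characters
print(type(data).__name__, len(data))   # length counts encoded bytes

# Decoding with the same codec recovers the original text.
round_trip = data.decode("utf-8")
print(round_trip == text)
```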

JAB 2009-07-30 16:09:31

In Python 3 strings are Unicode strings. They are never encoded. I found the following text useful: http://www.joelonsoftware.com/articles/Unicode.html

2009-07-30 16:14:04

He wants to save it to a file; how does your answer help with that?

John Machin 2009-07-30 16:15:16

@lutz: Right, I'd forgotten that Unicode is a character map rather than an encoding. @John: There isn't enough information at the moment to know what the problem with saving it is. Is he getting an error? Is he not getting any errors, but when opening the file externally he gets mojibake? Without that information, there are far too many possible solutions that could be provided.

JAB 2009-07-30 16:24:04

@Cat: There isn't any information at the moment to know what he's got, let alone what his saving problem is. I've asked him to provide some facts -- see my answer.

John Machin 2009-07-30 16:35:30

Answer 4

+1 A:

We need to know what Python version you are using, and what it is that you are calling a Unicode string.

Do the following on a short unicode_string that includes the currency symbols that are causing the bother:

Python 2.x : print type(unicode_string), repr(unicode_string)

Python 3.x : print(type(unicode_string), ascii(unicode_string))

Then edit your question and copy/paste the results of the above print statement. DON'T retype the results.

Also look up near the top of your HTML and see if you can find something like this:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

Tell us what yours says after charset=.

Then you stand a chance of getting meaningful answers.

John Machin 2009-07-30 16:13:26

The charset is currently at charset=utf-8

williamtroup 2009-07-31 07:03:30

Answer 5

+3 A:

If you have a unicode string and want to write it to a file or some other serialised form, you must first encode it into a particular representation that can be stored. There are several common Unicode encodings, such as UTF-16 (2 bytes for most characters, 4 for those outside the Basic Multilingual Plane) or UTF-8 (1-4 bytes per code point, depending on the character). To convert the string into a particular encoding, you can use:

>>> s = u'£10'
>>> s.encode('utf8')
'\xc2\xa310'
>>> s.encode('utf16')
'\xff\xfe\xa3\x001\x000\x00'

This raw string of bytes can be written to a file. However note that when reading it back, you must know what encoding it is in and decode it using that same encoding.
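To make the point about matching encodings concrete, here is a small sketch, assuming Python 3. It encodes a string as UTF-8 and then decodes the bytes twice: once with the right codec and once with Latin-1, which does not raise an error but silently produces mojibake.

```python
original = "£10"
stored = original.encode("utf-8")    # the bytes that would end up on disk

right = stored.decode("utf-8")       # same codec: text is recovered
wrong = stored.decode("latin-1")     # different codec: no error, wrong text

print(repr(right))
print(repr(wrong))
```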

When writing to files, you can get rid of this manual encode / decode process by using the codecs module. So, to open a file that encodes all unicode strings into utf8, use:

import codecs
f = codecs.open('path/to/file.txt','w','utf8')
f.write(my_unicode_string)  # Stored on disk as UTF8

Do note that anything else that is using these files must understand what encoding the file is in if they want to read them. If you are the only one doing the reading/writing this is no problem, otherwise make sure that you write in a form understandable by whatever else uses the files.

In Python 3, this form of file access is the default: the builtin open function takes an encoding parameter and always translates to/from unicode strings (the default string object in Python 3) for files opened in text mode.
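A minimal sketch of the Python 3 behaviour, assuming a scratch file in a temporary directory (the file name example.txt is made up for illustration). Passing encoding explicitly avoids depending on the platform default.

```python
import os
import tempfile

text = "£10 and €20"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")   # hypothetical scratch file

    # Text mode: str goes in, UTF-8 bytes land on disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Text mode again: bytes on disk come back as str.
    with open(path, "r", encoding="utf-8") as f:
        read_back = f.read()

    # Binary mode: shows what was actually stored.
    with open(path, "rb") as f:
        raw = f.read()

print(read_back == text)
print(raw)
```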

Brian 2009-07-30 16:44:54

Answer 6

+1 A:

You can use encode to ASCII if you don't need to translate the non ASCII chars:

>>> a=u"aaaàçççñññ"
>>> type(a)
<type 'unicode'>
>>> a.encode('ascii','ignore')
'aaa'
>>> a.encode('ascii','replace')
'aaa???????'
>>>
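For comparison, here is a sketch of the error handlers side by side, assuming Python 3 (so the results are bytes). Besides ignore and replace, the standard codecs also accept xmlcharrefreplace and backslashreplace, which keep the information in escaped form. The default, strict, raises UnicodeEncodeError.

```python
# Same sample as above, written with escapes so the source stays pure ASCII.
a = "aaa\u00e0\u00e7\u00e7\u00e7\u00f1\u00f1\u00f1"

ignored = a.encode("ascii", "ignore")             # drops non-ASCII characters
replaced = a.encode("ascii", "replace")           # one '?' per dropped character
as_xml = a.encode("ascii", "xmlcharrefreplace")   # numeric character references
as_escapes = a.encode("ascii", "backslashreplace")

print(ignored, replaced, as_xml, as_escapes, sep="\n")

# With no handler given, encoding fails loudly instead of losing data.
try:
    a.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)
```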

Ferran 2009-07-31 07:13:09
