I am looking to replace all high-Unicode characters in a large document, such as accented Es, left and right quotes, etc., with "normal" counterparts in the low range, such as a regular 'E' and straight quotes. I need to perform this on a very large document rather often. I see an example of this in what I think might be Perl here: http://www.designmeme.com/mtplugins/lowdown.txt

Is there a fast way of doing this in Python without using s.replace(...).replace(...).replace(...)...? I've tried that with just a few characters to replace, and stripping the document became really slow.

EDIT, my version of unutbu's code that doesn't seem to work:

# -*- coding: iso-8859-15 -*-
import unidecode
def ascii_map():
    data={}
    for num in range(256):
        h=num
        filename='x{num:02x}'.format(num=num)
        try:
            mod = __import__('unidecode.'+filename,
                             fromlist=True)
        except ImportError:
            pass
        else:
            for l,val in enumerate(mod.data):
                i=h<<8
                i+=l
                if i >= 0x80:
                    data[i]=unicode(val)
    return data

if __name__=='__main__':
    s = u'“fancy“fancy2'
    print(s.translate(ascii_map()))
+5  A: 


# -*- encoding: utf-8 -*-
import unicodedata

def shoehorn_unicode_into_ascii(s):
    return unicodedata.normalize('NFKD', s).encode('ascii','ignore')

if __name__=='__main__':
    s = u"éèêàùçÇ"
    print(shoehorn_unicode_into_ascii(s))
    # eeeaucC

Note, as @Mark Tolonen kindly points out, the method above removes some characters like ß‘’“”. If the above code drops characters that you want translated, then you may have to use the string's translate method to fix those cases manually. Another option is to use unidecode (see J.F. Sebastian's answer).
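
For instance, here is a minimal sketch of that manual fix-up, run as a translate pass before the normalization step (the fix-up table below is illustrative, not exhaustive):

# -*- coding: utf-8 -*-
import unicodedata

# Illustrative fix-up table: a few characters NFKD cannot decompose.
FIXUPS = {0xdf: u'ss',                 # ß LATIN SMALL LETTER SHARP S
          0x2018: u"'", 0x2019: u"'",  # fancy single quotes
          0x201c: u'"', 0x201d: u'"'}  # fancy double quotes

def shoehorn_with_fixups(s):
    # Translate the troublesome characters first, then strip the rest.
    s = s.translate(FIXUPS)
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

if __name__ == '__main__':
    print(shoehorn_with_fixups(u'ß‘’“”'))
    # ss''""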

When you have a large unicode string, using its translate method will be much, much faster than using the replace method.

Edit: unidecode has a more complete mapping of unicode codepoints to ascii. However, unidecode.unidecode loops through the string character-by-character (in a Python loop), which is slower than using the translate method.

The following helper function uses unidecode's data files and the translate method to attain better speed, especially for long strings.

In my tests on 1-6 MB text files, using ascii_map is about 4-6 times faster than unidecode.unidecode.

# -*- coding: utf-8 -*-
import unidecode

def ascii_map():
    """Build a unicode.translate() table from unidecode's data files."""
    data = {}
    # unidecode keeps its mappings in modules x00.py .. xff.py,
    # one per 256-codepoint block of the Unicode range.
    for num in range(256):
        h = num
        filename = 'x{num:02x}'.format(num=num)
        try:
            # a true fromlist makes __import__ return the submodule itself
            mod = __import__('unidecode.' + filename,
                             fromlist=True)
        except ImportError:
            pass  # unidecode ships no file for this block
        else:
            for l, val in enumerate(mod.data):
                i = h << 8     # high byte of the codepoint
                i += l         # low byte of the codepoint
                if i >= 0x80:  # leave plain ASCII untouched
                    data[i] = unicode(val)
    return data

if __name__=='__main__':
    s = u"éèêàùçÇ"
    print(s.translate(ascii_map()))
    # eeeaucC
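
To sanity-check that speed claim on your own machine, a rough benchmark along these lines should do (the file name is a placeholder for your own document; timings will vary with your data):

import time
import unidecode

table = ascii_map()
text = open('big.txt').read().decode('utf-8')  # 'big.txt' is a stand-in path

t0 = time.time()
text.translate(table)
t1 = time.time()
unidecode.unidecode(text)
t2 = time.time()
print('translate: %.2fs   unidecode: %.2fs' % (t1 - t0, t2 - t1))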

Edit2: Rhubarb, if # -*- encoding: utf-8 -*- is causing a SyntaxError, try # -*- encoding: cp1252 -*- instead. Which encoding to declare depends on the encoding your text editor uses to save the file. Linux tends to use utf-8, and Windows, it seems, tends toward cp1252.
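
Alternatively, you can sidestep the declaration entirely by spelling the non-ASCII characters with escape sequences, so the script itself stays pure ASCII:

# no coding declaration needed: the source below is pure ASCII
s = u'\xe9\xe8\xea\xe0\xf9\xe7\xc7'  # the same test string as above, in escapes
print(s.translate(ascii_map()))
# eeeaucC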

unutbu
now that's the proper way to do it
Claudiu
I take it that the .encode to ascii is optional, right?
Rhubarb
No, NFKD normalization breaks characters such as é down into an e and a combining accent. Encoding to ascii with ignore leaves the e and removes the non-ASCII combining accent. The problem is, not all non-ASCII characters have a decomposed form consisting of ASCII and combining characters, so characters like ß‘’“” are just deleted by the ignore.
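For instance:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'\xe9')  # é decomposes
u'e\u0301'
>>> unicodedata.normalize('NFKD', u'\xdf')  # ß has no decomposition
u'\xdf'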
Mark Tolonen
The problem I have now is that the error is occurring inside another library that I don't have source for.
Rhubarb
@Rhubarb: If the current error is unrelated to this question, how about posting a new question? It will give you an opportunity to describe the problem in more detail, and more eyes will see your post.
unutbu
@Rhubarb: Oops... It looks like you've already done just that.
unutbu
@~unutbu, re recent edit: (1) an already-made solution was given in my answer (2) unicode.translate CAN produce more than one character ... """translation table, which must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None""". `u"Gau\xdf".translate({0xdf: u"ss"})` produces `u'Gauss'`
John Machin
@~unutbu: `Unidecode` might be the solution that addresses all the unicode-to-ascii issues comprehensively http://stackoverflow.com/questions/2854230/whats-the-fastest-way-to-strip-and-replace-a-document-of-high-unicode-characters/2876950#2876950
J.F. Sebastian
@J.F. Sebastian: Wow, thanks for bringing that to my attention.
unutbu
@unutbu, I tried your updated code, but received this error: SyntaxError: Non-ASCII character '\xe9' in file t.py on line 21, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details (t.py, line 21). Line 21 refers to: s = u"éèêàùçÇ"
Rhubarb
@unutbu: Provide complete runnable code to novices. @Rhubarb: So fix the problem, then: declare an encoding, for example `# -*- encoding: utf-8 -*-` as the first line of your script. Also find out why by reading the reference given in the error message.
John Machin
Ok thanks John and unutbu, I got the solution working. Surprisingly, though, it strips the fancy quotes from the string and leaves it without normal ascii quotes. This string: “fancy quotes″ just comes back as 'fancy quotes'.
Rhubarb
@Rhubarb: If you are surprised, then you haven't been reading my answer and my comments. Anything that you want translated needs to be in the translate table, with the translation that you desire -- it's under YOUR control. You have access to the source of the various routines, which you can hack at will.
John Machin
@John, I thought the point was that unidecode is providing a full set of translate tables, which would include fancy left and right double quotes. I guess I am missing something here pretty big.
Rhubarb
@Rhubarb: Please edit your original post to include the code that shows the problem. I'm not able to reproduce it. When I use `ascii_map`, “fancy quotes” becomes "fancy quotes" (ascii quotes present)
unutbu
@unutbu, edited, see above. I am obviously missing something really big here and it must be staring me right in the face.
Rhubarb
@Rhubarb: Change `# -*- coding: iso-8859-15 -*-` to `# -*- coding: utf-8 -*-`. The fancy quote characters are not in the iso-8859-15 encoding, and this is causing Python to misinterpret your script.
unutbu
@unutbu, doing that results in: SyntaxError: (unicode error) 'utf8' codec can't decode bytes in position 0-2: invalid data. I wonder why I'm experiencing this differently, if utf8 works for you.
Rhubarb
@Rhubarb: I'm not sure what is causing the difference (`# -*- coding: utf-8 -*-` working for me, but not for you). To help me research the problem, could you tell us what operating system and version of Python you are using, and whether you are using an IDE?
unutbu
@Rhubarb: I've added some alternative code to my answer. It avoids the need to prefix the script with `# -*- coding: utf-8 -*-`. Maybe see if that works for you.
unutbu
unutbu, for what it's worth, I've recreated Rhubarb's problem using your example, even by reading in the file instead of having the string in the script. I am running on Windows XP SP2, Python 2.6, and using unidecode 0.4.1. What versions of these items are you using?
Leeks and Leaks
@Leeks: I'm using Ubuntu 9.10, Python 2.6, unidecode 0.4.3.
unutbu
@~unutbu, I wonder if the issue is an older unidecode version. Anyone else able to reproduce this problem?
Leeks and Leaks
@Leeks: I could be wrong, but I don't really think the problem is in unidecode. I think the problem is that I don't know what encoding Windows is using to encode the script and/or the textfile. Perhaps it is cp1252. Would you please try changing `# -*- coding: utf-8 -*-` to `# -*- coding: cp1252 -*-`?
unutbu
@unutbu, that worked. Pretty good stuff here. I wonder why unidecode isn't part of Python to begin with?
Leeks and Leaks
@Leeks. I'm glad cp1252 worked. I think the reason why unidecode isn't part of the standard library is because transliteration is an ugly business. Some might argue it is wrong-headed to begin with. (é is not e!). Moreover, `unidecode` is not perfect. See the warnings given by the author of Text::Unidecode, the module upon which unidecode is based: http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm
unutbu
Hey it works, thanks!
Rhubarb
Well it seems it doesn't work for u'\2033'...
Rhubarb
@Rhubarb: Indeed. u'\2033' is u'\x833', which means its mapping should be defined in unidecode/x08.py. For some reason unidecode does not ship with a x08.py. Not sure why. If you wish, you could copy x07.py --> x08.py, and edit it appropriately, to define a mapping for u'\x833' however you wish...
unutbu
+3  A: 

There is no such thing as a "high ascii character". The ASCII character set is limited to ordinals in range(128).

That aside, this is a FAQ. Here's one answer. In general, you should familiarise yourself with str.translate() and unicode.translate() -- very handy for multiple substitutions of single bytes/characters. Beware of answers that mention only the unicodedata.normalize() gimmick; that's just one part of the solution.

Update: The currently-accepted answer blows away characters that don't have a decomposition, as pointed out by Mark Tolonen. There seems to be a lack of knowledge of what unicode.translate() is capable of. It CAN translate one character into multiple characters. Here is the output from help(unicode.translate):

S.translate(table) -> unicode

Return a copy of the string S, where all characters have been mapped through the given translation table, which must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None. Unmapped characters are left untouched. Characters mapped to None are deleted.

Here's an example:

>>> u"Gau\xdf".translate({0xdf: u"ss"})
u'Gauss'
>>>

Here's a table of fix-ups from the solution that I pointed to:

CHAR_REPLACEMENT = {
    # latin-1 characters that don't have a unicode decomposition
    0xc6: u"AE", # LATIN CAPITAL LETTER AE
    0xd0: u"D",  # LATIN CAPITAL LETTER ETH
    0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE
    0xde: u"Th", # LATIN CAPITAL LETTER THORN
    0xdf: u"ss", # LATIN SMALL LETTER SHARP S
    0xe6: u"ae", # LATIN SMALL LETTER AE
    0xf0: u"d",  # LATIN SMALL LETTER ETH
    0xf8: u"oe", # LATIN SMALL LETTER O WITH STROKE
    0xfe: u"th", # LATIN SMALL LETTER THORN
    }

This can be easily extended to cater for the fancy quotes and other non-latin-1 characters found in cp1252 and siblings.
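
For example, a sketch of one such extension (codepoints per the Unicode character names; adjust the replacements to taste):

CHAR_REPLACEMENT.update({
    0x2018: u"'",   # LEFT SINGLE QUOTATION MARK
    0x2019: u"'",   # RIGHT SINGLE QUOTATION MARK
    0x201c: u'"',   # LEFT DOUBLE QUOTATION MARK
    0x201d: u'"',   # RIGHT DOUBLE QUOTATION MARK
    0x2013: u"-",   # EN DASH
    0x2026: u"...", # HORIZONTAL ELLIPSIS
    })
print(u"\u201cfancy quotes\u201d".translate(CHAR_REPLACEMENT))
# "fancy quotes"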

John Machin
Thanks, I meant unicode, but at this time of night, that's what I get.
Rhubarb
+1  A: 

If unicodedata.normalize() as suggested by ~unutbu doesn't do the trick, for example if you want more control over the mapping, you should look into str.translate() along with str.maketrans(), a utility to produce a translation table. str.translate() is both efficient and convenient for this type of translation. In Python 2.x, for unicode strings, one needs to use unicode.translate() rather than str.translate(), along with a trick similar to the one shown in the code snippet below in lieu of maketrans(). (Thanks to John Machin for pointing this out!)

These methods are also available in Python 3.x; see for example the Python 3.1.2 documentation (for some reason I had made a mental note that this may have changed in Python 3.x). Of course, under Python 3 all strings are unicode strings, but that's another issue.

#Python 3.1
>>> intab = 'àâçêèéïîôù'
>>> outtab = 'aaceeeiiou'
>>> tmap = str.maketrans(intab, outtab)
>>> s = "à la fête de l'été, où il fait bon danser, les Français font les drôles"
>>> s
"à la fête de l'été, où il fait bon danser, les Français font les drôles"
>>> s.translate(tmap)
"a la fete de l'ete, ou il fait bon danser, les Francais font les droles"
>>>


#Python 2.6
>>> intab = u'àâçêèéïîôù'
>>> outtab = u'aaceeeiiou'
>>> s = u"à la fête de l'été, où il fait bon danser, les Français font les drôles"
>>> #note the trick to replace maketrans() since for unicode strings the translation
>>> #     map expects integers (unicode ordinals) not characters.
>>> tmap = dict(zip(map(ord, intab), map(ord, outtab))) 
>>> s.translate(tmap)
u"a la fete de l'ete, ou il fait bon danser, les Francais font les droles"
>>>
mjv
Wrong. In Python 2.x, use `unicode.translate()`, not `str.translate()`.
John Machin
@John Machin: Right you are! Thanks for noting this. I edited accordingly and added code snippets for both 3.1 and 2.6.
mjv
A: 

Here's a solution that handles latin-1 characters (based on a 2003 usenet thread):

>>> accentstable = str.join("", map(chr, range(192))) + "AAAAAAACEEEEIIIIDNOOOOOxOUUUUYTsaaaaaaaceeeeiiiidnooooo/ouuuuyty"
>>> import string
>>> s = u"éèêàùçÇ"
>>> print string.translate(s.encode('latin1', 'ignore'), accentstable)
eeeaucC

Some of the mappings aren't perfect, e.g. Thorn maps to T rather than Th, but it does a tolerable job.
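
If the multi-character cases matter, one workaround is a small unicode.translate() pre-pass, since the 256-byte table can only do one-to-one replacements (the table below is just a sketch covering three such characters):

>>> MULTI = {0xde: u"Th", 0xfe: u"th", 0xdf: u"ss"}  # Thorn, thorn, sharp s
>>> s = u"Þorn straße"
>>> print string.translate(s.translate(MULTI).encode('latin1', 'ignore'), accentstable)
Thorn strasse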

Duncan
+2  A: 

I believe that unicodedata doesn't work for fancy quotes. You could use Unidecode in this case:

import unidecode
print unidecode.unidecode(u"ß‘’“”")
# -> ss''""
J.F. Sebastian