In trying to fix up a PML (Palm Markup Language) file, it appears that my test file has non-ASCII characters, which is causing MakeBook to complain. The solution would be to strip out all the non-ASCII characters from the PML.

So, in attempting to fix this in Python, I have:

import unicodedata, fileinput

for line in fileinput.input():
    print unicodedata.normalize('NFKD', line).encode('ascii','ignore')

However, this results in an error saying that line must be "unicode, not str". Here's a file fragment:

\B1a\B \tintense, disordered and often destructive rage†.†.†.\t

Not quite sure how to properly pass line in to be processed at this point.

A: 

When reading from a file in Python, you get byte strings, a.k.a. "str" in Python 2.x and earlier. You need to convert these to the "unicode" type using the decode method, e.g.:

line = line.decode('latin1')

Replace 'latin1' with the correct encoding.
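For example, a minimal sketch of the original script with that fix applied (assuming Python 2 and Latin-1 input; substitute your file's real encoding):

import unicodedata, fileinput, sys

for line in fileinput.input():
    # decode the byte string to unicode first, then normalize and re-encode
    text = unicodedata.normalize('NFKD', line.decode('latin1'))
    sys.stdout.write(text.encode('ascii', 'ignore'))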

Laurence Gonsalves
+3  A: 

Try print line.decode('iso-8859-1').encode('ascii', 'ignore') -- that should be much closer to what you want.
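In the context of the original fileinput loop, that becomes (a sketch, assuming Python 2; sys.stdout.write avoids doubling the newlines that print would add):

import fileinput, sys

for line in fileinput.input():
    # decode assuming ISO-8859-1 input, then drop anything non-ASCII
    sys.stdout.write(line.decode('iso-8859-1').encode('ascii', 'ignore'))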

Alex Martelli
This seems to work, although MakeBook is now complaining about illegal control codes.
Jauder Ho
@Jauder, you can of course remove control codes too, for example after the above `clean=''.join(c for c in line if ord(c)>=32)` (removes ALL control codes including newline and carriage return -- adjust to taste, we can't really do it for you without knowing WHAT control codes you want to remove!-).
Alex Martelli
@Alex, if I knew, I would =). The trouble is that I'm working with a Java program whose source isn't available and which only emits a cryptic error message. http://gist.github.com/227882
Jauder Ho
But ideally, I would want to remove the spurious control codes while keeping the LF/CR.
Jauder Ho
@Jauder, fine, but I don't know which ones are "spurious". What about: `spurious=set(chr(c) for c in range(32))-set('\r\n\t')` and of course `clean=''.join(c for c in line if c not in spurious)`, then adjust `spurious` empirically until it is exactly the set of characters you need to remove.
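Put together as a runnable sketch (assuming Python 2; the initial spurious set is just a starting guess):

import fileinput, sys

# every control code except CR, LF and tab; tune this set empirically
spurious = set(chr(c) for c in range(32)) - set('\r\n\t')

for line in fileinput.input():
    sys.stdout.write(''.join(c for c in line if c not in spurious))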
Alex Martelli
+2  A: 

You would like to treat line as ASCII-encoded data, so the answer is to decode it to text using the ascii codec:

line.decode('ascii')

This will raise errors for data that is not in fact ASCII-encoded. To ignore those errors instead:

line.decode('ascii', 'ignore')

This gives you text, in the form of a unicode instance. If you would rather work with (ASCII-encoded) data than with text, you can re-encode it to get back a str or bytes instance (depending on your version of Python):

line.decode('ascii', 'ignore').encode('ascii')
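For instance, assuming the daggers in the fragment above were UTF-8-encoded on disk, the round trip looks like this (a sketch):

line = 'rage\xe2\x80\xa0.'                                   # UTF-8 bytes for 'rage†.'
print repr(line.decode('ascii', 'ignore'))                   # u'rage.'
print repr(line.decode('ascii', 'ignore').encode('ascii'))   # 'rage.'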

Paul Du Bois
A: 

To drop non-ASCII characters, use line.decode(your_file_encoding).encode('ascii', 'ignore'). But you would probably be better off using PML escape sequences for them:

import re

def escape_unicode(m):
    # replace a single non-ASCII character with its PML \Uxxxx escape
    return '\\U%04x' % ord(m.group())

# matches any character outside the ASCII range
non_ascii = re.compile(u'[\x80-\uFFFF]', re.U)

line = u'\\B1a\\B \\tintense, disordered and often destructive rage\u2020.\u2020.\u2020.\\t'
print non_ascii.sub(escape_unicode, line)

This outputs \B1a\B \tintense, disordered and often destructive rage\U2020.\U2020.\U2020.\t.

Dropping non-ASCII and control characters with a regular expression is easy too (this can safely be used after escaping):

# keep tab, LF, CR and printable ASCII; note \x7F is DEL, itself a control code
regexp = re.compile('[^\x09\x0A\x0D\x20-\x7E]')
regexp.sub('', line)
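Combining both steps over a whole file might look like this (a sketch; the filename and the UTF-8 input encoding are assumptions):

import re, sys

def escape_unicode(m):
    # replace a single non-ASCII character with its PML \Uxxxx escape
    return u'\\U%04x' % ord(m.group())

non_ascii = re.compile(u'[\x80-\uFFFF]', re.U)
control = re.compile(u'[^\x09\x0A\x0D\x20-\x7E]')

for line in open('book.pml'):        # 'book.pml' is a placeholder filename
    text = non_ascii.sub(escape_unicode, line.decode('utf-8'))
    sys.stdout.write(control.sub(u'', text).encode('ascii'))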
Denis Otkidach