In trying to fix up a PML (Palm Markup Language) file, it appears that my test file has non-ASCII characters, which is causing MakeBook to complain. The solution would be to strip out all the non-ASCII characters from the PML.

So, in attempting to fix this in Python, I have:

import unicodedata, fileinput

for line in fileinput.input():
    print unicodedata.normalize('NFKD', line).encode('ascii','ignore')

However, this results in an error saying that line must be "unicode, not str". Here's a file fragment:

\B1a\B \tintense, disordered and often destructive rage†.†.†.\t

Not quite sure how to properly pass line in to be processed at this point.

A: 

When reading from a file in Python, you get byte strings, a.k.a. "str" in Python 2.x and earlier. You need to convert these to the "unicode" type using the decode method, e.g.:

line = line.decode('latin1')

Replace 'latin1' with the correct encoding.
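For example, a minimal sketch of the original script with that fix applied (assuming Python 2 and Latin-1 input; substitute your file's real encoding):

import unicodedata, fileinput, sys

for line in fileinput.input():
    # decode the byte string to unicode first, then normalize and re-encode
    text = unicodedata.normalize('NFKD', line.decode('latin1'))
    sys.stdout.write(text.encode('ascii', 'ignore'))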

Laurence Gonsalves
+3  A: 

Try print line.decode('iso-8859-1').encode('ascii', 'ignore') -- that should be much closer to what you want.
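In the context of the original fileinput loop, that becomes (a sketch, assuming Python 2; sys.stdout.write avoids doubling the newlines that print would add):

import fileinput, sys

for line in fileinput.input():
    # decode assuming ISO-8859-1 input, then drop anything non-ASCII
    sys.stdout.write(line.decode('iso-8859-1').encode('ascii', 'ignore'))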

Alex Martelli
This seems to work, although MakeBook is now complaining about illegal control codes.
Jauder Ho
@Jauder, you can of course remove control codes too, for example after the above `clean=''.join(c for c in line if ord(c)>=32)` (removes ALL control codes including newline and carriage return -- adjust to taste, we can't really do it for you without knowing WHAT control codes you want to remove!-).
Alex Martelli
@Alex, if I knew, I would =). The trouble is that I'm working with a Java program whose source isn't available and which only emits a cryptic error message. http://gist.github.com/227882
Jauder Ho
But ideally, I would want to remove the spurious control codes while keeping the LF/CR.
Jauder Ho
@Jauder, fine, but I don't know which ones are "spurious". What about: `spurious=set(chr(c) for c in range(32))-set('\r\n\t')` and of course `clean=''.join(c for c in line if c not in spurious)`, then adjust `spurious` empirically until it is exactly the set of characters you need to remove.
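Put together as a runnable sketch (assuming Python 2; the initial spurious set is just a starting guess):

import fileinput, sys

# every control code except CR, LF and tab; tune this set empirically
spurious = set(chr(c) for c in range(32)) - set('\r\n\t')

for line in fileinput.input():
    sys.stdout.write(''.join(c for c in line if c not in spurious))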
Alex Martelli
+2  A: 

You would like to treat line as ASCII-encoded data, so the answer is to decode it to text using the ascii codec:

line.decode('ascii')

This will raise errors for data that is not in fact ASCII-encoded. To ignore those errors instead:

line.decode('ascii', 'ignore')

This gives you text, in the form of a unicode instance. If you would rather work with (ASCII-encoded) data than with text, you can re-encode it to get back a str or bytes instance (depending on your version of Python):

line.decode('ascii', 'ignore').encode('ascii')
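For instance, assuming the daggers in the fragment above were UTF-8-encoded on disk, the round trip looks like this (a sketch):

line = 'rage\xe2\x80\xa0.'                                   # UTF-8 bytes for 'rage†.'
print repr(line.decode('ascii', 'ignore'))                   # u'rage.'
print repr(line.decode('ascii', 'ignore').encode('ascii'))   # 'rage.'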

Paul Du Bois
A: 

To drop non-ASCII characters, use line.decode(your_file_encoding).encode('ascii', 'ignore'). But you would probably be better off using PML escape sequences for them:

import re

def escape_unicode(m):
    # replace a single non-ASCII character with its PML \Uxxxx escape
    return '\\U%04x' % ord(m.group())

# matches any character outside the ASCII range
non_ascii = re.compile(u'[\x80-\uFFFF]', re.U)

line = u'\\B1a\\B \\tintense, disordered and often destructive rage\u2020.\u2020.\u2020.\\t'
print non_ascii.sub(escape_unicode, line)

This outputs \B1a\B \tintense, disordered and often destructive rage\U2020.\U2020.\U2020.\t.

Dropping non-ASCII and control characters with a regular expression is easy too (this can safely be used after escaping):

# keep tab, LF, CR and printable ASCII; note \x7F is DEL, itself a control code
regexp = re.compile('[^\x09\x0A\x0D\x20-\x7E]')
regexp.sub('', line)
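Combining both steps over a whole file might look like this (a sketch; the filename and the UTF-8 input encoding are assumptions):

import re, sys

def escape_unicode(m):
    # replace a single non-ASCII character with its PML \Uxxxx escape
    return u'\\U%04x' % ord(m.group())

non_ascii = re.compile(u'[\x80-\uFFFF]', re.U)
control = re.compile(u'[^\x09\x0A\x0D\x20-\x7E]')

for line in open('book.pml'):        # 'book.pml' is a placeholder filename
    text = non_ascii.sub(escape_unicode, line.decode('utf-8'))
    sys.stdout.write(control.sub(u'', text).encode('ascii'))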
Denis Otkidach