ansaurus

Question

Stripping non-printable characters from a string in Python

Answer 1

A:

The best I've come up with now is (thanks to the python-izers above)

def filter_non_printable(s):
  # keep tab (9) and everything from space (32) upward
  return ''.join([c for c in s if ord(c) > 31 or ord(c) == 9])

This is the only way I've found that works with Unicode characters/strings.

Any better options?

Vinko Vrsalovic 2008-09-18 13:17:35
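A quick usage sketch of the function above, written for Python 3 (an assumption, since the thread predates it) and with the parameter renamed to text so it does not shadow the built-in str. The sample string is made up for illustration. Note that the ord(c) > 31 test lets DEL (0x7F) and the C1 controls (0x80-0x9F) through, which may or may not be what you want.

```python
def filter_non_printable(text):
    # Keep tab (9) and every code point from space (32) upward.
    return ''.join(c for c in text if ord(c) > 31 or ord(c) == 9)

# Hypothetical sample: accented letters, NUL, tab, newline and DEL.
sample = 'caf\u00e9\x00\tna\u00efve\n\x7f'
cleaned = filter_non_printable(sample)
# NUL and newline are dropped; tab, the accented letters and DEL survive.
```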

Unless you're on python 2.3, the inner []s are redundant. "return ''.join(c for c ...)"

Aaron Gallagher 2008-09-19 04:08:02

Not quite redundant—they have different meanings (and performance characteristics), though the end result is the same.

Miles 2009-06-03 23:31:20
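To make the bracket discussion above concrete, here is a small Python 3 sketch (the helper name keep and the sample text are made up). Both forms produce the same string; the difference is that the bracketed form builds a complete list before join() sees it, while the bare form hands join() a lazy generator.

```python
text = 'a\x01b\tc'

def keep(c):
    # Same test as in the answer: tab, or anything from space upward.
    return ord(c) > 31 or ord(c) == 9

# List comprehension: the whole list exists before join() runs.
via_list = ''.join([c for c in text if keep(c)])

# Generator expression: characters are produced on demand for join().
via_gen = ''.join(c for c in text if keep(c))
```

Which one is faster depends on the interpreter and the input, so it is worth timing both on your own data before relying on either claim.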

Answer 2

+7 A:

As far as I know, the most pythonic/efficient method would be:

import string

filtered_string = filter(lambda x: x in string.printable, myStr)

William Keller 2008-09-18 13:23:14

You probably want filtered_string = ''.join(filter(lambda x: x in string.printable, myStr)) so that you get back a string.

Nathan Sanders 2008-09-18 13:27:56

Sadly string.printable does not contain unicode characters, and thus ü or ó will not be in the output... maybe there is something else?

Vinko Vrsalovic 2008-09-18 13:29:54
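The limitation mentioned above is easy to reproduce. A minimal Python 3 sketch (the function name ascii_printable_only is made up for illustration): string.printable holds only ASCII digits, letters, punctuation and whitespace, so any non-ASCII letter is filtered out together with the control characters.

```python
import string

def ascii_printable_only(text):
    # string.printable is ASCII-only, so non-ASCII letters are dropped too.
    return ''.join(c for c in text if c in string.printable)

result = ascii_printable_only('\u00fcber\x00 caf\u00e9\n')
# The u-umlaut, the NUL and the e-acute are gone; the newline stays,
# because string.printable includes ASCII whitespace.
```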

You should be using a list comprehension or generator expressions, not filter + lambda. One of these will 99.9% of the time be faster. ''.join(s for s in myStr if s in string.printable)

Aaron Gallagher 2008-09-18 22:49:26

The lot of you are correct, of course. I should stop trying to help people while sleep-deprived!

William Keller 2008-09-19 03:20:23

Answer 3

+2 A:

This function uses list comprehensions and str.join, so it runs in linear time instead of O(n^2):

from curses.ascii import isprint

def printable(input):
    return ''.join([char for char in input if isprint(char)])

Just Some Guy 2008-09-18 13:26:00

isprint is also not unicode aware :/

Vinko Vrsalovic 2008-09-18 13:36:13

Answer 4

+10 A:

Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.

import unicodedata, re

all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) == 'Cc')
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0,32) + range(127,160)))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

Ants Aasma 2008-09-18 14:28:04
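For readers on Python 3, a possible port of the code above: unichr becomes chr, and range objects can no longer be concatenated with +, so the two ranges are chained instead. This is a sketch under those assumptions, not the author's original code.

```python
import re
from itertools import chain

# Code points in category 'Cc': U+0000-U+001F and U+007F-U+009F.
control_chars = ''.join(map(chr, chain(range(0, 32), range(127, 160))))
control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    # One regex pass instead of a Python-level loop over characters.
    return control_char_re.sub('', s)
```

Keep in mind that tab, newline and carriage return are control characters too, so this removes them along with the rest.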

Is 'Cc' enough here? I don't know, I'm just asking -- it seems to me that some of the other 'C' categories may be candidates for this filter as well.

Patrick Johnmeyer 2008-09-18 17:10:21
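One way to explore the question above is to drop every character whose general category starts with 'C' (Cc, Cf, Cs, Co, Cn) rather than only 'Cc'. A Python 3 sketch, with a made-up function name; whether this is desirable depends on the data, since 'Cf' includes format characters such as the zero-width joiner that some scripts and emoji sequences rely on.

```python
import unicodedata

def remove_other_chars(s):
    # Drop control, format, surrogate, private-use and unassigned code points.
    return ''.join(c for c in s if not unicodedata.category(c).startswith('C'))

# U+200B ZERO WIDTH SPACE is category 'Cf', so it is removed with the NUL.
stripped = remove_other_chars('a\u200bb\x00c')
```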

Answer 5

+1 A:

You could try setting up a filter using the unicodedata.category() function:

import unicodedata

printable = set(['Lu', 'Ll', ...])  # the categories you want to keep
def filter_non_printable(s):
  return ''.join(c for c in s if unicodedata.category(c) in printable)

See the Unicode database for the available categories

Ber 2008-09-18 15:25:37
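A runnable Python 3 sketch of the whitelist idea above. The set of categories below is purely illustrative and is an assumption, not the list the answer had in mind; pick the categories that match your own definition of "printable".

```python
import unicodedata

# Illustrative whitelist only: upper/lowercase letters, decimal digits,
# space separators and "other" punctuation.
PRINTABLE_CATEGORIES = {'Lu', 'Ll', 'Nd', 'Zs', 'Po'}

def filter_by_category(text):
    return ''.join(c for c in text
                   if unicodedata.category(c) in PRINTABLE_CATEGORIES)

kept = filter_by_category('Hello, W\u00f6rld 42!\x07\n')
```

A whitelist is stricter than removing control characters: anything in a category you forgot to list (for example, math symbols in 'Sm') disappears as well.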

You started a list comprehension that did not end in your final line. I suggest you remove the opening bracket completely.

ΤΖΩΤΖΙΟΥ 2008-09-19 12:13:53

Thank you for pointing this out. I edited the post accordingly

Ber 2008-10-05 15:32:02
