I used to run

$s =~ s/[^[:print:]]//g;

in Perl to get rid of non-printable characters.

In Python there are no POSIX regex classes, so I can't write [:print:] and have it mean what I want. I know of no way in Python to detect whether a character is printable or not.

What would you do?

EDIT: It has to support Unicode characters as well. The string.printable approach will happily strip them out of the output, and curses.ascii.isprint returns False for any non-ASCII character.
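
To make the limitation in the EDIT concrete, here is a minimal sketch, assuming Python 3 (where every str is Unicode). The helper name ascii_printable_only and the sample string are made up for illustration. It filters against string.printable, which only lists ASCII characters, so accented letters are dropped along with the NUL byte.

```python
import string

def ascii_printable_only(text):
    # string.printable contains ASCII digits, letters, punctuation and
    # whitespace only, so any non-ASCII character fails the membership test.
    return ''.join(c for c in text if c in string.printable)

# "naive" with a diaeresis, "cafe" with an acute accent, then a NUL character
sample = "na\u00efve caf\u00e9\x00"
print(ascii_printable_only(sample))
```

The accented letters disappear from the result, which is exactly the behaviour the question wants to avoid.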

A: 

The best I've come up with so far (thanks to the python-izers above) is:

def filter_non_printable(str):
  return ''.join([c for c in str if ord(c) > 31 or ord(c) == 9])

This is the only way I've found that works with Unicode characters/strings.

Any better options?
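
For what it's worth, a quick check of this rule under Python 3 (an assumption; the original predates it) shows what it keeps and what it lets through. The parameter is renamed to text here only to avoid shadowing the built-in str. Note that DEL (U+007F) and the C1 controls (U+0080 to U+009F) have code points above 31, so they survive the filter even though Unicode classifies them as control characters.

```python
def filter_non_printable(text):
    # Same rule as above: keep code points above 31, plus the tab (9).
    return ''.join(c for c in text if ord(c) > 31 or ord(c) == 9)

# Tab and u-umlaut are kept; NUL and newline are dropped.
print(repr(filter_non_printable("a\tb\x00\u00fc\n")))

# DEL and a C1 control character pass through unchanged.
print(repr(filter_non_printable("x\x7f\x85")))
```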

Vinko Vrsalovic
Unless you're on python 2.3, the inner []s are redundant. "return ''.join(c for c ...)"
Aaron Gallagher
Not quite redundant—they have different meanings (and performance characteristics), though the end result is the same.
Miles
+7  A: 

As far as I know, the most pythonic/efficient method would be:

import string

filtered_string = filter(lambda x: x in string.printable, myStr)
William Keller
You probably want filtered_string = ''.join(filter(lambda x: x in string.printable, myStr)) so that you get back a string.
Nathan Sanders
Sadly string.printable does not contain unicode characters, and thus ü or ó will not be in the output... maybe there is something else?
Vinko Vrsalovic
You should be using a list comprehension or generator expressions, not filter + lambda. One of these will 99.9% of the time be faster. ''.join(s for s in myStr if s in string.printable)
Aaron Gallagher
The lot of you are correct, of course. I should stop trying to help people while sleep-deprived!
William Keller
+2  A: 

This function uses list comprehensions and str.join, so it runs in linear time instead of O(n^2):

from curses.ascii import isprint

def printable(input):
    return ''.join([char for char in input if isprint(char)])
Just Some Guy
isprint is also not unicode aware :/
Vinko Vrsalovic
+10  A: 

Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.

import unicodedata, re

all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) == 'Cc')
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0,32) + range(127,160)))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)
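
The code above is written for Python 2 (unichr, xrange, list-returning range). A rough Python 3 equivalent might look like the sketch below. It assumes the same definition of "control character" (category Cc only) and keeps the same names.

```python
import re
import unicodedata

# Category Cc covers U+0000-U+001F and U+007F-U+009F.
control_chars = ''.join(map(chr, list(range(0, 32)) + list(range(127, 160))))

# Sanity check: every character collected above really is in category Cc.
assert all(unicodedata.category(c) == 'Cc' for c in control_chars)

# re.escape protects any characters that are special inside a character class.
control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    # Delete every control character in a single regex pass.
    return control_char_re.sub('', s)

print(repr(remove_control_chars("a\x00b\tc\x7fd\x85e\u00fc")))
```

Note that this also strips tabs and newlines, since both are in category Cc. If those should be kept, they would have to be left out of control_chars.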
Ants Aasma
Is 'Cc' enough here? I don't know, I'm just asking -- it seems to me that some of the other 'C' categories may be candidates for this filter as well.
Patrick Johnmeyer
+1  A: 

You could try setting up a filter using the unicodedata.category() function:

import unicodedata

printable = set(['Lu', 'Ll', ...])
def filter_non_printable(str):
  return ''.join(c for c in str if unicodedata.category(c) in printable)

See the Unicode database for the available categories
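
One possible concrete version of this idea, as a sketch for Python 3: which categories count as "printable" is an assumption here, not something the answer specifies. This version keeps letters (L*), marks (M*), numbers (N*), punctuation (P*), symbols (S*) and the space separator (Zs), and drops everything else, including all of the C* categories. The constant name PRINTABLE_MAJOR_CLASSES is hypothetical.

```python
import unicodedata

# Assumption: these major classes, plus Zs (space separator), are "printable".
PRINTABLE_MAJOR_CLASSES = {'L', 'M', 'N', 'P', 'S'}

def filter_non_printable(text):
    def keep(c):
        cat = unicodedata.category(c)  # two-letter code such as 'Lu' or 'Cc'
        return cat[0] in PRINTABLE_MAJOR_CLASSES or cat == 'Zs'
    return ''.join(c for c in text if keep(c))

# NUL and newline (Cc) and the zero-width space (Cf) are dropped.
print(repr(filter_non_printable("a\u00fc \x00\u200b\n!")))
```

Matching on the first letter of the category avoids listing every two-letter code by hand, and it also filters the format (Cf), surrogate (Cs), private-use (Co) and unassigned (Cn) categories, not just Cc.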

Ber
you started a list comprehension which did not end in your final line. I suggest you remove the opening bracket completely.
ΤΖΩΤΖΙΟΥ
Thank you for pointing this out. I edited the post accordingly
Ber