Imagine a string like 'Agh#$%#%2341- -!zdrkfd', on which I want to perform some operation such that only the lowercase letters are returned (as an example), which in this case would yield 'ghzdrkfd'.

How do you do this in Python? The obvious way would be to create a list of the characters 'a' through 'z', then iterate over the characters in my string and build a new string, character by character, from only those that are in my list. This seems primitive.

I was wondering if regular expressions are appropriate. Replacing unwanted characters seems problematic and I tend to prefer whitelisting over blacklisting. The .match function does not seem appropriate. I have looked over the appropriate page on the Python site, but have not found a method which seems to fit.

If regular expressions are not appropriate and the correct approach is looping, is there a simple function which "explodes" a string into a list? Or am I just hitting another for loop there?

+2  A: 
s = 'ASDjifjASFJ7364'
s_lowercase = ''.join(filter(lambda c: c.islower(), s))
print s_lowercase  # prints 'jifj'
jcoon
There is no need to call list on s. String objects are iterable.
Ayman Hourieh
+12  A: 
s = 'Agh#$%#%2341- -!zdrkfd'  
print ''.join(c for c in s if c.islower())

String objects are iterable; there is no need to "explode" the string into a list. You can put whatever condition you want in the generator expression, and it will filter characters accordingly.
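As a rough illustration of both points (direct iteration, and swapping in a different condition), here is a minimal sketch. The helper name `keep_only` and the whitelist `'Adz'` are made up for this example and are not part of the answer above.

```python
s = 'Agh#$%#%2341- -!zdrkfd'

# A string can be iterated directly; list(s) is only needed if you
# actually want a list of one-character strings.
chars = list(s)

def keep_only(text, allowed):
    """Return the characters of text that appear in allowed (hypothetical helper)."""
    allowed = set(allowed)  # a set makes each membership test cheap
    return ''.join(c for c in text if c in allowed)

# Same generator expression as above, with the built-in condition.
lowercase_only = ''.join(c for c in s if c.islower())

# The same idea with an arbitrary whitelist instead of islower().
custom = keep_only(s, 'Adz')

print(lowercase_only)  # ghzdrkfd
print(custom)          # Azdd
```

Note that `islower()` on a unicode string also accepts non-ASCII lowercase letters, so an explicit whitelist is the safer choice if only a-z should pass.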

You could also implement this using a regex, but this will only hide the loop. The regular expressions library will still have to loop through the characters of the string in order to filter them.

Ayman Hourieh
isalpha() is not needed because non-alpha characters will return false on islower()
jcoon
@coonj Good point. Fixed.
Ayman Hourieh
This can also be modified to work with a custom character list by changing `c.islower()` to e.g. `c in "abcDEF"`.
Ben Blank
Well, darn it. I thought I had the better answer, but this is simpler. Incorporating Ben Blank's comment makes the answer suitably general. I assumed I had to make my list first, but that is not necessary at all.
PyNEwbie
+1  A: 

I'd use a regex. For lowercase match [a-z].
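One way to read this suggestion, as a sketch rather than necessarily what was intended: match the characters you want to keep with `re.findall` and join the pieces. This keeps the whitelisting approach the question asks for.

```python
import re

s = 'Agh#$%#%2341- -!zdrkfd'

# findall returns every non-overlapping match; '[a-z]+' grabs whole runs
# of lowercase ASCII letters, so the pattern names what to keep.
runs = re.findall(r'[a-z]+', s)
result = ''.join(runs)

print(runs)    # ['gh', 'zdrkfd']
print(result)  # ghzdrkfd
```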

Oli
+3  A: 
>>> s = 'Agh#$%#%2341- -!zdrkfd'
>>> ''.join(i for i in s if  i in 'qwertyuiopasdfghjklzxcvbnm')
'ghzdrkfd'
Nixuz
+4  A: 

Using a regular expression is easy enough, especially for this scenario:

>>> import re
>>> s = 'ASDjifjASFJ7364'
>>> re.sub(r'[^a-z]+', '', s)
'jifj'

If you plan on doing this many times, it is best to compile the regular expression beforehand:

>>> import re
>>> s = 'ASDjifjASFJ7364'
>>> r = re.compile(r'[^a-z]+')
>>> r.sub('', s)
'jifj'
Paolo Bergantino
To be fair I ran the test again on your pre-compiled version and it is still slower than translate.
Nadia Alramli
The regex should be '[^a-z]+' - this significantly improves performance.
gnud
@gnud, you are right about improving performance. But it is still much slower than translate.
Nadia Alramli
Thanks, gnud, fixed.
Paolo Bergantino
A: 
import string
print "".join([c for c in "Agh#$%#%2341- -!zdrkfd" if c in string.lowercase])
+14  A: 

If you are looking for efficiency, the translate function is the fastest option you can get.

It can be used to quickly replace characters and/or delete them.

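import string
# Build a 256-character table in which every lowercase letter is replaced by
# a space. Passed as deletechars, it removes everything except a-z.
delete_table = string.maketrans(
    string.ascii_lowercase, ' ' * len(string.ascii_lowercase)
)
# Identity table: no character is remapped, characters are only deleted.
table = string.maketrans('', '')

"Agh#$%#%2341- -!zdrkfd".translate(table, delete_table)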

In Python 2.6 and later, you no longer need the second table:

import string
delete_table  = string.maketrans(
    string.ascii_lowercase, ' ' * len(string.ascii_lowercase)
)
"Agh#$%#%2341- -!zdrkfd".translate(None, delete_table)

This method is much faster than any of the others. Of course, you need to store the delete_table somewhere and reuse it. But even if you don't store it and rebuild it every time, it is still going to be faster than the other methods suggested so far.
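The snippets above are Python 2. On current Python 3, `string.maketrans` is not available and `str.translate` takes no deletechars argument, but `bytes.translate` still accepts a set of bytes to delete. A minimal sketch of the same trick under that assumption (the names `DELETE_BYTES` and `only_lowercase` are made up here, and no claim is made about how it compares in speed):

```python
import string

# The bytes we want to keep: ASCII a-z.
KEEP = string.ascii_lowercase.encode('ascii')

# Every byte value 0-255 that is NOT an ASCII lowercase letter.
DELETE_BYTES = bytes(b for b in range(256) if b not in KEEP)

def only_lowercase(text):
    """Hypothetical helper: drop everything except ASCII a-z."""
    # All bytes of multi-byte UTF-8 sequences are >= 128, so they are
    # deleted too and the remainder is always plain ASCII.
    raw = text.encode('utf-8')
    return raw.translate(None, DELETE_BYTES).decode('ascii')

print(only_lowercase('Agh#$%#%2341- -!zdrkfd'))  # ghzdrkfd
```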

To confirm my claims here are the results:

for i in xrange(10000):
    ''.join(c for c in s if c.islower())

real    0m0.189s
user    0m0.176s
sys     0m0.012s

While running the regular expression solution:

for i in xrange(10000):
    re.sub(r'[^a-z]', '', s)

real    0m0.172s
user    0m0.164s
sys     0m0.004s

[Upon request] If you pre-compile the regular expression:

r = re.compile(r'[^a-z]')
for i in xrange(10000):
    r.sub('', s)

real    0m0.166s
user    0m0.144s
sys     0m0.008s

Running the translate method the same number of times took:

real    0m0.075s
user    0m0.064s
sys     0m0.012s
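To rerun this kind of comparison yourself, something like the following `timeit` sketch could be used. It is a different harness from the one behind the numbers above (Python 3, in-process timing, and the translate variant pays for an extra encode/decode), so the absolute figures and possibly the ranking will differ. The dictionary names are made up for this example.

```python
import re
import timeit

s = 'Agh#$%#%2341- -!zdrkfd'

pattern = re.compile(r'[^a-z]+')
keep = b'abcdefghijklmnopqrstuvwxyz'
# Every byte value that is not an ASCII lowercase letter.
delete_bytes = bytes(b for b in range(256) if b not in keep)

candidates = {
    'join': lambda: ''.join(c for c in s if c.islower()),
    'regex': lambda: pattern.sub('', s),
    'translate': lambda: s.encode('utf-8').translate(None, delete_bytes).decode('ascii'),
}

# Check that all variants agree before timing them.
results = {name: fn() for name, fn in candidates.items()}

# Time 10000 calls of each variant, as in the loops above.
timings = {name: timeit.timeit(fn, number=10000) for name, fn in candidates.items()}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print('%-10s %.4fs' % (name, seconds))
```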
Nadia Alramli
To be fair you should compile the regex outside the loop.
Unknown
I'm comparing the top suggested solutions. That's how Paolo Bergantino wrote his expression.
Nadia Alramli
I wrote it as a one-off solution, it would obviously be best compiled, so you should compare it as such.
Paolo Bergantino
I ran the test again with a pre-compiled expression. As you can see, it is still more than twice as slow as translate.
Nadia Alramli
In order to be fair, change the regex to '[^a-z]+'. That way, it will replace a series of matches in one go, instead of one character at a time.
gnud
@gnud, I tried that. It is a little faster, but still no match for translate. By the way, the larger the string is, the bigger the difference in performance between translate and the other methods. With translate, the processing time hardly grows with string length.
Nadia Alramli
A: 

Here's one solution if you are specifically interested in working on strings:

 s = 'Agh#$%#%2341- -!zdrkfd'
 lowercase_chars = [chr(i) for i in xrange(ord('a'), ord('z') + 1)]
 whitelist = set(lowercase_chars)
 filtered_list = [c for c in s if c in whitelist]

The whitelist is actually a set (not a list) for efficiency.

If you need a string, use join():

filtered_str = ''.join(filtered_list)


filter() is a more generic solution. From the documentation (http://docs.python.org/library/functions.html):

filter(function, iterable)

Construct a list from those elements of iterable for which function returns true. iterable may be either a sequence, a container which supports iteration, or an iterator. If iterable is a string or a tuple, the result also has that type; otherwise it is always a list. If function is None, the identity function is assumed, that is, all elements of iterable that are false are removed.

This would be one way of using filter():

filtered_str = filter(lambda c: c.islower(), s)  # s is a string, so the result is a string too
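The documentation quoted above describes Python 2. In Python 3, `filter()` returns a lazy iterator regardless of the input type, so a `join()` is needed to get a string back. A small sketch that behaves the same on both versions:

```python
s = 'Agh#$%#%2341- -!zdrkfd'

# str.islower can be passed directly; no lambda is needed.
filtered = filter(str.islower, s)

# join() accepts the Python 3 iterator as well as the Python 2 string.
filtered_str = ''.join(filtered)

print(filtered_str)  # ghzdrkfd
```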