views: 2948
answers: 10

I have a simple task I need to perform in Python, which is to convert a string to all lowercase and strip out all characters that are not ASCII letters.

For example:

"This is a Test" -> "thisisatest"
"A235th@#$&( er Ra{}|?>ndom" -> "atherrandom"

I have a simple function to do this:

import string
import sys

def strip_string_to_lowercase(s):
    tmpStr = s.lower().strip()
    retStrList = []
    for x in tmpStr:
        if x in string.ascii_lowercase:
            retStrList.append(x)

    return ''.join(retStrList)

But I cannot help thinking there is a more efficient, or more elegant, way.

Thanks!


Edit:

Thanks to all those who answered. I learned, and in some cases re-learned, a good deal of Python.

+9  A: 

I would:

  • lowercase the string
  • replace all [^a-z] with ""

Like this:

import re

def strip_string_to_lowercase():
  nonascii = re.compile('[^a-z]')
  return lambda s: nonascii.sub('', s.lower().strip())

EDIT: It turns out that the original version (below) is really slow, though some performance can be gained by converting it into a closure (above).

def strip_string_to_lowercase(s):
  return re.sub('[^a-z]', '', s.lower().strip())


My performance measurements with 100,000 iterations against the string

"A235th@#$&( er Ra{}|?>ndom"

revealed that:

  • f_re_0 took 2672.000 ms (this is the original version of this answer)
  • f_re_1 took 2109.000 ms (this is the closure version shown above)
  • f_re_2 took 2031.000 ms (the closure version, without the redundant strip())
  • f_fl_1 took 1953.000 ms (unwind's filter/lambda version)
  • f_fl_2 took 1485.000 ms (Coady's filter version)
  • f_jn_1 took 1860.000 ms (Dana's join version)

For the sake of the test, I did not print the results.

Tomalak
misread it - removed comment :-)
TofuBeer
the strip() isn't particularly needed as anything that strip() would remove is removed by the '[^a-z]' :o)
George Shore
Two loops or not, it's faster than any other way I've tried, including string.translate.
bobince
@George Shore: Good Point!
grieve
@George Shore: You are right. I don't expect it to make much of a difference (performance-wise) though. And I left it in so when you look at the code it's clear instantly that the result will be stripped - it would be a not-so-obvious "side effect" otherwise.
Tomalak
@bobince: it's the slowest solution posted
SilentGhost
@SilentGhost: Not anymore.
Tomalak
did you test your new code?
SilentGhost
anyway it's only fast because it doesn't work
SilentGhost
@SilentGhost: You are right. Stupid error on my side.
Tomalak
@Tomalak: You've got the arguments to re.sub reversed - you're operating on the empty string.
Brian
@Brian: I'm not. It's sub(replacement, string [, count = 0]).
Tomalak
You might want to mention how your version is used: replacer = strip_string_to_lowercase(), then print replacer(s). What a pain.
timkay
@Tomalak: I was still seeing your unfixed code, where you were putting the string first, i.e. "nonascii.sub(subject.lower().strip(), '')". You've fixed that in the previous change.
Brian
@timkay - you only need to call it once (and store the value off) - probably immediately after you define it. Alternatively, stick an @apply decorator before the definition (though that is maybe less clear.)
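For example, a minimal sketch of the @apply trick (Python 2 only, since the apply builtin was removed in Python 3):

import re

# apply() calls the decorated function once at definition time,
# so the name ends up bound directly to the returned lambda
@apply
def strip_string_to_lowercase():
    nonascii = re.compile('[^a-z]')
    return lambda s: nonascii.sub('', s.lower().strip())

print strip_string_to_lowercase("A235th@#$&( er Ra{}|?>ndom")  # -> atherrandom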
Brian
@timkay: The regex is not the most efficient way to do it anyway (using a closure buys a little, but not enough), so this won't end up as the accepted answer.
Tomalak
@Brian: Hopefully I fixed all copy/paste glitches and other stupid mistakes in my code by now. I'm not intending to squeeze any more milliseconds out of the regex approach, it's no use. ;-)
Tomalak
+9  A: 

Not especially runtime efficient, but certainly nicer on poor, tired coder eyes:

import string

def strip_string_and_lowercase(s):
    return ''.join(c for c in s.lower() if c in string.ascii_lowercase)
Dana
as a matter of fact it's more runtime efficient than mine, let alone Tomalak's
SilentGhost
@SilentGhost -- Woah! I'm a genius :P
Dana
it's a rather obvious solution :)
SilentGhost
Join is crazy efficient in python it seems. Most tasks that involve string concatenation are faster via join.
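For example, a rough timeit sketch of that comparison (hypothetical harness; numbers will vary by machine):

import timeit

chars = list("atherrandom" * 100)

concat = timeit.Timer("s = ''\nfor c in chars: s += c",
                      "from __main__ import chars")
joined = timeit.Timer("''.join(chars)",
                      "from __main__ import chars")

# ''.join is typically much faster than building the string with +=
print "concat:", concat.timeit(10000)
print "join:  ", joined.timeit(10000)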
thebigjc
+2  A: 
>>> import string
>>> a = "O235th@#$&( er Ra{}|?<ndom"
>>> ''.join(i for i in a.lower() if i in string.ascii_lowercase)
'otheraltndom'

doing essentially the same as you.

SilentGhost
Yours skips the capital O and R, sg, because you're testing for membership in ascii_lowercase before you call lower()
Dana
+2  A: 

This is a typical application of a list comprehension (strictly, a generator expression here):

import string
s = "O235th@#$&( er Ra{}|?<ndom"
print ''.join(c for c in s.lower() if c in string.ascii_lowercase)

It won't filter out the "&lt;" (HTML entity), as in your example, but I assume that was an accidental cut-and-paste problem.

Ber
A: 

Personally I would use a regular expression and then convert the final string to lower case.

I have no idea how to write it in Python, but the basic idea is:

  1. Remove characters in the string that don't match the case-insensitive regex "\w"

  2. Convert the string to lower-case

or vice versa.
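In Python, a minimal sketch of that idea might look like this (using "[^a-zA-Z]" rather than "\W", since "\w" also matches digits and underscores):

import re

def strip_string_to_lowercase(s):
    # 1. remove everything that is not an ASCII letter
    # 2. convert the remainder to lower-case
    return re.sub('[^a-zA-Z]', '', s).lower()

print strip_string_to_lowercase("A235th@#$&( er Ra{}|?>ndom")  # -> atherrandom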

Dan Roberts
+4  A: 

Similar to @Dana's, but I think this sounds like a filtering job, and that should be visible in the code. Also without the need to explicitly call join():

import string

def strip_string_to_lowercase(s):
  return filter(lambda x: x in string.ascii_lowercase, s.lower())
unwind
This would miss the capital 'A' and 'R', but changing that last s to s.lower() should solve that. Thanks for the tip.
grieve
Oops, sorry, fixed now. Thanks, glad you liked it, bugs and all. :)
unwind
This seems to be the "most efficient way" that the OP asked about. +1
Tomalak
It seems there was an optimization lurking in not using the lambda (see @Brian's answer). Great!
unwind
+12  A: 
>>> filter(str.isalpha, "This is a Test").lower()
'thisisatest'
>>> filter(str.isalpha, "A235th@#$&( er Ra{}|?>ndom").lower()
'atherrandom'
Coady
`str.isalpha` is locale-dependent. It may leave non-ascii characters.
J.F. Sebastian
+11  A: 

Another solution (not that pythonic, but very fast) is to use string.translate - though note that this will not work for Unicode. It's also worth noting that you can speed up Dana's code by moving the characters into a set (which looks up by hash, rather than performing a linear search each time). Here are the timings I get for several of the solutions given:

import string, re, timeit

# Precomputed values (for str_join_set and translate)

letter_set = frozenset(string.ascii_lowercase + string.ascii_uppercase)
tab = string.maketrans(string.ascii_lowercase + string.ascii_uppercase,
                       string.ascii_lowercase * 2)
deletions = ''.join(ch for ch in map(chr,range(256)) if ch not in letter_set)

s="A235th@#$&( er Ra{}|?>ndom"

# From unwind's filter approach
def test_filter(s):
    return filter(lambda x: x in string.ascii_lowercase, s.lower())

# using set instead (and contains)
def test_filter_set(s):
    return filter(letter_set.__contains__, s).lower()

# Tomalak's solution
def test_regex(s):
    return re.sub('[^a-z]', '', s.lower())

# Dana's
def test_str_join(s):
    return ''.join(c for c in s.lower() if c in string.ascii_lowercase)

# Modified to use a set.
def test_str_join_set(s):
    return ''.join(c for c in s.lower() if c in letter_set)

# Translate approach.
def test_translate(s):
    return string.translate(s, tab, deletions)


for test in sorted(globals()):
    if test.startswith("test_"):
        assert globals()[test](s)=='atherrandom'
        print "%30s : %s" % (test, timeit.Timer("f(s)", 
              "from __main__ import %s as f, s" % test).timeit(200000))

This gives me:

               test_filter : 2.57138351271
           test_filter_set : 0.981806765698
                test_regex : 3.10069885233
             test_str_join : 2.87172979743
         test_str_join_set : 2.43197956381
            test_translate : 0.335367566218

[Edit] Updated with filter solutions as well. (Note that using set.__contains__ makes a big difference here, as it avoids making an extra function call for the lambda.)

Brian
The timing code was a nice addition. See my answer below where I added in the filter solutions as well.
grieve
Oops - missed those. I've added a filter solution as well now.
Brian
Accepting this one, because it is comprehensive. It also contains the filter-with-a-set solution, which is the optimal combination of speed and elegance for me.
grieve
Very nice, translation tables are still my favourite.
Christian Witts
test_filter_set = lambda s: filter(letter_set.__contains__, s).lower() is slightly faster
J.F. Sebastian
Good point - no need to call lower() on characters we're just going to throw away. Updated.
Brian
Great answer - I like Tomalak's regex method since it's cleanest and most pythonic. Anything fewer than 10k runs against it will be sufficiently fast to justify code cleanliness/extensibility over speed.
Adam Nelson
+1  A: 

I added the filter solutions to Brian's code:

import string, re, timeit

# Precomputed values (for str_join_set and translate)

letter_set = frozenset(string.ascii_lowercase + string.ascii_uppercase)
tab = string.maketrans(string.ascii_lowercase + string.ascii_uppercase,
                       string.ascii_lowercase * 2)
deletions = ''.join(ch for ch in map(chr,range(256)) if ch not in letter_set)

s="A235th@#$&( er Ra{}|?>ndom"

def test_original(s):
    tmpStr = s.lower().strip()
    retStrList = []
    for x in tmpStr:
        if x in string.ascii_lowercase:
            retStrList.append(x)

    return ''.join(retStrList)


def test_regex(s):
    return re.sub('[^a-z]', '', s.lower())

def test_regex_closure(s):
  nonascii = re.compile('[^a-z]')
  def replacer(s):
    return nonascii.sub('', s.lower().strip())
  return replacer(s)


def test_str_join(s):
    return ''.join(c for c in s.lower() if c in string.ascii_lowercase)

def test_str_join_set(s):
    return ''.join(c for c in s.lower() if c in letter_set)

def test_filter_set(s):
    return filter(letter_set.__contains__, s.lower())

def test_filter_isalpha(s):
    return filter(str.isalpha, s).lower()

def test_filter_lambda(s):
    return filter(lambda x: x in string.ascii_lowercase, s.lower())

def test_translate(s):
    return string.translate(s, tab, deletions)

for test in sorted(globals()):
    if test.startswith("test_"):
        print "%30s : %s" % (test, timeit.Timer("f(s)", 
              "from __main__ import %s as f, s" % test).timeit(200000))

This gives me:

       test_filter_isalpha : 1.31981746283
        test_filter_lambda : 2.23935583992
           test_filter_set : 0.76511679557
             test_original : 2.13079176264
                test_regex : 2.44295629752
        test_regex_closure : 2.65205913042
             test_str_join : 2.25571266739
         test_str_join_set : 1.75565888961
            test_translate : 0.269259640541

It appears that isalpha uses a similar algorithm, at least in terms of O() complexity, to the set-based approach.


Edit: Added the filter set, and renamed the filter functions to be a little more clear.

grieve
+2  A: 

Clean translate method

>>> import string
>>> deletechars = ''.join(set(string.maketrans('',''))
...                       - set(string.ascii_letters))
>>> table = string.maketrans(string.ascii_letters, string.ascii_lowercase*2)
>>> "A235th@#$&( er Ra{}|?>ndom".translate(table, deletechars)
'atherrandom'

Python 3.x translate method

>>> import string, sys
>>> deletechars = ''.join(set(map(chr, range(sys.maxunicode)))
...                       - set(string.ascii_letters))
>>> table = str.maketrans(string.ascii_letters, string.ascii_lowercase*2,
...                       deletechars)
>>> "A235th@#$&( er Ra{}|?>ndom".translate(table)
'atherrandom'
J.F. Sebastian