def format_title(title):
  return ''.join(map(lambda x: x if (x.isupper() or x.islower()) else '_', title.strip()))

Anything faster?

+7  A: 
import re
title = re.sub(r"[\W\d]", "_", title.strip())

should be faster.

If you want to replace a succession of adjacent non-letters with a single underscore, use

title = re.sub(r"[\W\d]+", "_", title.strip())

instead, which is even faster.
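
To make the difference between the two patterns concrete, here is a small sketch (written for Python 3, which is an assumption; the function names are made up for illustration). The sample title is the one used in the benchmark further down the thread.

```python
import re

def format_title_each(title):
    # Every non-letter character gets its own underscore.
    return re.sub(r"[\W\d]", "_", title.strip())

def format_title_collapsed(title):
    # Each run of adjacent non-letters becomes a single underscore.
    return re.sub(r"[\W\d]+", "_", title.strip())

sample = "abc123def_$%^!FOO BAR*bazx-bif"
print(format_title_each(sample))       # abc___def_____FOO_BAR_bazx_bif
print(format_title_collapsed(sample))  # abc_def__FOO_BAR_bazx_bif
```

Note that an underscore already present in the input counts as a word character, so neither pattern touches it; that is why the collapsed output still shows two underscores after "def".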

I just ran a time comparison:

C:\>python -m timeit -n 100 -s "data=open('test.txt').read().strip()" "''.join(map(lambda x: x if (x.isupper() or x.islower()) else '_', data))"
100 loops, best of 3: 4.51 msec per loop

C:\>python -m timeit -n 100 -s "import re; regex=re.compile('[\W\d]+'); data=open('test.txt').read().strip()" "title=regex.sub('_',data)"
100 loops, best of 3: 2.35 msec per loop

This will work on Unicode strings, too. (Under Python 3, \W matches any character that is not a Unicode word character. Under Python 2, you'd additionally have to set the UNICODE flag for this.)

Tim Pietzcker
The way you are using timeit counts all the time spent opening and reading the file. You should move that into the `-s` part to get meaningful results.
gnibbler
Thanks, you're right of course. Since this was done in both examples, the error should be (and is, I just tried) the same in both. Interestingly, I found that precompiling the regex didn't make a difference. I updated the timing examples anyway.
Tim Pietzcker
Note that you can't set flags with re.sub; you have to use re.compile() to specify flags and call sub() on the result. (This is a strange API omission.) Your answer would be better with the "+" version removed; it's not what he asked for, so it's just distracting.
Glenn Maynard
I do have both versions in my answer so he can choose whichever suits his needs better. Interestingly, I didn't find a big performance improvement from precompiling the regex...
Tim Pietzcker
The question told you which one he needs. The correct answer should at least go first, and alternates later.
Glenn Maynard
@Glenn: OK. (15 characters minimum).
Tim Pietzcker
+10  A: 

A faster way to do it is to use str.translate(). This is ~50 times faster than your way:

# You only need to do this once
>>> title_trans=''.join(chr(c) if chr(c).isupper() or chr(c).islower() else '_' for c in range(256))

>>> "abcde1234!@%^".translate(title_trans)
'abcde________'

# Using map+lambda
$ python -m timeit '"".join(map(lambda x: x if (x.isupper() or x.islower()) else "_", "abcd1234!@#$".strip()))'
10000 loops, best of 3: 21.9 usec per loop

# Using str.translate
$ python -m timeit -s 'titletrans="".join(chr(c) if chr(c).isupper() or chr(c).islower() else "_" for c in range(256))' '"abcd1234!@#$".translate(titletrans)'
1000000 loops, best of 3: 0.422 usec per loop

# Here is regex for a comparison
$ python -m timeit -s 'import re;transre=re.compile("[\W\d]+")' 'transre.sub("_","abcd1234!@#$")'
100000 loops, best of 3: 3.17 usec per loop
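
The snippets above are Python 2, where str.translate() takes a 256-character string. Under Python 3 (an assumption about what you are running), str.translate() takes a mapping from code points to replacements instead, so a rough equivalent might look like the sketch below. TITLE_TABLE and format_title_translate are illustrative names, not part of the original answer.

```python
# Build the table once: map every non-letter code point below 256 to "_".
# Code points without an entry are left unchanged by str.translate().
TITLE_TABLE = {
    c: "_"
    for c in range(256)
    if not (chr(c).isupper() or chr(c).islower())
}

def format_title_translate(title):
    return title.translate(TITLE_TABLE)

print(format_title_translate("abcde1234!@%^"))  # abcde________
```

One consequence of this design: characters at code point 256 and above have no entry in the table, so they pass through untouched whether or not they are letters.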

Here is a version for unicode

# coding: UTF-8

def format_title_unicode_translate(title):
    return title.translate(title_unicode_trans)

class TitleUnicodeTranslate(dict):
    def __missing__(self,item):
        uni = unichr(item)
        res = u"_"
        if uni.isupper() or uni.islower():
            res = uni
        self[item] = res
        return res
title_unicode_trans=TitleUnicodeTranslate()

print format_title_unicode_translate(u"Metallica Μεταλλικα")

Note that the Greek letters count as upper and lower, so they are not substituted. If they are to be substituted, simply change the condition to

        if item<256 and (uni.isupper() or uni.islower()):
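
For Python 3, where unichr() and the print statement no longer exist, the same on-demand table could be sketched roughly like this. This is a port under that assumption, not the code that was timed above, and the names are illustrative.

```python
class TitleTranslate(dict):
    """Translation table that is filled lazily, one code point at a time."""

    def __missing__(self, codepoint):
        ch = chr(codepoint)
        # Keep cased letters, replace everything else with an underscore.
        res = ch if (ch.isupper() or ch.islower()) else "_"
        self[codepoint] = res  # cache the answer for later lookups
        return res

title_trans = TitleTranslate()

def format_title_unicode(title):
    return title.translate(title_trans)

print(format_title_unicode("Metallica Μεταλλικα"))  # Metallica_Μεταλλικα
```

Each distinct character is looked up through __missing__ only once; after that the cached entry is used, so the table grows with the variety of the input.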
gnibbler
+1, very good idea. The only drawback I can think of is that this will not work properly on Unicode strings if non-ASCII characters have to be considered.
Tim Pietzcker
@Tim, unicode has a translate also - the semantics are different though, let me see if I can get it to work...
gnibbler
@Tim, unicode version is up. The translation mapping is built on demand, so there will be fewer and fewer misses as more strings are translated.
gnibbler
This is cute, but not a terribly good idea. The regex version is clearer and faster. This dict will also grow without bound if it's used on arbitrary input. (Bounds checking UTF-8 values can avoid this becoming a potential denial-of-service attack, but not all apps want to do that and it shouldn't normally be necessary.)
Glenn Maynard
+2  A: 

Instead of (x.isupper() or x.islower()) you should be able to use x.isalpha(). The isalpha() method might return True for '_' (I don't remember if it does or not) but then you'll just end up replacing '_' with '_' so no harm done. (Thanks for pointing that out, KennyTM.)
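
A quick check settles the question about '_'. The sketch below assumes Python 3 and uses a made-up function name. It also shows one real difference between the two tests: letters from scripts without case (a CJK character, for example) are alphabetic but neither upper nor lower.

```python
def format_title_isalpha(title):
    # Keep alphabetic characters, replace everything else with "_".
    return "".join(c if c.isalpha() else "_" for c in title.strip())

print("_".isalpha())                    # False
print(format_title_isalpha("abc_1 d"))  # abc___d

# Caseless letters pass isalpha() but fail isupper()/islower().
cjk = "\u4e2d"
print(cjk.isalpha(), cjk.isupper() or cjk.islower())  # True False
```

So for titles that may contain caseless scripts, isalpha() keeps characters that the original condition would have turned into underscores.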

MatrixFrog
Actually, it might count '_' itself as an alphabetic character, so maybe not. Try it and see.
MatrixFrog
@Matrix: Replacing `_` with `_` (or not) is harmless.
KennyTM
+1  A: 

Curious about this for my own reasons, I wrote a quick script to test the different approaches listed here, along with just removing the lambda, which I expected (incorrectly) would speed up the original solution.

The short version is that the str.translate approach blows the other ones away. As an aside, the regex solution, while a close second, is incorrect as written above.

Here is my test program:

import re
from time import time


def format_title(title):
    return ''.join(map(lambda x: x if (x.isupper() or x.islower()) else "_",
                       title.strip()))


def format_title_list_comp(title):
    return ''.join([x if x.isupper() or x.islower() else "_" for x in
                    title.strip()])


def format_title_list_comp_is_alpha(title):
    return ''.join([x if x.isalpha() else "_" for x in title.strip()])


def format_title_is_alpha(title):
    return ''.join(map(lambda x: x if x.isalpha() else '_', title.strip()))


def format_title_no_lambda(title):

    def trans(c):
        if c.isupper() or c.islower():
            return c
        return "_"

    return ''.join(map(trans, title.strip()))


def format_title_no_lambda_is_alpha(title):

    def trans(c):
        if c.isalpha():
            return c
        return "_"

    return ''.join(map(trans, title.strip()))


def format_title_re(title):
    return re.sub("[\W\d]+", "_", title.strip())


def format_title_re_corrected(title):
    return re.sub("[\W\d]", "_", title.strip())


TITLE_TRANS = ''.join(chr(c) if chr(c).isalpha() else '_' for c in range(256))


def format_title_with_translate(title):
    return title.translate(TITLE_TRANS)


ITERATIONS = 200000
EXAMPLE_TITLE = "abc123def_$%^!FOO BAR*bazx-bif"


def timetest(f):
    start = time()
    for i in xrange(ITERATIONS):
        result = f(EXAMPLE_TITLE)
    diff = time() - start
    return result, diff


baseline_result, baseline_time = timetest(format_title)


def print_result(f, result, time):
    if result == baseline_result:
        msg = "CORRECT"
    else:
        msg = "INCORRECT"
    diff = time - baseline_time
    if diff < 0:
        indicator = ""
    else:
        indicator = "+"
    pct = (diff / baseline_time) * 100
    print "%s: %0.3fs %s%0.3fs [%s%0.4f%%] (%s - %s)" % (
        f.__name__, time, indicator, diff, indicator, pct, result, msg)


print_result(format_title, baseline_result, baseline_time)

print "----"

for f in [format_title_is_alpha,
          format_title_list_comp,
          format_title_list_comp_is_alpha,
          format_title_no_lambda,
          format_title_no_lambda_is_alpha,
          format_title_re,
          format_title_re_corrected,
          format_title_with_translate]:
    alt_result, alt_time = timetest(f)
    print_result(f, alt_result, alt_time)

And here are the results:

format_title: 3.121s +0.000s [+0.0000%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
----
format_title_is_alpha: 2.336s -0.785s [-25.1470%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_list_comp: 2.369s -0.751s [-24.0773%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_list_comp_is_alpha: 1.735s -1.386s [-44.4021%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_no_lambda: 2.992s -0.129s [-4.1336%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_no_lambda_is_alpha: 2.377s -0.744s [-23.8314%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_re: 1.290s -1.831s [-58.6628%] (abc_def__FOO_BAR_bazx_bif - INCORRECT)
format_title_re_corrected: 1.338s -1.782s [-57.1165%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_with_translate: 0.098s -3.022s [-96.8447%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
  • EDITED: I added a variation that shows list comprehensions significantly improve the original implementation as well as a correct regex implementation that shows it's still nearly as fast when correct. Of course str.translate still wins hands down.
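
The script above is Python 2 (xrange, print statement). If you want to repeat the comparison on Python 3, a trimmed-down harness built on the timeit module might look like the sketch below. It covers only three of the variants, the names are illustrative, and it makes no claim about what numbers you will see.

```python
import re
import timeit

EXAMPLE_TITLE = "abc123def_$%^!FOO BAR*bazx-bif"
NON_LETTER = re.compile(r"[\W\d]")
# Python 3 translate table: non-letter code points below 256 map to "_".
TABLE = {c: "_" for c in range(256)
         if not (chr(c).isupper() or chr(c).islower())}

def with_lambda(title):
    return "".join(map(lambda x: x if (x.isupper() or x.islower()) else "_",
                       title.strip()))

def with_regex(title):
    return NON_LETTER.sub("_", title.strip())

def with_translate(title):
    return title.strip().translate(TABLE)

def compare(funcs, number=10000):
    """Time each function and check its output against the first one."""
    baseline = funcs[0](EXAMPLE_TITLE)
    rows = []
    for f in funcs:
        seconds = timeit.timeit(lambda: f(EXAMPLE_TITLE), number=number)
        rows.append((f.__name__, seconds, f(EXAMPLE_TITLE) == baseline))
    return rows

for name, seconds, ok in compare([with_lambda, with_regex, with_translate]):
    print("%s: %.4fs (%s)" % (name, seconds, "CORRECT" if ok else "INCORRECT"))
```

The first function in the list serves as the baseline, in the same spirit as format_title in the original script.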
John
The regex solution is incorrect since it replaces several adjacent non-letters as just one underscore. Drop the `+` after the character class and it will be correct albeit slower. I guess the question is whether you really want to have long stretches of underscores in your replacement strings or not...
Tim Pietzcker
A: 
import string
letters = string.letters  # Python 2 only; the set of letters is locale-dependent
mystring = list("abc134#$@e##$%%$*&(()#def")
for n, c in enumerate(mystring):
  if c not in letters:
    mystring[n] = "_"  # replace anything that is not a letter
print ''.join(mystring)
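
string.letters is gone in Python 3. A rough equivalent there, assuming only ASCII letters should survive, could use string.ascii_letters with a set for faster membership tests. LETTERS and format_title_ascii are made-up names for this sketch.

```python
import string

# A frozenset makes each "in" test a constant-time lookup.
LETTERS = frozenset(string.ascii_letters)

def format_title_ascii(title):
    # Keep ASCII letters only; everything else becomes an underscore.
    return "".join(c if c in LETTERS else "_" for c in title)

print(format_title_ascii("abc134#$@e##$%%$*&(()#def"))
```

Unlike the isupper()/islower() versions, this one also replaces accented and non-Latin letters.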