ansaurus

Question

Python - efficient method to remove all non-letters and replace them with underscores.

Answer 1

+7 A:

import re
title = re.sub(r"[\W\d]", "_", title.strip())

should be faster.

If you want to replace a succession of adjacent non-letters with a single underscore, use

title = re.sub(r"[\W\d]+", "_", title.strip())

instead which is even faster.
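To make the difference between the two patterns concrete, here is a small sketch, assuming Python 3 and raw-string patterns. The sample title is made up for illustration.

```python
import re

title = "  abc123def_$%^!FOO BAR  "  # hypothetical sample input

# One underscore per non-letter character (what the question asks for).
per_char = re.sub(r"[\W\d]", "_", title.strip())

# One underscore per run of adjacent non-letters.
per_run = re.sub(r"[\W\d]+", "_", title.strip())

print(per_char)
print(per_run)
```

Note that an underscore already present in the input is a word character, so neither pattern matches it; it simply passes through unchanged.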

I just ran a time comparison:

C:\>python -m timeit -n 100 -s "data=open('test.txt').read().strip()" "''.join(map(lambda x: x if (x.isupper() or x.islower()) else '_', data))"
100 loops, best of 3: 4.51 msec per loop

C:\>python -m timeit -n 100 -s "import re; regex=re.compile('[\W\d]+'); data=open('test.txt').read().strip()" "title=regex.sub('_',data)"
100 loops, best of 3: 2.35 msec per loop

This will work on Unicode strings, too. (Under Python 3, \W matches any character that is not a Unicode word character. Under Python 2, you'd additionally have to set the UNICODE flag for this.)
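A short sketch of the Unicode behaviour, assuming Python 3. The sample string is hypothetical; re.ASCII is the standard flag that restricts \w, \W and \d to ASCII, shown here only for contrast.

```python
import re

title = "Metallica Μεταλλικα 2010"  # hypothetical sample input

# str patterns are Unicode-aware by default in Python 3: Greek letters survive.
unicode_aware = re.sub(r"[\W\d]", "_", title)

# With re.ASCII, non-ASCII letters count as non-word characters and are replaced.
ascii_only = re.compile(r"[\W\d]", re.ASCII).sub("_", title)

print(unicode_aware)
print(ascii_only)
```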

Tim Pietzcker 2010-01-31 08:56:24

The way you are using timeit is counting all the time to open and read the file. You should move that stuff to the `-s` part to get meaningful results

gnibbler 2010-01-31 09:37:50

Thanks, you're right of course. Since this was done in both examples, the error should be (and is, I just tried) the same in both. Interestingly, I found that precompiling the regex didn't make a difference. I updated the timing examples anyway.

Tim Pietzcker 2010-01-31 09:53:36

Note that you can't set flags with re.sub; you have to use re.compile() to specify flags and call sub() on the result. (This is a strange API omission.) Your answer would be better with the "+" version removed; it's not what he asked for, so it's just distracting.

Glenn Maynard 2010-01-31 10:29:06
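A minimal sketch of the compile-then-sub pattern described in the comment above, assuming Python 3 (so the flag used for illustration is re.ASCII rather than re.UNICODE). Later Python versions (2.7 and 3.1 onward) also accept a flags keyword argument on re.sub itself, which the second call shows. The sample string is made up.

```python
import re

PATTERN = r"[\W\d]"
title = "na\u00efve caf\u00e9 42"  # hypothetical sample: "naïve café 42"

# Flags supplied at compile time; sub() is then called on the compiled pattern.
compiled = re.compile(PATTERN, re.ASCII)
via_compile = compiled.sub("_", title)

# The same flag passed by keyword directly to re.sub.
via_keyword = re.sub(PATTERN, "_", title, flags=re.ASCII)

print(via_compile)
print(via_keyword)
```

Passing the flag by keyword matters: the fourth positional parameter of re.sub is count, not flags.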

I do have both versions in my answer so he can choose whichever suits his needs better. Interestingly, I didn't find a big performance improvement from precompiling the regex...

Tim Pietzcker 2010-01-31 11:10:31

The question told you which one he needs. The correct answer should at least go first, and alternates later.

Glenn Maynard 2010-01-31 21:34:48

@Glenn: OK. (15 characters minimum).

Tim Pietzcker 2010-02-01 06:33:37

Answer 2

+10 A:

A faster way to do it is to use str.translate(). This is ~50 times faster than your way.

# You only need to do this once
>>> title_trans=''.join(chr(c) if chr(c).isupper() or chr(c).islower() else '_' for c in range(256))

>>> "abcde1234!@%^".translate(title_trans)
'abcde________'

# Using map+lambda
$ python -m timeit '"".join(map(lambda x: x if (x.isupper() or x.islower()) else "_", "abcd1234!@#$".strip()))'
10000 loops, best of 3: 21.9 usec per loop

# Using str.translate
$ python -m timeit -s 'titletrans="".join(chr(c) if chr(c).isupper() or chr(c).islower() else "_" for c in range(256))' '"abcd1234!@#$".translate(titletrans)'
1000000 loops, best of 3: 0.422 usec per loop

# Here is regex for a comparison
$ python -m timeit -s 'import re;transre=re.compile("[\W\d]+")' 'transre.sub("_","abcd1234!@#$")'
100000 loops, best of 3: 3.17 usec per loop
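For readers on Python 3, here is a rough equivalent of the translate approach, assuming a dict-based table that maps code points to replacements. The names TITLE_TRANS and format_title are illustrative, and the table only covers code points below 256, so anything above that range is left unchanged.

```python
# Build the table once: every code point below 256 that is not a cased
# letter maps to an underscore. Code points absent from the dict are kept.
TITLE_TRANS = {
    c: "_"
    for c in range(256)
    if not (chr(c).isupper() or chr(c).islower())
}


def format_title(title):
    """Strip surrounding whitespace, then replace non-letters."""
    return title.strip().translate(TITLE_TRANS)


print(format_title("abcde1234!@%^"))
```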

Here is a version for unicode

# coding: UTF-8

def format_title_unicode_translate(title):
    return title.translate(title_unicode_trans)

class TitleUnicodeTranslate(dict):
    def __missing__(self,item):
        uni = unichr(item)
        res = u"_"
        if uni.isupper() or uni.islower():
            res = uni
        self[item] = res
        return res
title_unicode_trans=TitleUnicodeTranslate()

print format_title_unicode_translate(u"Metallica Μεταλλικα")

Note that the Greek letters count as upper and lower, so they are not substituted. If they are to be substituted, simply change the condition to

        if item<256 and (uni.isupper() or uni.islower()):
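The same lazily built table can be sketched for Python 3, where every str is Unicode and chr() replaces unichr(). This is an adaptation under that assumption, not the code that was timed above; the class and function names are illustrative.

```python
class TitleTranslate(dict):
    """Translation table filled lazily, one code point at a time."""

    def __missing__(self, codepoint):
        char = chr(codepoint)  # Python 3 replacement for unichr()
        result = char if (char.isupper() or char.islower()) else "_"
        self[codepoint] = result  # cache so later lookups skip __missing__
        return result


title_trans = TitleTranslate()


def format_title(title):
    return title.translate(title_trans)


print(format_title("Metallica Μεταλλικα"))
```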

gnibbler 2010-01-31 09:03:27

+1, very good idea. The only drawback I can think of is that this will not work properly on Unicode strings if non-ASCII characters have to be considered.

Tim Pietzcker 2010-01-31 09:44:19

@Tim, unicode has a translate also - the semantics are different though, let me see if I can get it to work...

gnibbler 2010-01-31 09:53:44

@Tim, unicode version is up. The translation mapping is built on demand, so it will have fewer and fewer misses as more strings are translated.

gnibbler 2010-01-31 10:18:47

This is cute, but not a terribly good idea. The regex version is clearer and faster. This dict will also grow without bound if it's used on arbitrary input. (Bounds checking UTF-8 values can avoid this becoming a potential denial-of-service attack, but not all apps want to do that and it shouldn't normally be necessary.)

Glenn Maynard 2010-01-31 10:54:08
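One possible way to address the unbounded-growth concern raised above is to cache only a fixed range of code points and compute the rest on every lookup. This is a sketch assuming Python 3; CACHE_LIMIT and the class name are hypothetical, and the cutoff value is arbitrary.

```python
CACHE_LIMIT = 256  # hypothetical cutoff; only code points below it are cached


class BoundedTitleTranslate(dict):
    """Lazy translation table whose size can never exceed CACHE_LIMIT."""

    def __missing__(self, codepoint):
        char = chr(codepoint)
        result = char if (char.isupper() or char.islower()) else "_"
        if codepoint < CACHE_LIMIT:
            self[codepoint] = result  # remember common characters only
        return result


table = BoundedTitleTranslate()
translated = "a\u039c \u4e2d".translate(table)  # "a", Greek capital Mu, space, a CJK character
print(translated, len(table))
```

The trade-off is that characters above the cutoff pay the cost of __missing__ on every occurrence.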

Answer 3

+2 A:

Instead of (x.isupper() or x.islower()) you should be able to use x.isalpha(). The isalpha() method might return True for '_' (I don't remember if it does or not) but then you'll just end up replacing '_' with '_' so no harm done. (Thanks for pointing that out, KennyTM.)

MatrixFrog 2010-01-31 09:06:55

Actually, it might count '_' itself as an alphabetic character, so maybe not. Try it and see.

MatrixFrog 2010-01-31 09:07:36

@Matrix: Replacing `_` with `_` (or not) is harmless.

KennyTM 2010-01-31 09:10:55
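A quick check settles the question about '_', and also shows where isalpha() and the isupper()/islower() test can disagree. This sketch assumes Python 3 strings; the helper name is_cased_letter is made up for illustration.

```python
def is_cased_letter(ch):
    """The test used in the question: upper-case or lower-case letter."""
    return ch.isupper() or ch.islower()


# "\u4e2d" is a CJK character: alphabetic, but it has no upper or lower case.
samples = ["a", "Z", "_", "5", "\u4e2d"]

for ch in samples:
    print(repr(ch), ch.isalpha(), is_cased_letter(ch))
```

The underscore is not alphabetic, so both tests replace it. For plain ASCII input the two tests agree; they differ only for letters without case.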

Answer 4

+1 A:

Curious about this for my own reasons, I wrote a quick script to test the different approaches listed here, along with a version that just removes the lambda, which I expected (incorrectly) would speed up the original solution.

The short version is that the str.translate approach blows the other ones away. As an aside, the regex solution, while a close second, is incorrect as written above.

Here is my test program:

import re
from time import time


def format_title(title):
    return ''.join(map(lambda x: x if (x.isupper() or x.islower()) else "_",
                       title.strip()))


def format_title_list_comp(title):
    return ''.join([x if x.isupper() or x.islower() else "_" for x in
                    title.strip()])


def format_title_list_comp_is_alpha(title):
    return ''.join([x if x.isalpha() else "_" for x in title.strip()])


def format_title_is_alpha(title):
    return ''.join(map(lambda x: x if x.isalpha() else '_', title.strip()))


def format_title_no_lambda(title):

    def trans(c):
        if c.isupper() or c.islower():
            return c
        return "_"

    return ''.join(map(trans, title.strip()))


def format_title_no_lambda_is_alpha(title):

    def trans(c):
        if c.isalpha():
            return c
        return "_"

    return ''.join(map(trans, title.strip()))


def format_title_re(title):
    return re.sub("[\W\d]+", "_", title.strip())


def format_title_re_corrected(title):
    return re.sub("[\W\d]", "_", title.strip())


TITLE_TRANS = ''.join(chr(c) if chr(c).isalpha() else '_' for c in range(256))


def format_title_with_translate(title):
    return title.translate(TITLE_TRANS)


ITERATIONS = 200000
EXAMPLE_TITLE = "abc123def_$%^!FOO BAR*bazx-bif"


def timetest(f):
    start = time()
    for i in xrange(ITERATIONS):
        result = f(EXAMPLE_TITLE)
    diff = time() - start
    return result, diff


baseline_result, baseline_time = timetest(format_title)


def print_result(f, result, time):
    if result == baseline_result:
        msg = "CORRECT"
    else:
        msg = "INCORRECT"
    diff = time - baseline_time
    if diff < 0:
        indicator = ""
    else:
        indicator = "+"
    pct = (diff / baseline_time) * 100
    print "%s: %0.3fs %s%0.3fs [%s%0.4f%%] (%s - %s)" % (
        f.__name__, time, indicator, diff, indicator, pct, result, msg)


print_result(format_title, baseline_result, baseline_time)

print "----"

for f in [format_title_is_alpha,
          format_title_list_comp,
          format_title_list_comp_is_alpha,
          format_title_no_lambda,
          format_title_no_lambda_is_alpha,
          format_title_re,
          format_title_re_corrected,
          format_title_with_translate]:
    alt_result, alt_time = timetest(f)
    print_result(f, alt_result, alt_time)

And here are the results:

format_title: 3.121s +0.000s [+0.0000%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
----
format_title_is_alpha: 2.336s -0.785s [-25.1470%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_list_comp: 2.369s -0.751s [-24.0773%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_list_comp_is_alpha: 1.735s -1.386s [-44.4021%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_no_lambda: 2.992s -0.129s [-4.1336%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_no_lambda_is_alpha: 2.377s -0.744s [-23.8314%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_re: 1.290s -1.831s [-58.6628%] (abc_def__FOO_BAR_bazx_bif - INCORRECT)
format_title_re_corrected: 1.338s -1.782s [-57.1165%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_with_translate: 0.098s -3.022s [-96.8447%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)

EDITED: I added a variation that shows list comprehensions significantly improve the original implementation as well as a correct regex implementation that shows it's still nearly as fast when correct. Of course str.translate still wins hands down.
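A comparable harness can be sketched with the standard timeit module, assuming Python 3. Only two of the candidates are reproduced here, the compare helper is hypothetical, and timings will differ from the figures above, which came from the Python 2 script.

```python
import re
import timeit


def format_title_re(title):
    return re.sub(r"[\W\d]", "_", title.strip())


def format_title_list_comp(title):
    return "".join([c if c.isalpha() else "_" for c in title.strip()])


EXAMPLE_TITLE = "abc123def_$%^!FOO BAR*bazx-bif"
CANDIDATES = [format_title_re, format_title_list_comp]


def compare(candidates, title, number=10000):
    """Return {function name: (result, seconds)} for each candidate."""
    report = {}
    for func in candidates:
        seconds = timeit.timeit(lambda: func(title), number=number)
        report[func.__name__] = (func(title), seconds)
    return report


if __name__ == "__main__":
    for name, (result, seconds) in compare(CANDIDATES, EXAMPLE_TITLE).items():
        print("%s: %.3fs (%s)" % (name, seconds, result))
```

Returning the result next to the timing keeps the correctness check from the original script: a fast candidate that produces a different string is easy to spot.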

John 2010-01-31 10:12:21

The regex solution is incorrect since it replaces several adjacent non-letters with just one underscore. Drop the `+` after the character class and it will be correct, albeit slower. I guess the question is whether you really want to have long stretches of underscores in your replacement strings or not...

Tim Pietzcker 2010-01-31 11:01:55

Answer 5

A:

import string

letters = string.letters  # Python 2 only; locale-dependent set of letters
mystring = list("abc134#$@e##$%%$*&(()#def")
for n, c in enumerate(mystring):
    # Overwrite every character that is not a letter
    if c not in letters:
        mystring[n] = "_"
print ''.join(mystring)

2010-01-31 11:54:02
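The same loop can be sketched for Python 3, assuming string.ascii_letters as the stand-in for string.letters (which no longer exists there). The function name is illustrative, and only ASCII letters are kept.

```python
import string

LETTERS = frozenset(string.ascii_letters)  # ASCII letters only


def replace_non_letters(text):
    """Replace every character outside LETTERS with an underscore."""
    chars = list(text)
    for index, char in enumerate(chars):
        if char not in LETTERS:
            chars[index] = "_"
    return "".join(chars)


print(replace_non_letters("abc134#$@e##$%%$*&(()#def"))
```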
