I'm new to Python, and I've been playing around with it for simple tasks. I have a bunch of CSVs which I need to manipulate in complex ways, but I'm breaking this up into smaller tasks for the sake of learning Python.

For now, given a list of strings, I want to remove user-defined title prefixes of any names in the strings. Any string which contains a name will contain only a name, with or without a title prefix. I have the following, and it works, but it just feels unnecessarily complicated. Is there a more Pythonic way to do this? Thanks!

# Return new list without title prefixes for strings in a list of strings.
def strip_titles(line, title_prefixes):
    new_csv_line = []
    for item in line:
        for title_prefix in title_prefixes:
            if item.startswith(title_prefix):
                new_csv_line.append(item[len(title_prefix)+1:])
                break
            else:
                if title_prefix == title_prefixes[len(title_prefixes)-1]:
                    new_csv_line.append(item)
                else:
                    continue
    return new_csv_line

if __name__ == "__main__":
    test_csv_line = ['Mr. Richard Stallman', 'I like cake', 'Mrs. Margaret Thatcher', 'Jean-Claude Van Damme']
    test_prefixes = ['Mr.', 'Ms.', 'Mrs.']
    print(strip_titles(test_csv_line, test_prefixes))
+8  A: 
import re
[re.sub(r'^(Mr|Ms|Mrs)\.\s+', '', s) for s in test_csv_line]
Marcelo Cantos
Wow. Very cool. It does, however, leave a single space before the name when it strips out the prefixes.
paracaudex
I will never be tired of seeing the beauty of regular expressions.
unkiwii
@paracaudex: you might have seen my first version when commenting. The current version strips all whitespace after the prefix.
Marcelo Cantos
@Marcelo: Got it, thanks.
paracaudex
+1  A: 

Assuming the set of prefixes varies, perhaps as an aspect of localization, or that you prefer not to use a regular expression for some other reason, you could do something like this (untested code):

def strip_title(string, prefixes):
    for prefix in prefixes:
        if string.startswith(prefix + ' '):
            return string[len(prefix) + 1:]
    return string

stripped = (list(strip_title(cell, prefixes) for cell in line)
            for line in lines)

This is not particularly efficient, since the algorithm does a lot of redundant checking (e.g. checking three times whether the string starts with M). Avoiding that kind of repeated work is a big reason to use regular expressions.
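
As a quick check of the loop-based version, here is a minimal self-contained sketch. The sample `prefixes` and `lines` values are made up for illustration (borrowed from the question's test data), and a nested list comprehension is used instead of the generator so the result can be printed directly:

```python
def strip_title(string, prefixes):
    # Return the string without a leading "<prefix> ", or unchanged if no prefix matches.
    for prefix in prefixes:
        if string.startswith(prefix + ' '):
            return string[len(prefix) + 1:]
    return string

# Hypothetical sample data, not part of the original answer.
prefixes = ['Mr.', 'Ms.', 'Mrs.']
lines = [
    ['Mr. Richard Stallman', 'I like cake'],
    ['Mrs. Margaret Thatcher', 'Jean-Claude Van Damme'],
]

# One stripped list per input line.
stripped = [[strip_title(cell, prefixes) for cell in line] for line in lines]
print(stripped)
```

Because the check is `prefix + ' '`, a string such as `'Mr.X'` (no space after the title) is left alone.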

Alternatively, you could dynamically build a regular expression, by escaping each prefix and joining them with | branches:

def TitleStripper(prefixes):
    import re
    escaped_titles = (re.escape(prefix) for prefix in prefixes)
    prefix_re = re.compile('^({0}) '.format('|'.join(escaped_titles)))
    def strip_title(string):
        return prefix_re.sub('', string, 1)
    return strip_title

The function TitleStripper creates a closure function strip_title that works like the previous one but is built for a particular set of prefixes. After you call strip_title = TitleStripper(prefixes) you can just call strip_title(string).
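
A short usage sketch of the closure approach, repeated here in full so it runs on its own. The prefix list and test strings are assumptions taken from the question's sample data, and the import is moved to module level purely as a matter of style:

```python
import re

def TitleStripper(prefixes):
    # Escape each prefix so characters like '.' are matched literally,
    # then join them into a single alternation anchored at the start.
    escaped_titles = (re.escape(prefix) for prefix in prefixes)
    prefix_re = re.compile('^({0}) '.format('|'.join(escaped_titles)))

    def strip_title(string):
        # Remove at most one leading "<prefix> " occurrence.
        return prefix_re.sub('', string, 1)

    return strip_title

# Build the stripper once, then reuse it for every string.
strip_title = TitleStripper(['Mr.', 'Ms.', 'Mrs.'])

print(strip_title('Mr. Richard Stallman'))
print(strip_title('Mrs. Margaret Thatcher'))
print(strip_title('I like cake'))
```

The regular expression is compiled only once per prefix set, which is the main point of returning a closure rather than passing the prefixes on every call.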

Mostly thanks to the compiled regular expression, this will probably be a bit faster than the first method, perhaps at the expense of clarity.

If you really only ever need to check for three prefixes, either of these methods is overkill, and you should just use a static RE as explained in another answer.

intuited
Why would I need to escape each prefix?
paracaudex
For example, you'll need to escape a `.`, i.e. substitute `\.`, so that it doesn't match any character. You can do this with [re.escape](http://docs.python.org/library/re.html#re.escape).
intuited
Ah, I see. I thought you meant escape the entire thing - like \Mr. I didn't realize re had an escape function.
paracaudex
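
To make the escaping point from the comments above concrete, here is a small sketch. The test string `'Mrs Robinson'` is an invented example, not from the thread:

```python
import re

# Without escaping, '.' is a wildcard that matches any single character.
unescaped = re.compile('^Mr. ')
# With re.escape, the '.' only matches a literal period.
escaped = re.compile('^' + re.escape('Mr.') + ' ')

print(re.escape('Mr.'))

# The unescaped pattern wrongly accepts 'Mrs ' because '.' matches the 's'.
unescaped_hits_mrs = bool(unescaped.match('Mrs Robinson'))
escaped_hits_mrs = bool(escaped.match('Mrs Robinson'))
escaped_hits_mr = bool(escaped.match('Mr. Robinson'))
print(unescaped_hits_mrs, escaped_hits_mrs, escaped_hits_mr)
```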
+1  A: 

A more Pythonic approach would be to replace the "end of list" check with an else: clause on the inner for title_prefix in title_prefixes: loop. The else block runs only if that for loop completes without hitting a break:

# Return new list without title prefixes for strings in a list of strings.    
def strip_titles(line, title_prefixes):
    new_csv_line = []
    for item in line:
        for title_prefix in title_prefixes:
            if item.startswith(title_prefix):
                new_csv_line.append(item[len(title_prefix)+1:])
                break
        else:
            new_csv_line.append(item)
    return new_csv_line

The logic is otherwise the same as yours.
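
A runnable sketch of the for/else version against the question's sample data follows. One small difference is worth noting: with an empty prefix list the inner loop never runs, so the else branch still appends the item, whereas the question's version would append nothing in that case.

```python
def strip_titles(line, title_prefixes):
    new_csv_line = []
    for item in line:
        for title_prefix in title_prefixes:
            if item.startswith(title_prefix):
                new_csv_line.append(item[len(title_prefix) + 1:])
                break
        else:
            # Reached only when the inner loop ended without a break,
            # i.e. no prefix matched (or there were no prefixes at all).
            new_csv_line.append(item)
    return new_csv_line

# Sample data copied from the question.
test_csv_line = ['Mr. Richard Stallman', 'I like cake',
                 'Mrs. Margaret Thatcher', 'Jean-Claude Van Damme']
test_prefixes = ['Mr.', 'Ms.', 'Mrs.']

result = strip_titles(test_csv_line, test_prefixes)
print(result)

# Edge case: an empty prefix list leaves every item untouched.
print(strip_titles(['I like cake'], []))
```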

Just Some Guy