The following code splits a string into a list of words but does not include numbers:

    txt="there_once was,a-monkey.called phillip?09.txt"
    sep=re.compile(r"[\s\.,-_\?]+")
    sep.split(txt)

['there', 'once', 'was', 'a', 'monkey', 'called', 'phillip', 'txt']

This code gives me words and numbers but still includes "_" as a valid character:

re.findall(r"\w+|\d+",txt)
['there_once', 'was', 'a', 'monkey', 'called', 'phillip', '09', 'txt']

What do I need to alter in either piece of code to end up with the desired result of:

['there', 'once', 'was', 'a', 'monkey', 'called', 'phillip', '09', 'txt']
+2  A: 

Here's a quick way that should do it:

re.findall(r"[a-zA-Z0-9]+",txt)

Here's another:

re.split(r"[\s\.,\-_\?]+",txt)

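(you just needed to escape the hyphen: inside a character class an unescaped hyphen defines a range, so ",-_" was read as every character from the comma to the underscore, which includes the digits, and your numbers were being treated as separators)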
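To see the effect of the hyphen, here is a small sketch that runs both patterns on the string from the question. The variable names unescaped and escaped are only for illustration; the patterns themselves are the one from the question and the corrected one.

```python
import re

txt = "there_once was,a-monkey.called phillip?09.txt"

# Unescaped hyphen: ",-_" is a range from "," (0x2C) to "_" (0x5F).
# That range covers the digits (and the uppercase letters), so "09"
# is consumed as part of a separator run.
unescaped = re.compile(r"[\s\.,-_\?]+")

# Escaped hyphen: "\-" is a literal "-", so only the listed
# punctuation and whitespace act as separators.
escaped = re.compile(r"[\s\.,\-_\?]+")

without_numbers = unescaped.split(txt)
with_numbers = escaped.split(txt)

print(without_numbers)
print(with_numbers)
```

The first list drops '09' and the second keeps it. Putting the hyphen first or last in the class (for example r"[\s\.,_\?-]+") should also make it literal without a backslash.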

David Zaslavsky
Perfect, thanks! So simple yet so elusive :)
danspants
+1  A: 

For the example case,

sep = re.compile(r"[^a-zA-Z0-9]+")
sep.split(txt)

should work. To separate numbers from words, try

re.findall(r"[a-zA-Z]+|\d+", txt)
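The difference only shows up when letters and digits touch, which does not happen in the example string. A minimal sketch, using a made-up input (sample is hypothetical, not from the question):

```python
import re

sample = "phillip09_v2"  # hypothetical input where letters and digits are adjacent

# One class for letters and digits: adjacent runs stay glued together.
together = re.findall(r"[a-zA-Z0-9]+", sample)

# Alternation of letters-only and digits-only: each run is its own token.
apart = re.findall(r"[a-zA-Z]+|\d+", sample)

print(together)  # ['phillip09', 'v2']
print(apart)     # ['phillip', '09', 'v', '2']
```

Note that in Python 3 \d also matches non-ASCII decimal digits on str patterns; use [0-9] if only ASCII digits are wanted.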
outis
Don't include the slashes in Python. They're not part of the regular expression. (The slashes are part of the substitution and matching operators in Perl, but for whatever reason some PCRE libraries seem to want regexes with slashes on the ends)
David Zaslavsky
@David: I realized that shortly after posting. Thanks for the heads-up, though.
outis
no problem, just wanted to clarify for anyone reading ;-) I'd remove my downvote now but it appears to be locked, unless you make another edit
David Zaslavsky
@David: like that?
outis
@outis: that seems to have worked, you are hereby un-downvoted ;-)
David Zaslavsky