views: 4391
answers: 8

Weird - I think what I want to do is a fairly common task, but I've found no reference on the web. I have text with punctuation, and I want an array of the words. E.g. "Hey, you - what are you doing here!?" should become ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']. But Python's split() only works with one argument, so after splitting on whitespace I'm left with words that still have punctuation attached. Any ideas?

Thanks

+14  A: 

A case where regular expressions are justified:

import re
DATA = "Hey, you - what are you doing here!?"
print re.findall(r'\w+', DATA)
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
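
The question's expected output is lowercased; if that matters, a small addition (not part of the original answer) maps lower() over the matches:

print [w.lower() for w in re.findall(r'\w+', DATA)]
# Prints ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']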
RichieHindle
Oo, nice approach.
Paolo Bergantino
Thanks. Still interested, though - how can I implement the algorithm used in this module? And why does it not appear in the string module?
ooboo
I don't know why the string module doesn't have a multi-character split. Maybe it's considered complex enough to be in the realm of regular expressions. As for "how can I implement the algorithm", I'm not sure what you mean... it's there in the re module - just use it.
RichieHindle
No, I mean - how does this module work? It's not straightforward at all.
ooboo
Regular expressions can be daunting at first, but are very powerful. The regular expression '\w+' means "a word character (a-z etc.) repeated one or more times". There's a HOWTO on Python regular expressions here: http://www.amk.ca/python/howto/regex/
RichieHindle
I got that - I don't mean how to use the re module (it's pretty complicated in itself), but how is it implemented? split() is rather straightforward to program manually; this is much more difficult...
ooboo
You want to know how the re module itself works? I can't help you with that I'm afraid - I've never looked at its innards, and my Computer Science degree was a very long time ago. 8-)
RichieHindle
I'm only doing CS1, so I've got a long way to go... At first glance it actually seems very difficult - harder than TSP etc. :)
ooboo
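
For the curious, here is a minimal hand-rolled sketch of a multi-separator split (an illustration only - the re module itself compiles patterns into a matching engine written in C, which is far more involved):

def split_on_many(text, separators):
    # Replace every separator character with a space, then split
    # on whitespace; naive but easy to follow
    for sep in separators:
        text = text.replace(sep, ' ')
    return text.split()

print split_on_many("Hey, you - what are you doing here!?", ',-!?')
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']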
A: 

You want Python's RegEx module's findall() method:

http://www.regular-expressions.info/python.html


Tyson
Python's "RegEx" module? Python once had a "regex" module which was provided but deprecated up until 2.5 when it vanished.
John Machin
"re"? As in the one included in 2.6.2? http://docs.python.org/library/re.html The one included in the bleeding edge 3.2a0? http://docs.python.org/dev/py3k/library/re.html Something tells me it's not deprecated and is, in fact, the definitive RegEx module for Python.
Tyson
+8  A: 

re.split()

re.split(pattern, string[, maxsplit=0])

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)

>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
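
Applied to the question's string - note that \W+ leaves a trailing empty string when the text ends with punctuation, which you may want to filter out:

>>> [w for w in re.split(r'\W+', 'Hey, you - what are you doing here!?') if w]
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']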
gimel
+1  A: 

try this:

import re

phrase = "Hey, you - what are you doing here!?"
matches = re.findall(r'\w+', phrase)
print matches

this will print ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

Corey Goldberg
+1  A: 

Another way to achieve this is to use the Natural Language Toolkit (nltk).

import nltk
data= "Hey, you - what are you doing here!?"
word_tokens = nltk.tokenize.regexp_tokenize(data, r'\w+')
print word_tokens

This prints: ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

The biggest drawback of this method is that you need to install the nltk package.

The benefit is that you can do a lot of fun stuff with the rest of the nltk package once you have your tokens.
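
For example (a small sketch beyond the original answer - nltk's general-purpose tokenizer keeps punctuation as separate tokens instead of discarding it):

import nltk
data = "Hey, you - what are you doing here!?"
# word_tokenize may require the 'punkt' tokenizer data to be downloaded first
print nltk.word_tokenize(data)
# Prints something like ['Hey', ',', 'you', '-', 'what', 'are', 'you', 'doing', 'here', '!', '?']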

tgray
A: 

Use a list comprehension for this stuff... it seems easier.

data= "Hey, you - what are you doing here!?"
tokens = [c for c in data if c not in (',', ' ', '-', '!', '?')]

I find this easier to comprehend (read: maintain) than using a regexp, simply because I am not that good at regexps... which is the case for most of us :). Also, if you know what set of separators you might be using, you can keep them in a set. With a very large set this might be slower... but the 're' module is slow as well.

OK... this is WRONG! It works only if you want to get characters out, not words. My mistake.
For a list comprehension that yields words, see the sketch below, or use what ghostdog74 has given.
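
A list comprehension that does yield words (a sketch: split on whitespace first, then strip punctuation from the edges of each piece):

import string
data = "Hey, you - what are you doing here!?"
# Strip punctuation from each whitespace-separated piece and drop
# pieces that were pure punctuation (like the lone '-')
words = [w.strip(string.punctuation) for w in data.split()]
print [w for w in words if w]
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']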
+4  A: 

Another way, without regex:

import string

thestring = "Hey, you - what are you doing here!?"
# Drop every punctuation character, then split on whitespace
print ''.join(c for c in thestring if c not in string.punctuation).split()
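
Equivalently (a Python 2 sketch, not part of the original answer), str.translate can delete all the punctuation in one call:

import string
thestring = "Hey, you - what are you doing here!?"
# translate(None, deletechars) removes every character in deletechars
print thestring.translate(None, string.punctuation).split()
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']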
ghostdog74
A: 

Kinda late answer :), but I had a similar dilemma and didn't want to use the 're' module.

def my_split(s, seps):
    # Start with the whole string as one piece, then re-split the
    # pieces on each separator in turn
    res = [s]
    for sep in seps:
        s, res = res, []
        for seq in s:
            res += seq.split(sep)
    return res

print my_split('1111  2222 3333;4444,5555;6666', [' ', ';', ','])
['1111', '', '2222', '3333', '4444', '5555', '6666']
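
Note that consecutive separators produce empty strings (the '' after '1111' above). If you want them dropped, filter the result:

print [p for p in my_split('1111  2222 3333;4444,5555;6666', [' ', ';', ',']) if p]
# Prints ['1111', '2222', '3333', '4444', '5555', '6666']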
pprzemek