How can I split by word boundary in a regex engine that doesn't support it?

Python's re module can match on \b but doesn't seem to support splitting on it. I seem to recall other regex engines having the same limitation.

example input:

"hello, foo"

expected output:

['hello', ', ', 'foo']

actual python output:

>>> re.compile(r'\b').split('hello, foo')
['hello, foo']
A: 

Try

>>> re.compile(r'\W\b').split('hello, foo')
['hello,', 'foo']

This splits at the non-word character before a boundary. Your example has nothing to split on, since \b matches a position rather than any characters.
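
As a side note, newer Python versions behave differently here: since Python 3.7, re.split() also splits on empty matches, so \b can be used directly. A minimal sketch, assuming Python 3.7 or later; the helper name split_on_boundaries is made up for illustration:

```python
import re

def split_on_boundaries(text):
    # Assumes Python 3.7+, where re.split() honours zero-width matches.
    # Boundaries at the very start or end yield empty strings, so drop them.
    return [piece for piece in re.split(r'\b', text) if piece]

print(split_on_boundaries('hello, foo'))  # ['hello', ', ', 'foo']
```

On older interpreters the empty matches are ignored and the string comes back unsplit, which is the behaviour shown in the question.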

gnud
+5  A: 

(\W+) can give you the expected output:

>>> re.compile(r'(\W+)').split('hello, foo')
['hello', ', ', 'foo']
CMS
Can you explain why?
Robert Gamble
Because it splits on runs of non-word characters (in this case the comma and the whitespace), and because the pattern uses capturing parentheses, the text matched by the group is also returned as part of the resulting list.
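
To make the effect of the parentheses visible, here is a small side-by-side sketch of the same split with and without a capturing group (the variable names are just for illustration):

```python
import re

text = 'hello, foo'

# Non-capturing: the separators are consumed and dropped.
without_group = re.split(r'\W+', text)

# Capturing: each separator is kept as its own list item.
with_group = re.split(r'(\W+)', text)

print(without_group)  # ['hello', 'foo']
print(with_group)     # ['hello', ', ', 'foo']
```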
CMS
@CMS: You might want to mention the option for a re.U flag, or a "(?u)" prefix in the regex, since we live in a multilingual world.
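
A rough sketch of what the flag changes, assuming Python 3, where str patterns are already Unicode-aware by default and re.ASCII switches back to ASCII-only \w (on Python 2 you would pass re.U to get the first behaviour):

```python
import re

text = 'h\u00e9llo, w\u00f6rld'  # 'héllo, wörld'

# Python 3 str patterns treat accented letters as word characters.
unicode_split = re.split(r'(\W+)', text)

# With re.ASCII, \w is ASCII-only, so words break at accented letters.
ascii_split = re.split(r'(\W+)', text, flags=re.ASCII)

print(unicode_split)
print(ascii_split)
```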
ΤΖΩΤΖΙΟΥ
+1  A: 

Ok I figured it out:

Put the split pattern in capturing parens and the matched text will be included in the output. You can use either \w+ or \W+:

>>> re.compile(r'(\w+)').split('hello, foo')
['', 'hello', ', ', 'foo', '']

To get rid of the empty results, pass it through filter() with None as the filter function, which will filter out anything that doesn't evaluate to true (on Python 3, filter() returns an iterator, hence the list() call):

>>> list(filter(None, re.compile(r'(\w+)').split('hello, foo')))
['hello', ', ', 'foo']

Edit: CMS points out that if you use \W+ you don't need filter() for this input, since the string starts and ends with word characters.
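
If the input can start or end with punctuation, the \W+ version produces empty strings at the edges too, so filtering is still useful in general. A minimal sketch; split_keep_separators is a hypothetical helper name, not something from the re module:

```python
import re

def split_keep_separators(text):
    # Split on runs of non-word characters, keep them via the capturing
    # group, and drop the empty strings produced at the edges.
    return [piece for piece in re.split(r'(\W+)', text) if piece]

quoted = '"hello, foo"'
print(re.split(r'(\W+)', quoted))     # ['', '"', 'hello', ', ', 'foo', '"', '']
print(split_keep_separators(quoted))  # ['"', 'hello', ', ', 'foo', '"']
```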

ʞɔıu
A: 

Interesting. Most regex engines I have tried so far do perform this split.

I played with it a bit and found that re.compile(r'(\W+)').split('hello, foo') gives the output you expected. I'm not sure whether that's reliable, though.

PhiLho
+1  A: 

One can also use re.findall() for this:

>>> re.findall(r'.+?\b', 'hello, foo')
['hello', ', ', 'foo']
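
One caveat with this pattern: a chunk is only returned if a word boundary follows it, so trailing punctuation is silently dropped. A small sketch comparing it with an alternation of \w+ and \W+, which is my own suggested variant rather than part of the answer above:

```python
import re

text = 'hello, foo!'

# The lazy pattern needs a \b after each chunk; there is none after '!'.
lazy_boundary = re.findall(r'.+?\b', text)

# Alternating runs of word and non-word characters covers the whole string.
alternation = re.findall(r'\w+|\W+', text)

print(lazy_boundary)  # ['hello', ', ', 'foo']
print(alternation)    # ['hello', ', ', 'foo', '!']
```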
PEZ