ansaurus

Question

Split by \b when your regex engine doesn't support it

Answer 1

A:

Try

>>> re.compile(r'\W\b').split('hello, foo')
['hello,', 'foo']

This splits at the non-word character before a boundary. Your example has nothing to split on.

gnud 2008-12-29 20:31:15

Answer 2

+5 A:

(\W+) can give you the expected output:

>>> re.compile(r'(\W+)').split('hello, foo')
['hello', ', ', 'foo']

CMS 2008-12-29 20:38:26

Can you explain why?

Robert Gamble 2008-12-29 20:47:05

Because it splits on runs of non-word characters (in this case the comma and the whitespace), and because the pattern uses capturing parentheses, the text matched by the group is also returned as part of the resulting list.

CMS 2008-12-29 21:07:43
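
A small sketch of the difference the capturing parentheses make, using the same input as in the answer. The variable names are only illustrative.

```python
import re

text = 'hello, foo'

# Without a capturing group, the matched separators are discarded.
without_group = re.split(r'\W+', text)

# With a capturing group, each matched separator is kept in the result.
with_group = re.split(r'(\W+)', text)

print(without_group)  # ['hello', 'foo']
print(with_group)     # ['hello', ', ', 'foo']
```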

@CMS: You might want to mention the option for a re.U flag, or a "(?u)" prefix in the regex, since we live in a multilingual world.

ΤΖΩΤΖΙΟΥ 2008-12-31 00:46:43
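
To illustrate the flag mentioned in the comment above, here is a sketch that assumes Python 3, where str patterns are already Unicode-aware (so re.U is redundant there) and re.ASCII restores the ASCII-only behaviour. On Python 2 you would instead pass re.U or prefix the pattern with "(?u)" and use unicode strings. The sample text is made up.

```python
import re

text = 'привет, мир'  # hypothetical non-ASCII sample input

# Python 3 str patterns are Unicode-aware by default,
# so Cyrillic letters count as word characters.
unicode_parts = re.split(r'(\W+)', text)

# With re.ASCII, \w only covers [a-zA-Z0-9_], so every character here
# counts as \W and the whole string is treated as one separator.
ascii_parts = re.split(r'(\W+)', text, flags=re.ASCII)

print(unicode_parts)  # ['привет', ', ', 'мир']
print(ascii_parts)    # ['', 'привет, мир', '']
```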

Answer 3

+1 A:

Ok I figured it out:

Put the split pattern in capturing parens and the matched text will be included in the output. You can use either \w+ or \W+:

>>> re.compile(r'(\w+)').split('hello, foo')
['', 'hello', ', ', 'foo', '']

To get rid of the empty results, pass the list through filter() with None as the filter function, which drops anything that doesn't evaluate to true (on Python 3, filter() returns an iterator, so wrap it in list()):

>>> list(filter(None, re.compile(r'(\w+)').split('hello, foo')))
['hello', ', ', 'foo']

Edit: CMS points out that if you use \W+ you don't need to use filter() for this input.

ʞɔıu 2008-12-29 20:39:05
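
One caveat worth sketching: the \W+ variant avoids empty strings only when the input starts and ends with word characters. The helper name split_keep_separators below is hypothetical, and the list comprehension is just one way of dropping the empty strings.

```python
import re

def split_keep_separators(text):
    """Split text into alternating word / non-word runs, dropping empty strings."""
    return [part for part in re.split(r'(\W+)', text) if part]

# A trailing non-word character leaves an empty string at the end.
raw = re.split(r'(\W+)', 'hello, foo!')
print(raw)  # ['hello', ', ', 'foo', '!', '']

print(split_keep_separators('hello, foo!'))  # ['hello', ', ', 'foo', '!']
print(split_keep_separators('hello, foo'))   # ['hello', ', ', 'foo']
```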

Answer 4

A:

Interesting. So far most RE engines I tried do this split.

I played a bit and found that re.compile(r'(\W+)').split('hello, foo') is giving the output you expected... Not sure if that's reliable, though.

PhiLho 2008-12-29 20:39:57
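
For what it's worth, newer Python versions behave like those other engines: starting with Python 3.7, re.split() also splits on empty matches, so \b can be used directly. A short sketch, assuming Python 3.7 or later:

```python
import re

# Requires Python 3.7+: earlier versions do not split on empty matches.
parts = re.split(r'\b', 'hello, foo')
print(parts)  # ['', 'hello', ', ', 'foo', '']

# The boundaries at the very start and end produce empty strings; drop them.
non_empty = [p for p in parts if p]
print(non_empty)  # ['hello', ', ', 'foo']
```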

Answer 5

+1 A:

One can also use re.findall() for this:

>>> re.findall(r'.+?\b', 'hello, foo')
['hello', ', ', 'foo']

PEZ 2008-12-29 21:41:31
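
A note on the findall() approach, as a sketch: the pattern .+?\b requires a word boundary after every match, so non-word characters at the very end of the input are silently dropped. Alternating \w+ and \W+ is one possible way around that; the sample input with a trailing "!" is made up for illustration.

```python
import re

text = 'hello, foo!'

# .+?\b needs a word boundary after each match, so the trailing '!' is lost.
lazy = re.findall(r'.+?\b', text)
print(lazy)  # ['hello', ', ', 'foo']

# Alternating runs of word and non-word characters cover the whole input.
runs = re.findall(r'\w+|\W+', text)
print(runs)  # ['hello', ', ', 'foo', '!']
```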
