I've got a string that I'm trying to split into chunks based on blank lines.

Given a string s, I thought I could do this:

re.split('(?m)^\s*$', s)

This works in some cases:

>>> s = 'foo\nbar\n \nbaz'
>>> re.split('(?m)^\s*$', s)
['foo\nbar\n', '\nbaz']

But it doesn't work if the line is completely empty:

>>> s = 'foo\nbar\n\nbaz'
>>> re.split('(?m)^\s*$', s)
['foo\nbar\n\nbaz']

What am I doing wrong?

[python 2.5; no difference if I compile '^\s*$' with re.MULTILINE and use the compiled expression instead]

A: 

Looks like this functions as designed. From http://docs.python.org/library/re.html

Note that split will never split a string on an empty pattern match. For example:

>>> re.split('x*', 'foo')
['foo']
>>> re.split("(?m)^$", "foo\n\nbar\n")
['foo\n\nbar\n']

You may want to try building something around re.finditer instead.
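
As a minimal sketch of the re.finditer idea, one could walk the matches and slice out the text between them, so that empty matches still count as split points. The helper name split_on_matches is hypothetical and not part of the re module.

```python
import re

def split_on_matches(pattern, text):
    """Split text at every match of pattern, including empty matches."""
    chunks = []
    start = 0
    for m in re.finditer(pattern, text):
        # Keep everything between the previous match and this one.
        chunks.append(text[start:m.start()])
        start = m.end()
    # Whatever follows the last match is the final chunk.
    chunks.append(text[start:])
    return chunks

s = 'foo\nbar\n\nbaz'
chunks = split_on_matches(r'(?m)^\s*$', s)
# chunks == ['foo\nbar\n', '\nbaz']
```

The chunks keep the newlines on either side of the blank line, so a caller would probably strip each one afterwards.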

Zac Thompson
Ah. Oh well. Thanks.
John Fouhy
A: 

What you're doing wrong is using regular expressions. What is wrong with ('Some\ntext.').split('\n')?

Daniel Straight
He wants to match blank lines that may have whitespace. Splitting on "\n" will split every line apart. Splitting on "\n\n" (which is probably what you meant) won't work on blank lines with whitespace on them.
Glenn Maynard
Because that doesn't split the input where he asked for. He wants to separate groups of text by multiple newlines. That is, two lines containing text that are separated by a single newline stay together, but if they are separated by two (or presumably more) newlines, with only whitespace on any blank lines, they should be split apart.
TokenMacGuy
So don't say "blank" if you don't mean "blank."
Daniel Straight
+5  A: 

Try this instead:

re.split('\n\s*\n', s)

The problem is that "^\s*$" actually only matches "spaces (if any) that are alone on a line", not the newlines themselves. This leaves the delimiter empty when there's nothing on the line, and re.split won't split on an empty match.
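
To see the empty delimiter directly, this small sketch lists the spans matched by the question's pattern in both sample strings. The helper name match_spans is made up for illustration.

```python
import re

BLANK_LINE = re.compile(r'(?m)^\s*$')

def match_spans(text):
    """Return the (start, end) span of every match of the question's pattern."""
    return [m.span() for m in BLANK_LINE.finditer(text)]

print(match_spans('foo\nbar\n \nbaz'))  # [(8, 9)]: the single space is the delimiter
print(match_spans('foo\nbar\n\nbaz'))   # [(8, 8)]: a zero-width match, nothing to split on
```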

This version also gets rid of the delimiting newlines themselves, which is probably what you want. Otherwise, you'll have the newlines stuck to the beginning and end of each split part.

Treating multiple consecutive blank lines as defining an empty block ("abc\n\n\ndef" -> ["abc", "", "def"]) is trickier...
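
One possible way to get that behaviour, sketched here without regular expressions, is to treat every blank line as its own delimiter and group the lines in between. The function name split_blocks is hypothetical, and this assumes a line counts as blank when it contains only whitespace.

```python
def split_blocks(text):
    """Split text into blocks, treating each blank line as one delimiter."""
    blocks = [[]]
    for line in text.splitlines():
        if line.strip() == '':
            # A blank (or whitespace-only) line closes the current block.
            blocks.append([])
        else:
            blocks[-1].append(line)
    return ['\n'.join(block) for block in blocks]

print(split_blocks('abc\n\n\ndef'))      # ['abc', '', 'def']
print(split_blocks('foo\nbar\n \nbaz'))  # ['foo\nbar', 'baz']
```

Because it uses str.splitlines, a single trailing newline at the end of the input does not produce an extra empty block.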

Glenn Maynard
However, it leaves even-numbered empty lines at the beginning of their chunks, which might not be desired.
eswald
Try the alternate (added).
Glenn Maynard
Funny how your mind can get stuck in a rut. I needed multiline for some other matching, and so it seemed obvious to use it here. So much for "obvious". I will keep Zac's answer as accepted because he quoted my exact situation from the docs, but your answer is very helpful too!
John Fouhy
I gave an explanation and a solution; he didn't.
Glenn Maynard
D'oh! Well, that's what I get for rushing to answer. I actually think that your alternate example here is equivalent to your first; \s includes \n, after all. In other words, I don't think eswald is right. Also this answer will not "deal" with terminating newlines in the string to be split, if that matters. But it's better than my "give up and go home" approach.
Zac Thompson
Oh, you're right. Somehow "\s*" became " *" while I was editing. I've fixed the answer; that simplifies it a lot.
Glenn Maynard
A: 

Is this what you want?

>>> s = 'foo\nbar\n\nbaz'
>>> re.split('\n\s*\n',s)
['foo\nbar', 'baz']

>>> s = 'foo\nbar\n \nbaz'
>>> re.split('\n\s*\n',s)
['foo\nbar', 'baz']

>>> s = 'foo\nbar\n\t\nbaz'
>>> re.split('\n\s*\n',s)
['foo\nbar', 'baz']
Sinan Ünür