I'm trying to extract paragraphs from a string in C# using regular expressions. By paragraphs, I mean blocks of text ending with two or more consecutive \r\n sequences (NOT HTML paragraphs, <p>).

Here is a sample text:

For example this is a paragraph with a carriage return here
and a new line here.

At this point, second paragraph starts. A paragraph ends if double or more \r\n is matched or
if reached at the end of the string ($).

I tried the pattern:

Regex regex = new Regex(@"(.*)(?:(\r\n){2,}|\r{2,}|\n{2,}|$)", RegexOptions.Multiline);

but this does not work. It matches every line ending with a single \r\n. What I need is to capture all characters, including single carriage returns and newlines, until a double \r\n is reached.

+2  A: 

.* is greedy and consumes as much as it can. Because your second set of () includes $ as an alternative, the expression that is effectively being used is (.*)(?:$). To make the .* lazy instead of greedy, follow it with a ?.

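When you specify RegexOptions.Multiline, ^ and $ match at the start and end of every line, so $ succeeds at each line break. In addition, without RegexOptions.Singleline the . does not match \n, so .* can never span more than one line. Use RegexOptions.Singleline instead: . then matches every character, including \n, and $ only matches at the end of the input (or just before a final \n).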

Regex regex = new Regex(@"(.*?)(?:(\r\n){2,}|\r{2,}|\n{2,}|$)", RegexOptions.Singleline);
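As a minimal sketch of how the paragraphs could be pulled out with this kind of pattern: the class and method names below (ParagraphMatcher, GetParagraphs) are made up for illustration, and the sketch uses (.+?) rather than (.*?) so that no empty match is produced at the end of the input. It also trims each capture, on the assumption that stray \r or \n characters at the edges are not wanted.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ParagraphMatcher
{
    // Lazy capture, followed by two or more line breaks or the end of the input.
    // Singleline lets '.' match '\n', so a capture can span several lines.
    private static readonly Regex ParagraphRegex = new Regex(
        @"(.+?)(?:(\r\n){2,}|\r{2,}|\n{2,}|$)",
        RegexOptions.Singleline);

    public static List<string> GetParagraphs(string text)
    {
        var paragraphs = new List<string>();
        foreach (Match match in ParagraphRegex.Matches(text))
        {
            // Group 1 is the paragraph body without the separator.
            string paragraph = match.Groups[1].Value.Trim();

            // A trailing line break at the very end of the input can yield
            // a whitespace-only capture; skip it.
            if (paragraph.Length > 0)
            {
                paragraphs.Add(paragraph);
            }
        }
        return paragraphs;
    }
}
```

Calling ParagraphMatcher.GetParagraphs(text) on the sample text from the question should give two entries, the first of which still contains its single \r\n.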
rchern
Thanks a lot. (.+?)(?:(\r\n){2,}|\r{2,}|\n{2,}|$) works...
radgar
A: 

Do you have to use a regular expression? Tools like Coco/R could make this job pretty easy as well. In addition, a generated parser might just prove to be faster than generating code at runtime using a regex.

COMPILER YourParaProcessor
// your code goes here
TOKENS
newLine= '\r'|'\n'.
paraLetter = ANY - '\n' - '\r' .

YourParaProcessor 
=
 {Paragraph}
.

Paragraph =
  {paraLetter} '\r\n' .
Andrew Matthews
+1  A: 

The opposite approach is to match the separators instead of the paragraphs, which makes the problem almost trivial. Consider:

string[] paragraphs = Regex.Split(text, @"^\s*$", RegexOptions.Multiline);

By splitting the input string on empty lines you can easily get all paragraphs. If you only want blank lines with no spaces, you can simplify that even further and use the pattern ^$ (with \r\n line endings, use ^\r?$ instead, because in .NET $ only matches before \n). In that case you can also use the non-regex String.Split with an array of separators:

string[] separators = {"\n\n", "\r\r", "\r\n\r\n"};
string[] paragraphs = text.Split(separators,
                                 StringSplitOptions.RemoveEmptyEntries);
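
One thing to watch with the Regex.Split variant is that the pieces keep the line breaks next to each blank line, and an empty piece can appear. A small sketch that cleans this up, assuming trimmed paragraphs are what is wanted (ParagraphSplitter and SplitParagraphs are illustrative names, not an existing API):

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ParagraphSplitter
{
    public static string[] SplitParagraphs(string text)
    {
        // Split on blank (or whitespace-only) lines, then strip the line
        // breaks left at the edges of each piece and drop empty pieces.
        return Regex.Split(text, @"^\s*$", RegexOptions.Multiline)
            .Select(piece => piece.Trim())
            .Where(piece => piece.Length > 0)
            .ToArray();
    }
}
```

Single line breaks inside a paragraph are left untouched; only the edges of each piece are trimmed.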
Kobi
Your approach seems to work too, but rchern's approach is faster when you need to insert a prefix before every match. Thanks.
radgar
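
For the prefix case mentioned in the comment above, one possible sketch uses Regex.Replace with a match evaluator on the matching pattern. The names ParagraphPrefixer and AddPrefix, and the "> " prefix in the usage note, are assumptions for illustration; the thread does not show how the prefix was actually inserted.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ParagraphPrefixer
{
    public static string AddPrefix(string text, string prefix)
    {
        // Each match covers one paragraph plus its trailing separator,
        // so putting the prefix in front of the whole match leaves the
        // original line breaks in place.
        return Regex.Replace(
            text,
            @"(.+?)(?:(\r\n){2,}|\r{2,}|\n{2,}|$)",
            match => prefix + match.Value,
            RegexOptions.Singleline);
    }
}
```

For example, ParagraphPrefixer.AddPrefix("A\r\n\r\nB", "> ") puts "> " in front of both A and B. An evaluator is used instead of a replacement string so that a prefix containing $ is not interpreted as a substitution.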