Ok, I have a multi-line string I'm trying to do some clean-up on.

Each line may or may not be part of a big block of quoted text. Example:

This line is not quoted.
This part of the line is not quoted “but this is.”
This one is not quoted either.
“This entire line is quoted”
Not quoted.
“This line is quoted
and so is this one
and so is this one.”
This is not quoted “but this is
and so is this.”

I need a RegEx replacement that will un-wrap the hard-wrapped quoted lines, i.e., replace "\r\n" with a space, but only between the curly quotes.

Here's how it should look after replacement:

This line is not quoted.
This part of the line is not quoted “but this is.”
This one is not quoted either.
“This entire line is quoted”
Not quoted.
“This line is quoted and so is this one and so is this one.”
This is not quoted “but this is and so is this.”

(Note how the last two lines were multiple lines in the input text.)

Constraints

  • Ideally need a single Regex replace call
  • Using .NET RegEx library
  • The quotes are always start/end curly quotes, not plain ol' double-ticks ("), which should make this a little easier.

Important Constraint

This is not direct .NET code; I'm populating a table of "searchfor/replacewith" strings that are then called via RegEx.Replace. I don't have the ability to add custom code such as Match Evaluators, looping through captured groups, etc.

Current answer so far, something along the lines of:

r.Replace("(?<=“)\r\n(?=”)", " ")

Obviously, I'm not even close yet.

The same logic could be applied to, say, color-coding of block comments in programming code--anything inside the block comment is not treated the same way as the stuff outside the comments. (Code is a little trickier since start/end block comment delimiters can also legitimately exist within a literal string, an issue I don't have to deal with here.)

A: 

So the thing to do is to find a string that starts with an opening quote, continues with text containing no closing quote and no \r or \n characters, and is then followed by a series of one or more \r or \n characters. Capture everything but the terminal \r \n characters, and replace the whole match with the captured portion.
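A minimal sketch of that description, written in Python's re module rather than .NET (an assumption made for illustration; the constructs used here behave the same way in both engines). The name FIRST_BREAK and the helper join_first_break are hypothetical, and the replacement appends a space after the captured text because the question asks for one.

```python
import re

# Hypothetical pattern: an opening curly quote, then text with no closing
# quote and no line-break characters (captured), then one or more \r or \n.
FIRST_BREAK = re.compile(r"(“[^”\r\n]*)[\r\n]+")

def join_first_break(text):
    # Replace the whole match with the captured text plus a single space.
    return FIRST_BREAK.sub(r"\1 ", text)

# A quoted block containing two hard line breaks.
one_pass = join_first_break("“a\r\nb\r\nc.”")
```

In a single pass this joins only the first break after each opening quote: one_pass still contains the second \r\n, because the remaining text no longer starts with an opening quote.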

-- MarkusQ

MarkusQ
So, you are suggesting something like: (“[^\r”]+)\r\n replaced with $1[ ]
Close! That will capture the first line break within the quoted text, but not any others... replacement is not recursive.
richardtallent
A: 

I think the simplest way would be to match the quoted sections with “(?s:.*?)” and use a MatchEvaluator to remove any newlines. The MatchEvaluator code could be as simple as

Replace(@"\s+", " ");

You could, of course, refine this to match only quoted sections that actually contain newlines, and replace only newlines within those sections instead of all whitespace, but it's probably not worth the effort.
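For illustration only, here is roughly how that approach looks in Python, where a replacement function passed to re.sub plays the role of the .NET MatchEvaluator. This is a sketch under that assumption, not the .NET code itself; QUOTED, collapse_whitespace and unwrap_quotes are hypothetical names.

```python
import re

# re.DOTALL lets the dot match line breaks, like the (?s:...) group above.
QUOTED = re.compile(r"“.*?”", re.DOTALL)

def collapse_whitespace(match):
    # Stand-in for the MatchEvaluator: squeeze every run of whitespace
    # inside the quoted section down to a single space.
    return re.sub(r"\s+", " ", match.group(0))

def unwrap_quotes(text):
    # Only the quoted sections are passed through collapse_whitespace.
    return QUOTED.sub(collapse_whitespace, text)

text = "Not quoted.\r\n“This line is quoted\r\nand so is this one.”\r\nNot quoted."
unwrapped = unwrap_quotes(text)
```

Line breaks outside the curly quotes are left alone because the callback only ever sees the quoted matches.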

Alan Moore
I'm programming a set of RegEx calls all made from a table in a particular order, not writing custom code here.
richardtallent
Okay, then see my other answer.
Alan Moore
A: 

You cannot do what you want within the limits you have described.

Proof:

  • Your fixed table of replacements will execute a fixed number of calls to replace (call this n).
  • Each replace will only be able to eliminate a fixed number of line breaks (call this number m).

Therefore

  • A quoted block with m*n+1 line breaks will not be properly dealt with.

You either need to increase the power of your setup (e.g. by allowing more complex replacement, recursive replacements, an indefinite repetition flag, or...?) or accept the fact that this task can't be done by your engine.

-- MarkusQ

MarkusQ
The more I looked at the various suggestions, the more I think you'd be right if I needed to check for balanced quotes. Alan came up with an answer that works for my specific use case, where I can depend on the quotes being balanced. Thanks for all the help!
richardtallent
+1  A: 

NB: For testing regexes I use http://gskinner.com/RegExr/ which is very useful.

I don't think you can write a single expression that will replace an undefined number of newlines. However, you can write an expression to replace one or several, and either repeatedly run it or write it to deal with the max number of newlines you'll have within one quoted section.

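First, you want single-line mode, which makes . match newline characters as well, so that a match can span several lines instead of stopping at each line break. (Negated character classes such as [^”] already match newlines, so the flag only matters where you use a dot.) Put this at the start of your expression to turn it on: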

(?s)

Then, you want a look-behind expression to match the start quote:

(?<=“)

And a look-ahead to match the end quote:

(?=”)

Now an expression to match some text, then a newline, then some text:

([^”\r\n]*)(?:\r\n)?([^”\r\n]*)

Note that there are two capturing groups for the bits of text around the newline, so you can include that text in your replace expression. Each newline is matched as a \r\n pair, and \n is excluded from the text groups so that it cannot be left behind in the output. This will match text that has just one newline within the quotes. To extend this to two newlines, just add another optional newline and optional following text:

(?s)(?<=“)([^”\r\n]*)(?:\r\n)?([^”\r\n]*)(?:\r\n)?([^”\r\n]*)(?=”)

You could extend this to match as many newlines as you think might occur. Not perfect, but perhaps sufficient. Or if you can repeatedly run the expression on your text then just replace a single one at a time.

Leaving your expression something like this:

r.Replace("(?s)(?<=“)([^”\r\n]*)(?:\r\n)?([^”\r\n]*)", "$1 $2")

(This isn't quite correct as it'll add a space after text even if group two doesn't match... but it's a start)
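If repeated runs are an option, the "one at a time" idea can be sketched as a loop that reapplies a single-break replacement until the text stops changing. The sketch below uses Python's re module instead of .NET and a simplified hypothetical pattern (ONE_BREAK) that only matches when a line break is actually present, so it does not add stray spaces. The max_passes cap is also an assumption, added as a safety limit.

```python
import re

# Hypothetical one-break pattern: an opening quote, text containing no
# closing quote or line break (captured), then exactly one line separator.
ONE_BREAK = re.compile(r"(“[^”\r\n]*)(?:\r\n|\r|\n)")

def unwrap_by_repetition(text, max_passes=1000):
    # Each pass joins at most one line break per quoted block, so keep
    # going until a pass leaves the text unchanged.
    for _ in range(max_passes):
        joined = ONE_BREAK.sub(r"\1 ", text)
        if joined == text:
            break
        text = joined
    return text

wrapped = "“a\r\nb\r\nc\r\nd.”\r\nnot quoted"
unwrapped = unwrap_by_repetition(wrapped)
```

The number of passes grows with the longest quoted block, which is why this only helps when the calling setup can repeat a replacement an arbitrary number of times.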

Rory
An elegant form of brute force... good idea. Unfortunately, there may be a few hundred lines of text that need to be joined between the curly quotes. Alan's answer below did the trick.
richardtallent
Actually, because you marked it as accepted, that answer is now above, not below. :-)
Alan Moore
+4  A: 

Assuming all curly quotes are properly balanced, this regex should do what you want:

@"[\r\n]+(?=[^“”]*”)"

The [\r\n]+ will match one or more line separators of any type--Unix (\n), DOS (\r\n) or older Mac (\r). Then the lookahead asserts that there's a close-quote ahead and that there's no open-quote between here and there. Then your replacement text can be a simple space character.
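As a quick check, the pattern can be run against the sample text from the question. The sketch below uses Python's re module rather than .NET (an assumption for illustration; character classes and lookaheads behave the same way in both for this pattern), and the names UNWRAP, sample and result are hypothetical.

```python
import re

# The pattern from this answer, written as a Python raw string.
UNWRAP = re.compile(r"[\r\n]+(?=[^“”]*”)")

# The sample input from the question, with DOS-style line endings.
sample = (
    "This line is not quoted.\r\n"
    "This part of the line is not quoted “but this is.”\r\n"
    "This one is not quoted either.\r\n"
    "“This entire line is quoted”\r\n"
    "Not quoted.\r\n"
    "“This line is quoted\r\n"
    "and so is this one\r\n"
    "and so is this one.”\r\n"
    "This is not quoted “but this is\r\n"
    "and so is this.”"
)

# A single replace call: every line break that sits inside a quoted
# block becomes one space; breaks outside the quotes are untouched.
result = UNWRAP.sub(" ", sample)
```

A line break outside the quotes fails the lookahead because the next curly quote after it is an opening one (or there is none at all), so only breaks inside a quoted block are replaced. This relies on the quotes being balanced, as stated above.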

Alan Moore
But what is the replacement?
strager
The replacement would be a string consisting of a single space character. All that's being replaced is the line separator.
Alan Moore
I can, in this case, assume the curly quotes are properly balanced. Genius, Alan. I knew there had to be something that would work without recursion...
richardtallent