Hi.

I have a problem: my string can contain a newline sequence, '\r\n'.

Part of my regex:

string sRegex = "(?<string>\"+.*\"|'+.*')";

How should I modify this regex to exclude newlines from my string?

Thanks for your help.

+2  A: 

In most languages (except Ruby, I think), multiline parsing has to be enabled explicitly. By multiline parsing I mean including the newline character explicitly, rather than implicitly terminating the match at the newline.

In .NET you want to do:

Regex.Match("string", "regex", RegexOptions.Multiline)

and "regex" would have to contain strings with the explicitly stated newlines, like

"regex\nnewline"

which would match the inside 2 lines of:

hello
regex
newline
world
Marcin
A: 

You can try something like this:

string sRegex = "(?<string>\"+(.*[\r\n]*)*\"|'+(.*[\r\n]*)*')";

It should cover a string like this

"Akim
Khalilov
StackOverflow"

I'm sure that this regex can be optimized.

Because you didn't provide sample text, it's possible that I'm trying to solve a different problem here.

Vadim
+2  A: 

I don't think there's enough information to fully answer your question, but I think we can provide you with enough information to solve it yourself.

Look at Regex Workbench (http://code.msdn.microsoft.com/RegexWorkbench). It's a great tool for figuring out the right regular expression. The binaries provided are for a very old .NET, but you can recompile it.

Review the RegexOptions enumeration (http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions%28VS.71%29.aspx), especially RegexOptions.Multiline. This is probably something you'll need.

There are two ways to specify options: RegexOptions and "inline constructs" (http://msdn.microsoft.com/en-us/library/yd1hzczs%28VS.71%29.aspx). For example, Multiline can be specified inline by putting (?m) at the start of the pattern:

string sRegex = "(?m)(?<string>\"+.*\"|'+.*')";

A few additional notes:

I use verbatim strings for regex, because \ is already an escape character in regular expressions, and having to double-escape it makes things messy.

I'd rather store my regular expression in a Regex object than in a string, as it's richer typing. The exception for me is when I am composing strings to make a new regular expression. In that case, I call the variable fooRegexText to make that clear.

I find regular expressions of any complexity difficult to read. I use whitespace in the regular expression to help my poor brain out (using IgnorePatternWhitespace).

Applying those, I'd write:

Regex regex = new Regex(
@"(?mx) # Multiline, IgnorePatternWhitespace
    (?<string>
        ""+.*""
            |
        '+.*'
    )
");
Jay Bazuzi
+2  A: 

Are you saying you want to match quoted strings only if they don't contain newlines? If so, you don't have to do anything special because the dot doesn't match newlines by default. Aside from the + after the opening quotes (which makes no sense to me) your regex should work fine. But I second Jay's suggestion that you use verbatim string literals for writing regexes:

Regex sRegex = new Regex(@"(?<string>"".*""|'.*')");

What you do need to watch out for is greediness. For example, if there are two string declarations on the same line, like this:

var s1 = "foo", s2 = "bar";

...the regex will find one match, "foo", s2 = "bar", where you expected it to match "foo" and "bar" separately. To avoid that, you can use a non-greedy quantifier:

Regex sRegex = new Regex(@"(?<string>"".*?""|'.*?')");
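
To see the difference between the two quantifiers side by side, here is a minimal sketch. The class name QuoteGreedinessDemo and the helper FindAll are made up for illustration; only Regex.Matches and the two patterns come from the discussion above.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class QuoteGreedinessDemo
{
    // The sample line from above: var s1 = "foo", s2 = "bar";
    public const string Line = "var s1 = \"foo\", s2 = \"bar\";";

    // Greedy: .* runs to the last quote on the line.
    public const string Greedy = @"(?<string>"".*""|'.*')";

    // Non-greedy: .*? stops at the first closing quote.
    public const string Lazy = @"(?<string>"".*?""|'.*?')";

    // Returns the text of every match of pattern in input.
    public static string[] FindAll(string pattern, string input)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

Calling QuoteGreedinessDemo.FindAll(QuoteGreedinessDemo.Greedy, QuoteGreedinessDemo.Line) should return a single element, "foo", s2 = "bar", while the Lazy pattern should return "foo" and "bar" as two separate elements.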


If you do want to match strings with newlines in them, you can use the Singleline option, which modifies the behavior of the dot, enabling it to match newlines.

Regex sRegex = new Regex(@"(?<string>"".*?""|'.*?')",
                         RegexOptions.Singleline);

...or you can use the inline modifier:

Regex sRegex = new Regex(@"(?s)(?<string>"".*?""|'.*?')");

Be aware that when you use the dot in singleline mode it's especially important that you use a non-greedy quantifier, since potential matches are no longer confined to a single line. But here's another alternative that's more efficient as well as more predictable:

Regex sRegex = new Regex(@"(?<string>""[^""]*""|'[^']*')");

There's no need to specify singleline mode with this regex because you aren't using the dot metacharacter. The negated character class [^"] matches any character except a quotation mark--including newlines.
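
Here is a rough sketch comparing the dot-based and negated-class patterns on a string that spans two lines. The sample text, the class name NewlineStringDemo and the helper FindAll are assumptions made up for this example; the patterns are the ones shown above.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class NewlineStringDemo
{
    // Assumed sample text: a double-quoted string containing \r\n,
    // followed by a single-quoted string on the same line.
    public const string Text = "a = \"line one\r\nline two\"; b = 'x';";

    // Dot-based pattern; whether it crosses \n depends on RegexOptions.Singleline.
    public const string LazyDot = @"(?<string>"".*?""|'.*?')";

    // Negated character classes; these match newlines in any mode.
    public const string NegatedClass = @"(?<string>""[^""]*""|'[^']*')";

    // Returns the value of the named group "string" for every match.
    public static string[] FindAll(string pattern, string input, RegexOptions options)
    {
        return Regex.Matches(input, pattern, options)
                    .Cast<Match>()
                    .Select(m => m.Groups["string"].Value)
                    .ToArray();
    }
}
```

With RegexOptions.None, LazyDot should find only 'x', because the dot cannot cross the \n inside the double-quoted string. With RegexOptions.Singleline, LazyDot finds both strings, and NegatedClass finds both with RegexOptions.None. One detail worth knowing when the text uses \r\n: in .NET the dot excludes only \n, so it can still consume the \r.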


Finally, I'd like to say a word about the Multiline option, as there seems to be a lot of confusion about it. People tend to assume that you have to use it whenever the target text is composed of multiple lines (i.e., whenever it contains newline characters). That's a natural assumption, but it's not true.

All multiline mode does is change the behavior of the start and end anchors, ^ and $. Normally they only match the beginning and end of the whole string, but if you turn on multiline mode they also match at the beginning and end of logical lines within the string. For example, given a string declared like this:

"fee fie\nfoe fum"

If you search for the regex ^\w+ in default mode you'll get one match: fee. But if you switch to multiline mode you'll get two: fee and foe. Similarly, \w+$ matches only fum in default mode, but it matches fie and fum in multiline mode. And you can always match a literal \n no matter what mode you're in: singleline, multiline or default.
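
The anchor behavior described above can be checked with a small sketch like the following. The class name AnchorModeDemo and the helper FindAll are hypothetical; the input string and patterns are the ones from the example.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class AnchorModeDemo
{
    // Two logical lines separated by a single \n.
    public const string Text = "fee fie\nfoe fum";

    // Returns the text of every match of pattern in input under the given options.
    public static string[] FindAll(string pattern, string input, RegexOptions options)
    {
        return Regex.Matches(input, pattern, options)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

FindAll(@"^\w+", AnchorModeDemo.Text, RegexOptions.None) should return just fee, and the same call with RegexOptions.Multiline should return fee and foe. Likewise @"\w+$" gives fum in default mode and fie, fum in multiline mode. The pattern @"fie\nfoe" matches under None, Multiline and Singleline alike.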

People also tend to assume singleline and multiline are mutually exclusive, which they aren't. I've even seen people say singleline is the default mode; also not true. Singleline changes the behavior of the dot (.), Multiline changes the behavior of the anchors (^ and $); that's all.

Alan Moore