Hi.

I wanted to ask why the .NET regex engine doesn't treat \n as an end-of-line character. Sample code:

string[] words = new string[] { "ab1", "ab2\n", "ab3\n\n", "ab4\r", "ab5\r\n", "ab6\n\r" };
Regex regex = new Regex("^[a-z0-9]+$");
foreach (var word in words)
{
    Console.WriteLine("{0} - {1}", word, regex.IsMatch(word));
}

And this is the response I get:

ab1 - True
ab2
 - True
ab3

 - False
 - False
ab5
 - False
ab6
 - False

I don't understand why the regex matches ab2\n.

Update: I don't think Multiline is a good solution. I want to validate a login so that it contains only the specified characters, and it must be a single line. If I change the constructor to use the Multiline option, ab1, ab2, ab3 and ab6 match the expression; ab4 and ab5 don't.

Thanks in advance for help.

A: 

From RegexOptions:

Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.

So basically, if you pass RegexOptions.Multiline to the Regex constructor, you are instructing that instance to let $ match at the end of every line (just before each newline character), not only at the end of the string itself.
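As a rough illustration of what that option changes for the pattern in the question, here is a minimal sketch. The class and method names (MultilineDemo, IsMatchWith) are made up for this example, and the input "!!\nab1" is an assumed test string, not one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultilineDemo
{
    // Hypothetical helper: applies the question's pattern with the given options.
    public static bool IsMatchWith(string input, RegexOptions options)
    {
        return new Regex("^[a-z0-9]+$", options).IsMatch(input);
    }

    public static void Main()
    {
        // Without Multiline, "$" cannot match before the first of two trailing "\n".
        Console.WriteLine(IsMatchWith("ab3\n\n", RegexOptions.None));      // False
        // With Multiline, "$" matches before every "\n", so the first line matches.
        Console.WriteLine(IsMatchWith("ab3\n\n", RegexOptions.Multiline)); // True

        // With Multiline, "^" also matches after every "\n", so a single
        // valid line anywhere in the input is enough for IsMatch to succeed.
        Console.WriteLine(IsMatchWith("!!\nab1", RegexOptions.None));      // False
        Console.WriteLine(IsMatchWith("!!\nab1", RegexOptions.Multiline)); // True
    }
}
```

The last case suggests why Multiline is a poor fit for validating a whole string: one acceptable line is enough to make the match succeed.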

Andrew Hare
As far as I understand it, I am specifying all the characters that may appear in the string and these characters are characters within range of [a-z0-9]. I'm not allowing \n to appear in the string, however the regex still matches string with \n. I don't understand what MultiLine has to do with it.
empi
A: 

Could be the usual Windows/Linux line-ending differences. But it's still strange that \n\n gets a false this way... Did you try with the RegexOptions.Multiline flag set?

SztupY
+3  A: 

If the string ends with a line break, RegexOptions.Multiline will not help. The $ still matches just before that last line break, since there is nothing after it.

If you want to match up to the very end of the string, treating a trailing line break like any other character, use \z:

Regex regex = new Regex(@"^[a-z0-9]+\z", RegexOptions.Multiline);

This holds in both Multiline and Singleline mode; for \z the option doesn't matter.
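To see the difference on the strings from the question, here is a small sketch that runs both anchors side by side. The class and method names (LoginCheck, MatchesDollar, MatchesStrict) are invented for this example, and no RegexOptions are passed, so ^ matches only at the start of the string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LoginCheck
{
    // "$" may match before one trailing "\n"; "\z" matches only at the very end.
    private static readonly Regex Dollar = new Regex(@"^[a-z0-9]+$");
    private static readonly Regex StrictEnd = new Regex(@"^[a-z0-9]+\z");

    public static bool MatchesDollar(string s) { return Dollar.IsMatch(s); }
    public static bool MatchesStrict(string s) { return StrictEnd.IsMatch(s); }

    public static void Main()
    {
        string[] words = { "ab1", "ab2\n", "ab3\n\n", "ab4\r", "ab5\r\n", "ab6\n\r" };
        foreach (string word in words)
        {
            // Escape control characters so each result stays on one console line.
            string shown = word.Replace("\r", "\\r").Replace("\n", "\\n");
            Console.WriteLine("{0}: $ = {1}, \\z = {2}",
                shown, MatchesDollar(word), MatchesStrict(word));
        }
    }
}
```

Only "ab1" passes the \z version. With $, "ab2\n" passes as well, which is the behaviour the question asks about.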

Smazy
Smazy, you are right. I forgot about \Z \z metacharacters (+1)
eu-ge-ne
It works, but do you know if this approach can cause any other problems? What is the difference between \z and $?
empi
\z matches only the end of the string, regardless of newlines
eu-ge-ne
A: 

Just to add more detail to Smazy's answer, this is an extract from Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan. Copyright 2009 Jan Goyvaerts and Steven Levithan, 978-0-596-2068-7

The difference between ‹\Z› and ‹\z› comes into play when the last character in your subject text is a line break. In that case, ‹\Z› can match at the very end of the subject text, after the final line break, as well as immediately before that line break. The benefit is that you can search for ‹omega\Z› without having to worry about stripping off a trailing line break at the end of your subject text. When reading a file line by line, some tools include the line break at the end of the line, whereas others don’t; ‹\Z› masks this difference. ‹\z› matches only at the very end of the subject text, so it will not match text if a trailing line break follows. The anchor ‹$› is equivalent to ‹\Z›, as long as you do not turn on the “^ and $ match at line breaks” option. This option is off by default for all regex flavors except Ruby. Ruby does not offer a way to turn this option off. Just like ‹\Z›, ‹$› matches at the very end of the subject text, as well as before the final line break, if any.
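The behaviour described in the extract can be checked in .NET with a short sketch like the one below. The names AnchorDemo and EndsWithOmega are hypothetical, and the three subject strings are assumed examples built around the book's omega pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Hypothetical helper: tests "omega" followed by one of "$", "\Z" or "\z".
    // No options are passed, so "$" is not in "match at line breaks" mode.
    public static bool EndsWithOmega(string subject, string anchor)
    {
        return Regex.IsMatch(subject, "omega" + anchor);
    }

    public static void Main()
    {
        string[] anchors = { "$", @"\Z", @"\z" };
        string[] subjects = { "omega", "omega\n", "omega\n\n" };
        foreach (string subject in subjects)
        {
            foreach (string anchor in anchors)
            {
                // Show "\n" in escaped form so the output stays readable.
                Console.WriteLine("{0} with {1}: {2}",
                    subject.Replace("\n", "\\n"), anchor, EndsWithOmega(subject, anchor));
            }
        }
    }
}
```

All three anchors accept "omega", only $ and \Z accept "omega\n", and none of them accepts "omega\n\n", because only a single final line break is tolerated.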

Of course, I wouldn't have found it without Smazy's answer.

empi
+2  A: 

The .NET regex engine does treat \n as end-of-line. And that's a problem if your string has Windows-style \r\n line breaks. With RegexOptions.Multiline turned on, $ matches between \r and \n rather than before \r.

$ also matches at the very end of the string just like \z. The difference is that \z can match only at the very end of the string, while $ also matches before a trailing \n. When using RegexOptions.Multiline, $ also matches before any \n.

If you're having trouble with line breaks, a trick is to first do a search-and-replace that replaces every \r with nothing, to make sure all your lines end with \n only.
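A minimal sketch of that trick, assuming a three-line input with Windows-style line breaks. The names LineEndings, CountValidLines and StripCarriageReturns are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineEndings
{
    // Counts lines made up only of [a-z0-9], using Multiline "^" and "$".
    public static int CountValidLines(string text)
    {
        return Regex.Matches(text, "^[a-z0-9]+$", RegexOptions.Multiline).Count;
    }

    // Removes every "\r" so that all line breaks become a bare "\n".
    public static string StripCarriageReturns(string text)
    {
        return text.Replace("\r", "");
    }

    public static void Main()
    {
        string text = "ab1\r\nab2\r\nab3";

        // "$" matches before "\n" but not before "\r", so only the last line matches.
        Console.WriteLine(CountValidLines(text));                       // 1

        // After stripping "\r", every line ends where "$" can match.
        Console.WriteLine(CountValidLines(StripCarriageReturns(text))); // 3
    }
}
```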

Jan Goyvaerts
A: 

Use regex options:

string[] words = new string[] { "ab1", "ab2\n", "ab3\n\n", "ab4\r", "ab5\r\n", "ab6\n\r" };
foreach (var word in words)
{
    // Regex.IsMatch(input, pattern, options) is a static method, so call it on the type.
    // Note: Singleline only changes what '.' matches; it does not change how '$' treats a trailing \n.
    Console.WriteLine("{0} - {1}", word, Regex.IsMatch(word, "^[a-z0-9]+$",
        RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace));
}

Dre