Hello,

I am trying to find every "a -> b, c, d" pattern in an input string. The pattern I am using is the following:

"^[ \t]*(\\w+)[ \t]*->[ \t]*(\\w+)((?:,[ \t]*\\w+)*)$"

This is a C# string literal. The "\t" is a tab character (a single-escaped literal, interpreted by the .NET String API). The "\w" is the well-known predefined regex word class, double-escaped so that the .NET String API yields "\w", which the .NET Regex API then interprets as the word character class.

The input is :

a -> b
b -> c
c -> d

The function is :

private void ParseAndBuildGraph(String input) {
    MatchCollection mc = Regex.Matches(input, "^[ \t]*(\\w+)[ \t]*->[ \t]*(\\w+)((?:,[ \t]*\\w+)*)$", RegexOptions.Multiline);
    foreach (Match m in mc) {
        Debug.WriteLine(m.Value);
    }
}

The output is :

c -> d

Actually, the problem seems to be with the end-of-line anchor "$". If I insert a "\r" before "$", it works, but I thought "$" would match at any line termination (with the Multiline option), especially a \r\n in a Windows environment. Is that not the case?

+1  A: 

Do you mean \t as a regex \t or a C# \t? I always use verbatim string literals with regex:

@"^[ \t]*(\w+)[ \t]*->[ \t]*(\w+)(,[ \t]*\w+)*$"

(the only thing you need to escape is " to "")
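
To make the difference between the two literal styles concrete, here is a minimal sketch. The class and member names are made up for illustration, and the pattern is shortened to the leading-whitespace part. It only shows that a tab-indented word is matched whether the tab reaches the regex engine as a real tab character (regular literal) or as the escape sequence backslash-t (verbatim literal).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; not part of the original question's code.
static class TabLiteralDemo
{
    // Regular literal: the C# compiler turns \t into a real tab character
    // and \\w into \w before the regex engine sees the pattern.
    public const string RegularPattern = "^[ \t]*(\\w+)$";

    // Verbatim literal: the regex engine receives backslash + 't' and
    // resolves that escape to a tab itself.
    public const string VerbatimPattern = @"^[ \t]*(\w+)$";

    // Returns true if a word indented with one tab matches the pattern.
    public static bool MatchesTabIndented(string pattern)
    {
        return Regex.IsMatch("\tabc", pattern);
    }

    static void Main()
    {
        Console.WriteLine(MatchesTabIndented(RegularPattern));  // True
        Console.WriteLine(MatchesTabIndented(VerbatimPattern)); // True
    }
}
```

The verbatim form is simply easier to read, because every backslash in the source is a backslash in the pattern.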

Marc Gravell
Actually, \t refers to a tab. I do not want to use \\s for whitespace because I do not want the user to input line endings; only spaces and tabs are allowed. But my question remains the same with the simplified pattern "^(\\w+) -> (\\w+)((?:, \\w+)*)$". I updated the original question.
Aurélien Ribon
<strike> @Aurélien - yes, but without the verbatim string literal the regex doesn't see `\t` at *all*. </strike>
Marc Gravell
@Marc Gravell - It is a tab-or-space option; unless regex doesn't allow embedded tabs, he literally means the tab character, not \t.
Guvante
@Guvante is right: the regex compiler sees a literal tab character, and that's what it matches (I tested it). With your version, it sees the sequence `\t`, the escape sequence for a tab; it works either way.
Alan Moore
OK; interesting to know, thanks. I was tempted to delete, but this is such a common mistake with C# / regex, so I'm leaving it for posterity.
Marc Gravell
...and "always use verbatim strings for regexes" is always worth repeating.
Alan Moore
I would agree, but how do you insert a tab character in a verbatim string?
Aurélien Ribon
@Aurélien: If you really need a string with a tab in it, use the old-style literal. But for the purpose of matching a tab with a regex, backslash-'t' works just as well, as I said.
Alan Moore
Indeed, I thought Regex wouldn't accept \t, but I learnt that it does :) Thank you, that helped me a lot. I'll use @"" strings now!
Aurélien Ribon
+6  A: 

This surprised me, too. In .NET regexes, $ doesn't match before a line separator, it matches before a linefeed--the character \n. This behavior is consistent with Perl's regex flavor, but it's still wrong, in my opinion. According to the Unicode standard, $ should match before any of:

\n, \r\n, \r, \x85, \u2028, \u2029, \v or \f

...and never match between \r and \n. Java complies with that (except \v and \f), but .NET, which came out long after Java, and whose Unicode support is at least as good as Java's, only recognizes \n. You'd think they would at least handle \r\n correctly, given how strongly Microsoft is associated with that line separator.

Be aware that . follows the same pattern: it doesn't match \n (unless Singleline mode is set), but it does match \r. If you had used .+ instead of \w+ in your regex, you might not have noticed this problem; the carriage-return would have been included in the match, but the console would have ignored it when you printed the results.
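
The behavior described above can be reproduced with a short sketch. The class and method names are hypothetical, and the input is assumed to use \r\n between lines, as in the question. The first method counts matches of the question's pattern; the second uses .+ on the right-hand side to show the carriage return being swallowed into the capture.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class illustrating the behavior of $ and . in .NET.
static class DollarAnchorDemo
{
    // The pattern from the question, written as a verbatim literal.
    const string Pattern = @"^[ \t]*(\w+)[ \t]*->[ \t]*(\w+)((?:,[ \t]*\w+)*)$";

    // Counts the lines matched in Multiline mode.
    public static int CountMatches(string input)
    {
        return Regex.Matches(input, Pattern, RegexOptions.Multiline).Count;
    }

    // Simplified variant with .+ in place of \w+ for the right-hand side;
    // returns what the second group captured on the first matching line.
    public static string FirstDotCapture(string input)
    {
        Match m = Regex.Match(input, @"^(\w+) -> (.+)$", RegexOptions.Multiline);
        return m.Groups[2].Value;
    }

    static void Main()
    {
        // With \r\n, $ only finds a position it accepts on the last line.
        Console.WriteLine(CountMatches("a -> b\r\nb -> c\r\nc -> d")); // 1
        // With plain \n, every line matches.
        Console.WriteLine(CountMatches("a -> b\nb -> c\nc -> d"));     // 3
        // The dot matches \r, so the capture is "b" plus a carriage return.
        Console.WriteLine(FirstDotCapture("a -> b\r\nb -> c").Length); // 2
    }
}
```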

EDIT: If you want to allow for the carriage return without including it in your results, you can replace the anchor with a lookahead: (?=\r?\n|$).
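
As a rough sketch of the two options, the code below compares putting an optional \r in front of the anchor with using a lookahead in its place. The class and member names are hypothetical. The lookahead is written here as (?=\r?\n|$), which is an assumption chosen so that a final line without a terminator still matches.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class comparing two ways to tolerate \r\n line endings.
static class CrWorkaroundDemo
{
    // Option 1: consume an optional carriage return before the anchor.
    // The \r becomes part of the overall match value.
    public const string OptionalCrPattern =
        @"^[ \t]*(\w+)[ \t]*->[ \t]*(\w+)((?:,[ \t]*\w+)*)\r?$";

    // Option 2: only look ahead for the line ending, so nothing is consumed.
    // The |$ branch is an assumption added for an unterminated last line.
    public const string LookaheadPattern =
        @"^[ \t]*(\w+)[ \t]*->[ \t]*(\w+)((?:,[ \t]*\w+)*)(?=\r?\n|$)";

    // Returns the full text of every match in Multiline mode.
    public static string[] MatchValues(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern, RegexOptions.Multiline);
        var values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Value;
        }
        return values;
    }

    static void Main()
    {
        string input = "a -> b\r\nb -> c\r\nc -> d";

        // Carriage returns are shown as a visible \r for readability.
        Console.WriteLine(string.Join("|", MatchValues(input, OptionalCrPattern)).Replace("\r", "\\r"));
        // a -> b\r|b -> c\r|c -> d

        Console.WriteLine(string.Join("|", MatchValues(input, LookaheadPattern)).Replace("\r", "\\r"));
        // a -> b|b -> c|c -> d
    }
}
```

In both variants the capture groups themselves never contain the carriage return; the difference only shows up in the overall match value.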

Alan Moore
+1 This is hard to believe...
Tim Pietzcker
Thank you for your answer. And indeed, that's a surprising conclusion :)
Aurélien Ribon
Curious if there is a flag that can be set to make the system match the data, like the O_BINARY flag in C/C++.
Dave
@Dave: Not that I can find. I finally managed to find a mention of this issue, and the only recourse they offer is to preface the anchor with `\r?`: http://msdn.microsoft.com/en-us/library/h5181w5w.aspx#End
Alan Moore
Maybe hard to believe, particularly coming from the developer of the main OS to use \r\n for line breaks, but definitely true. My recommended workaround is to first strip all \r from your input string.
Jan Goyvaerts
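
The strip-the-carriage-returns workaround from the last comment might look like the sketch below. The class and method names are hypothetical. It removes every \r before matching, so the unmodified pattern from the question works on \r\n input.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class: normalize line endings, then match as usual.
static class StripCrDemo
{
    // The pattern from the question, unchanged apart from the verbatim literal.
    const string Pattern = @"^[ \t]*(\w+)[ \t]*->[ \t]*(\w+)((?:,[ \t]*\w+)*)$";

    // Removes all carriage returns, leaving \n as the only line separator,
    // and returns the number of lines the pattern matches.
    public static int CountMatchesAfterStrip(string input)
    {
        string normalized = input.Replace("\r", "");
        return Regex.Matches(normalized, Pattern, RegexOptions.Multiline).Count;
    }

    static void Main()
    {
        Console.WriteLine(CountMatchesAfterStrip("a -> b\r\nb -> c\r\nc -> d")); // 3
    }
}
```

Note that this also drops any lone \r in the input, which is acceptable only if such characters carry no meaning in the data.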
+1  A: 

Generally in C, C++, C#, strings within the program use "\n" as line separator. "\r\n" appears only at the I/O layer, if textmode translations are turned on.

Ben Voigt
Good point. It's always surprised me how seldom this issue comes up, and I guess that's one of the reasons. But I still think they were wrong not to go with the Unicode standard.
Alan Moore
\r\n is the Microsoft line termination. For example, Notepad.exe does not recognize the "\n" termination, and only recognizes "\r\n". For Unix users, "\n" is the usual line termination, and for Mac users, "\r" is the way to go. That's a stupid mess? I agree :-)
Aurélien Ribon
\r\n is the line terminator (in Windows) in a text file. \n is the terminator in code. The I/O layer translates between them if and only if you open the file in "text mode".
Ben Voigt