I have a regex that I'm using to split a string, and I'm getting weird results.

        string subjectString = "Triage|Follow Up|QA";
        string[] splitArray = null;
        try
        {
            splitArray = System.Text.RegularExpressions.Regex.Split(subjectString, @"(?<=(^|[^\\]))\|");

            foreach (var item in splitArray)
            {
                System.Diagnostics.Debug.Print(item);
            }
        }
        catch
        {
        }

The items being printed are:

Triage
e
Follow Up
p
QA

The regex behaves correctly in RegexBuddy, but not in C#. Any ideas on what's causing the weird behavior? Extra points for explaining why the split function is acting the way it is.

+5  A: 

The capturing group (…) in your look-behind assertion is causing this: Regex.Split adds the text captured by any group to the returned array. Try a non-capturing group instead:

@"(?<=(?:^|[^\\]))\|"

Or no additional grouping at all:

@"(?<=^|[^\\])\|"
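
For comparison, here is a minimal sketch that runs the original pattern and the ungrouped one against the string from the question. The class name SplitDemo and the helper SplitOnUnescapedPipes are made up for illustration; only Regex.Split and the patterns come from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

static class SplitDemo
{
    // Hypothetical helper: splits on '|' unless it is preceded by a backslash.
    // The look-behind has no capturing group, so nothing extra is returned.
    public static string[] SplitOnUnescapedPipes(string input)
    {
        return Regex.Split(input, @"(?<=^|[^\\])\|");
    }

    static void Main()
    {
        string subject = "Triage|Follow Up|QA";

        // Original pattern: the group captures the character before each '|',
        // and Regex.Split adds that captured text to the result.
        string[] withCapture = Regex.Split(subject, @"(?<=(^|[^\\]))\|");
        Console.WriteLine(string.Join(",", withCapture)); // Triage,e,Follow Up,p,QA

        // Ungrouped pattern: only the pieces between the delimiters come back.
        string[] withoutCapture = SplitOnUnescapedPipes(subject);
        Console.WriteLine(string.Join(",", withoutCapture)); // Triage,Follow Up,QA
    }
}
```

The "e" and "p" in the first result are the last characters of "Triage" and "Follow Up", which is exactly what the group inside the look-behind matched.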

Gumbo

+1  A: 

RegexBuddy does not yet emulate .NET's behavior of including text matched by capturing groups in the array returned by Split(). To get the same behavior in .NET as in RegexBuddy, either change all your capturing groups (...) into non-capturing groups (?:...) or use RegexOptions.ExplicitCapture to turn all unnamed groups into non-capturing groups.
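
As a rough sketch of the second option, the pattern from the question can stay exactly as written if RegexOptions.ExplicitCapture is passed to Regex.Split. The class and method names here are hypothetical, chosen only for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class ExplicitCaptureDemo
{
    // Hypothetical helper: keeps the asker's pattern unchanged and relies on
    // ExplicitCapture to treat the unnamed (...) group as non-capturing.
    public static string[] SplitWithExplicitCapture(string input)
    {
        return Regex.Split(input, @"(?<=(^|[^\\]))\|", RegexOptions.ExplicitCapture);
    }

    static void Main()
    {
        string[] parts = SplitWithExplicitCapture("Triage|Follow Up|QA");
        Console.WriteLine(string.Join(",", parts)); // Triage,Follow Up,QA
    }
}
```

Named groups such as (?<name>...) still capture under this option, so it only helps when the groups you want to silence are unnamed.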

By including the capturing groups in the returned array, .NET's Split() function makes it possible to include both the delimiters matched by the regular expression and the text between the delimiters in the array. Splitting using the regex <[^>]+> gets you the text between the HTML tags, without the HTML tags. Splitting using the regex (<[^>]+>) gets you the text between the HTML tags including the HTML tags. (These simple regexes assume the input consists of valid HTML without any HTML comments.)
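
To make the difference concrete, here is a small sketch of both HTML splits. The sample input and the names TagSplitDemo, TextOnly and TextAndTags are assumptions made for the example; the two regexes are the ones quoted above.

```csharp
using System;
using System.Text.RegularExpressions;

static class TagSplitDemo
{
    // No capturing group: the tags act purely as delimiters and are dropped.
    public static string[] TextOnly(string html)
    {
        return Regex.Split(html, @"<[^>]+>");
    }

    // Whole delimiter wrapped in a capturing group: the tags are kept in the result.
    public static string[] TextAndTags(string html)
    {
        return Regex.Split(html, @"(<[^>]+>)");
    }

    static void Main()
    {
        string html = "one<b>two</b>three"; // made-up sample input

        Console.WriteLine(string.Join(" | ", TextOnly(html)));
        // one | two | three

        Console.WriteLine(string.Join(" | ", TextAndTags(html)));
        // one | <b> | two | </b> | three
    }
}
```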

Jan Goyvaerts