I have some data in this form:

@"Managers Alice, Bob, Charlie
Supervisors Don, Edward, Francis"

I need a flat output like this:

@"Managers Alice
Managers Bob
Managers Charlie
Supervisors Don
Supervisors Edward
Supervisors Francis"

The actual "job title" above could be any single word; there is no fixed list to work from.

Replacing the ", " with \r\n is easy enough, as is the first replacement:

Replace (^|\r\n)(\S+\s)([^,\r\n]*),\s
With $1$2$3\r\n$2

But capturing the other names and applying the same prefix is what is eluding me today. Any suggestions?

I'm looking for a series of one or more Regex.Replace() calls only, without any LINQ or procedural code in C#, which would of course make this trivial. The implementation is not directly in C# code: I'm configuring a generic parsing tool that uses a series of .NET regular expressions to transform incoming data from a variety of sources for several uses.

A: 

Why use a regex if you can do it with LINQ?

string s = "Managers Alice, Bob, Charlie\r\nSupervisors Don, Edward, Francis";

var result =
    from line in s.Split(new string[] { "\r\n" }, StringSplitOptions.None)
    let parts = line.Split(new char[] { ' ' }, 2)
    let title = parts[0]
    let names = parts[1]
    from name in names.Split(new char[] { ',' })
    select title.Trim() + " " + name.Trim();

string.Join("\r\n", result) is

Managers Alice
Managers Bob
Managers Charlie
Supervisors Don
Supervisors Edward
Supervisors Francis
dtb
Thanks, but I'm looking for a RegEx solution, not LINQ.
richardtallent
A: 

Since you stressed the need for regex, here's a solution that should work for you.

string input = @"Managers Alice, Bob, Charlie
Supervisors Don, Edward, Francis";
string pattern = @"(?<Title>\w+)\s+(?:(?<Names>\w+)(?:,\s+)?)+";

foreach (Match m in Regex.Matches(input, pattern))
{
    Console.WriteLine("Title: {0}", m.Groups["Title"].Value);
    foreach (Capture c in m.Groups["Names"].Captures)
    {
        Console.WriteLine(c.Value);
    }

    Console.WriteLine();
}

The main concept is to use the named "Title" group to store the job titles and reference them later. The names are stored in the capture collection. The pattern will only work if the data is properly formatted of course, as given in your sample data.

The pattern breakdown is as follows: (?<Title>\w+)\s+(?:(?<Names>\w+)(?:,\s+)?)+

  • (?<Title>\w+)\s+ - matches the title before the first space and places it in a named Title group. At least one space must follow.
  • (?:(?<Names>\w+)(?:,\s+)?)+ - the name is stored in a Names group via the (?<Names>\w+) part. A comma followed by at least one space is matched (but not captured, since (?:...) is used) via the (?:,\s+)? part, which is optional because of the trailing ?. Finally, this whole portion of the pattern is enclosed in a group that has to match at least once, (?:...)+, but is not captured, since we only capture the parts we are interested in.
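
For readers who can run procedural code, here is a minimal sketch of how the Title group and the Names capture collection described above might be combined into the flat output. The class and method names (CaptureFlattener, Flatten) are made up for illustration, and it assumes well-formed input like the sample data.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureFlattener
{
    // Same pattern as in the answer above.
    private const string Pattern =
        @"(?<Title>\w+)\s+(?:(?<Names>\w+)(?:,\s+)?)+";

    // Returns one "Title Name" entry per captured name.
    public static List<string> Flatten(string input)
    {
        var lines = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            string title = m.Groups["Title"].Value;
            // Captures holds every value the Names group matched,
            // not just the last one.
            foreach (Capture c in m.Groups["Names"].Captures)
            {
                lines.Add(title + " " + c.Value);
            }
        }
        return lines;
    }
}
```

Calling string.Join("\r\n", CaptureFlattener.Flatten(input)) on the sample data should give the six flat lines from the question.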
Ahmad Mageed
I have the ability to do as many RegEx.Replace() calls as I need, but I don't have the ability in this tool to write any C# code. I think this is doable in RegEx, it may require balancing groups.
richardtallent
@richard that changes things of course. The problem I see with using only `Regex.Replace` is how to split the names up. `Regex.Replace` itself is recursive in nature, but using my pattern above with a replacement pattern would only give you the job title and the final name (i.e. the last value matched by the `Names` group). Splitting each name and prefixing the title to each one with just `Regex.Replace` is going to be a challenge, if it's even possible. As of right now I can't think of a way to pull that off.
Ahmad Mageed
A: 

You could search for

^(\w+)[ \t]+(\w+),[ \t]+(.+)$

and replace all with

\1 \2\r\n\1 \3

You need to apply it twice to your example, three times if the list of managers grows to four, etc.

So, in C#:

resultString = Regex.Replace(subjectString, @"^(\w+)[ \t]+(\w+),[ \t]+(.+)$", "$1 $2\r\n$1 $3", RegexOptions.Multiline);

Explanation:

^: Match the start of the line

(\w+)[ \t]+: Match one or more word characters (the title) and capture the match; then match the following spaces or tabs

(\w+): Match the next "word", then

,[ \t]+(.+)$: Match a comma, spaces, and then whatever follows until the end of the line. This will only match if the line still contains content that needs to be split up.
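
If a loop is available, the "apply it N times" step could be automated by repeating the replacement until the text stops changing. This is only a sketch: the names RepeatedSplit and Flatten are invented here, and the replacement is a regular (non-verbatim) string so that \r\n is an actual line break.

```csharp
using System.Text.RegularExpressions;

public static class RepeatedSplit
{
    // Same pattern as above; Multiline lets ^ and $ match at line boundaries.
    private static readonly Regex FirstName = new Regex(
        @"^(\w+)[ \t]+(\w+),[ \t]+(.+)$",
        RegexOptions.Multiline);

    // Each pass splits the first name off every line that still has a comma.
    // Every replacement removes one comma, so the loop terminates.
    public static string Flatten(string input)
    {
        string current = input;
        while (true)
        {
            string next = FirstName.Replace(current, "$1 $2\r\n$1 $3");
            if (next == current)
            {
                return current; // nothing left to split
            }
            current = next;
        }
    }
}
```

On the sample data this takes two passes that change the text, plus a final pass that confirms nothing is left to split.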

Tim Pietzcker
The actual number of "names" is not discrete... could be 1 or 100. Most lines have fewer than 10, so while I'd rather not leave edge cases lying around, I may have to brute-force it like this.
richardtallent
Well, if you can't write loops in your application you don't have much of a choice...
Tim Pietzcker
A: 

Here's a pure-Replace solution:

string s = @"Managers Alice, Bob, Charlie
Supervisors Don, Edward, Francis";
Regex r = new Regex(@"(?:^\w+)?( \w+)(?<=^(\w+)\b.*)[,\r\n]*",
    RegexOptions.Multiline);
string s1 = r.Replace(s, "$2$1\r\n");

After each name is matched, the lookbehind goes back to the beginning of the current line to capture the title. The (?:^\w+)? and [,\r\n]* are only there to consume the parts of the string you don't want to keep.

Alan Moore
This worked beautifully, thanks! I made one minor change: added the misc. construct `(?m)` at the beginning to avoid having to set the Multiline flag in the RegEx constructor (the regex is loaded from a table in my case, not called directly, so flags have to be set inline).
richardtallent
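
The inline-flag variant mentioned in the last comment can be sketched as follows. The wrapper names (InlineMultilineDemo, Flatten) are hypothetical; the pattern is the one from the answer with (?m) prepended, so no RegexOptions argument is needed.

```csharp
using System.Text.RegularExpressions;

public static class InlineMultilineDemo
{
    // (?m) switches on Multiline inside the pattern itself, which suits
    // tools that load the regex from a table and cannot pass options.
    private const string Pattern =
        @"(?m)(?:^\w+)?( \w+)(?<=^(\w+)\b.*)[,\r\n]*";

    // $2 is the title found by the lookbehind, $1 is " Name".
    public static string Flatten(string input)
    {
        return Regex.Replace(input, Pattern, "$2$1\r\n");
    }
}
```

Because every match is replaced with a line ending in \r\n, the result carries a trailing line break after the last name, which may need trimming depending on the consumer.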