I have a text field that accepts user input in the form of delimited lists of strings. I have two main delimiters, a space and a comma.

If an item in the list contains more than one word, a user can delineate it by enclosing it in quotes.

Sample Input:

Apple, Banana Cat, "Dog starts with a D" Elephant Fox "G is tough", "House"

Desired Output:

Apple
Banana
Cat
Dog starts with a D
Elephant
Fox
G is a tough one
House

I've been working on getting a regex for this, and I can't figure out how to allow the commas. Here is what I have so far:

 Regex.Matches(input, @"(?<match>\w+)|\""(?<match>[\w\s]*)""")
             .Cast<Match>()
             .Select(m => m.Groups["match"].Value.Replace("\"", ""))
             .Where(x => x != "")
             .Distinct()
             .ToList()
+2  A: 

That regex is pretty smart if it can turn "G is tough" into G is a tough one :-)

On a more serious note, code up a parser and don't try to rely on a single regex to do this for you.

You'll find you learn more, the code will be more readable, and you won't have to concern yourself with edge cases that you haven't even figured out yet, like:

Apple, Banana Cat, "Dog, not elephant, starts with a D" Elephant Fox

A simple parser for that situation would be:

state = whitespace
word = ""
for each character in (string + " "):
    if state is whitespace:
        if character is not whitespace:
            word = character
            state = inword
    else:
        if character is whitespace:
            process word
            word = ""
            state = whitespace
        else:
            word = word + character

and it's relatively easy to add support for quoting:

state = whitespace
inquote = no
word = ""
for each character in (string + " "):
    if state is whitespace:
        if character is not whitespace:
            state = inword
            if character is a quote mark:
                inquote = yes
            else:
                word = character
    else:
        if character is whitespace and inquote is no:
            process word
            word = ""
            state = whitespace
        else:
            if character is a quote mark:
                inquote = not inquote
            else:
                word = word + character

Note that I haven't tested these thoroughly but I've done these quite a bit in the past so I'm quietly confident. It's only a short step from there to one that can also allow escaping (for example, if you want quotes within quotes like "The \" character is inside").

Getting a single regex to handle multiple separators isn't that hard. Getting it to track state, such as whether you're inside quotes, so that separators can be treated differently, is another level.
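For reference, here is a minimal C# sketch of the quote-aware pseudocode above, adapted so that commas count as separators too. The names `ListParser` and `Parse` are made up for illustration, and the two pseudocode states are folded into a pair of flags, so treat it as one possible reading rather than a tested implementation. It assumes quotes are balanced: an item inside an unterminated quote is silently dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ListParser
{
    // Splits on whitespace and commas; text inside double quotes stays one item.
    public static List<string> Parse(string input)
    {
        var items = new List<string>();
        var word = new StringBuilder();
        bool inWord = false;   // currently collecting an item
        bool inQuote = false;  // currently between a pair of quotes

        // The trailing space flushes the final item, as in the pseudocode.
        foreach (char c in input + " ")
        {
            bool isSeparator = c == ',' || char.IsWhiteSpace(c);

            if (c == '"')
            {
                // A quote opens or closes a quoted run and is never stored.
                inQuote = !inQuote;
                inWord = true;
            }
            else if (isSeparator && !inQuote)
            {
                // A separator outside quotes ends the current item, if any.
                if (inWord)
                {
                    items.Add(word.ToString());
                    word.Clear();
                    inWord = false;
                }
            }
            else
            {
                word.Append(c);
                inWord = true;
            }
        }
        return items;
    }
}
```

Calling `ListParser.Parse` on the sample input from the question should give the eight items in their original order, and commas inside quotes are kept as part of the item.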

paxdiablo
Thanks for this. I was basically hoping to NOT have to write a parser. I definitely think you're right about needing to do it, though. Looks like good pseudocode. I'm pretty proficient at writing parsers, I was just hoping a regex would be available. Thanks again.
Mark
@Mark, I'd give some serious thought to using a regex to get the next item, then stripping that item off the front of the input, something like: (1) strip off `^[ ,]*`, stop if the string is empty; (2) if the next char is `"`, get `^"[^"]*"`, remove the `"`'s, then strip that length off and go back to 1; (3) otherwise get `^[^ ,]*[ ,]`, remove the trailing character, strip off that length and go back to 1. That might simplify the parser considerably.
paxdiablo
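A rough C# sketch of the three-step loop from the comment above might look like this. `RegexPeeler` and `Parse` are hypothetical names. Two details are assumptions on my part rather than part of the comment: step 3 uses `^[^ ,]+` with no trailing separator so that the last item in the string still matches, and an unterminated quote is handled by taking the rest of the input as one item.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexPeeler
{
    public static List<string> Parse(string input)
    {
        var items = new List<string>();
        string rest = input;

        while (true)
        {
            // (1) Strip leading separators; stop when nothing is left.
            rest = Regex.Replace(rest, @"^[ ,]*", "");
            if (rest.Length == 0) break;

            // (2) A quoted item runs to the next quote;
            // (3) otherwise take everything up to the next separator or the end.
            Match m = rest[0] == '"'
                ? Regex.Match(rest, @"^""([^""]*)""")
                : Regex.Match(rest, @"^([^ ,]+)");

            if (!m.Success)
            {
                // Only reachable for an unterminated quote (assumption:
                // keep the remainder, minus the opening quote, as one item).
                items.Add(rest.Substring(1));
                break;
            }

            items.Add(m.Groups[1].Value);
            // Peel the whole match, quotes included, off the front.
            rest = rest.Substring(m.Length);
        }
        return items;
    }
}
```

Because each item is peeled off the front of the string in turn, the results come out in input order.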
+1  A: 

You should choose between using spaces or commas as delimiters. Using both is a bit confusing. If that choice is not yours to make, I would grab the things between quotes first. When they are gone, you can just replace all commas with spaces and split the line on spaces.

Brian Clements
A: 

You could use two regexes. The first matches the quoted sections, which you then remove; the second matches the remaining words.

string pat = "\"(.*?)\"", pat2 = "(\\w+)";
string x = "Apple, Banana Cat, \"Dog starts with a D\" Elephant Fox \"G is tough\", \"House\"";

IEnumerable<Match> combined = Regex.Matches(Regex.Replace(x, pat, ""), pat2).OfType<Match>().Union(Regex.Matches(x, pat).OfType<Match>()).Where(m => m.Success);

foreach (Match m in combined)
    Console.WriteLine(m.Groups[1].ToString());

Let me know if this isn't what you were looking for.

Jagermeister
Love the simplicity, but the order is messed up, which I think is a requirement for such things.
Peet Brits
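If the order matters, one option is to match quoted and unquoted items with a single alternation, so that `Regex.Matches` returns them in the order they appear. This is only a sketch: `OrderedMatcher`, `Parse`, and the group name `item` are made up for illustration. It relies on .NET allowing the same group name in both alternatives, as the regex in the question already does, and it assumes quotes are balanced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class OrderedMatcher
{
    public static List<string> Parse(string input)
    {
        // Left alternative: a quoted run (quotes excluded from the group).
        // Right alternative: a run of characters that are not whitespace,
        // commas, or quotes.
        const string pattern = @"""(?<item>[^""]*)""|(?<item>[^\s,""]+)";

        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups["item"].Value)
                    .Where(s => s.Length > 0)   // skip empty quoted items
                    .ToList();
    }
}
```

Since the matches are produced left to right in one pass, no second regex or `Union` is needed, and commas inside quotes stay part of the item.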
A: 

I like paxdiablo's parser, but if you want to use a single regex, then consider my modified version of a CSV regex parser.

Step 1: the original

string regex = "((?<field>[^\",\\r\\n]+)|\"(?<field>([^\"]|\"\")+)\")(,|(?<rowbreak>\\r\\n|\\n|$))";

Step 2: using multiple delimiters

char quoter = '"';       // quotation mark
string delimiter = " ,"; // either space or comma
string regex = string.Format("((?<field>[^\\r\\n{1}{0}]*)|[{1}](?<field>([^{1}]|[{1}][{1}])*)[{1}])([{0}]|(?<rowbreak>\\r\\n|\\n|$))", delimiter, quoter);

Using a simple loop to test:

Regex re = new Regex(regex);
foreach (Match m in re.Matches(input))
{
    string field = m.Result("${field}").Replace("\"\"", "\"").Trim();
    // string rowbreak = m.Result("${rowbreak}");
    if (field != string.Empty)
    {
        // Print(field);
    }
}

We get the output:

Apple
Banana
Cat
Dog starts with a D
Elephant
Fox
G is tough
House

That's it!

Look at the original CSV regex parser for ideas on handling the matched regex data. You might have to modify it slightly, but you'll get the idea.

Just for interest's sake, if you are crazy enough to want to use multiple characters as a single delimiter, then consider this answer.

Peet Brits