tags:

views:

114

answers:

2

I need to parse strings inputs where the columns are separated by columns and any field that contains a comma in the data is wrapped in quotes (commas separated, quoted text identifiers). For this project I need to remove the quotes and any commas that occur between pairs of quotes. Basically, I need to remove commas and quotes that are contained in fields while preserving the commas that are used to separate the fields. Here's a little code I put together that handles the simple scenario:

    // Sample input 1: This works and covers 99% of the records that I need to parse.
    string str1 = "[email protected],2010/03/27 12:2:02,,some_first_name,some_last_name,,\"This Address Works, Suite 200\",Some City,TN,09876-5432,9795551212x123,XYZ";
    str1 = Regex.Replace(str1, "\"([^\"^,]*),([^\"^,]*)\"", "$1$2");
    Console.WriteLine(str1);
    // Outputs: [email protected],2010/03/27 12:2:02,,some_first_name,some_last_name,,This Address Works Suite 200,Some City,TN,09876-5432,9795551212x123,XYZ

Although this code works for most of my records, it doesn't work when a field contains more than one commas. What I would like to do is modify the code so that it remove each instance of a comma contained within the column no matter how many commas there are in the field. I don't want to hard code only handling 2 commas, or 3 commas, or 25 commas. The code should just remove all the commas in the field. Below is an example of what my code doesn't handle properly.

    // Sample input 2: This doesn't work since there is more than 1 comma between the quotes.
    string str2 = "[email protected],2010/03/27 12:2:02,,some_first_name,some_last_name,,\"i,l,k,e, c,o,m,m,a,s, i,n ,m,y, f,i,e,l,d\",Some City,TN,09876-5432,9795551212x123,XYZ";
    str2 = Regex.Replace(str2, "\"([^\"^,]*),([^\"^,]*)\"", "$1$2");
    Console.WriteLine(str2);
    // Desired output: [email protected],2010/03/27 12:2:02,,some_first_name,some_last_name,,i like commas in my field,Some City,TN,09876-5432,9795551212x123,XYZ

Any help would be appreciated for this Regular Expression newbie.

+2  A: 

Matching quotes and regular expression don't go hand in hand, and you are probably better of using a CSV parser, as Michael Madsen suggested.
However, if you know the quotes only occur as you expect, you can do something like the following:

str2 = Regex.Replace(str2, "\"[^\"]*\"",
                     match => match.Value.Trim('\"').Replace(",", ""));
Kobi
Shouldn't that last bit be `.Replace(",", "")`?
Alan Moore
@Alan Moore - Of course. Thanks.
Kobi
This did exactly what I needed it to do. Thanks! I will have to google what the match => match. syntax means. I've never seen it before and that definately did the trick. I was afraid I would have to loop through match groups or something else that didn't seem particularly elegant. This is exactly what I was hoping to get.
Registered User
@Registered User - Thanks! That's a lambda expression by the way - an anonymous function, just like `delegate` from previous versions of C#. In a way it does look over matches, but in a subtle way.
Kobi
+2  A: 

Here's a pure regex version:

str2 = Regex.Replace(str0, 
          @"""|,(?=(?>[^""]*""[^""]*(?:""[^""]*""[^""]*)*)$)", 
          String.Empty);

It matches any quotation mark, or a comma if it's followed by an odd number of quotation marks, and replaces it with nothing.

I would only go this route if I absolutely had to, for example if I were working with a framework that only let me specify the regex and the replacement string. Otherwise, I would either go with @Kobi's approach (because it's so much more readable) or use a dedicated CSV processor. They're not hard to find.

Alan Moore
I went with Kobi's approach because it doesn't have to solely be limited to regex. I just assumed there had to be a way to do it with regex since it seemed like an appropriate tool for the problem. I am definately going to disect your code, though, to see if I can understand it and learn something useful from it.
Registered User