tags:

views:

3930

answers:

6

In C#, using the Regex class, how does one parse comma-separated values, where some values might be quoted strings themselves containing commas?

using System ;
using System.Text.RegularExpressions ;

class  Example
    {
    public static void Main ( )
     {
     string  myString  =  "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear" ;
     Console.WriteLine ( "\nmyString is ...\n\t" + myString + "\n" ) ;
     Regex   regex  =  new Regex  (  "(?<=,(\"|\')).*?(?=(\"|\'),)|(^.*?(?=,))|((?<=,).*?(?=,))|((?<=,).*?$)"  )  ;
     Match   match  =  regex.Match ( myString ) ;
     int j = 0 ;
     while ( match.Success )
      {
      Console.WriteLine ( j++ + " \t" + match ) ;
      match  =  match.NextMatch() ;
      }
     }
    }

Output (in part) appears as follows:

0       cat
1       dog
2       "0 = OFF
3        1 = ON"
4       lion
5       tiger
6       'R = red
7        G = green
8        B = blue'
9       bear

However, desired output is:

0       cat
1       dog
2       0 = OFF, 1 = ON
3       lion
4       tiger
5       R = red, G = green, B = blue
6       bear
+3  A: 

Try with this Regex:

"[^"\r\n]*"|'[^'\r\n]*'|[^,\r\n]*


    Regex regexObj = new Regex(@"""[^""\r\n]*""|'[^'\r\n]*'|[^,\r\n]*");
    Match matchResults = regexObj.Match(input);
    while (matchResults.Success) 
    {
        Console.WriteLine(matchResults.Value);
        matchResults = matchResults.NextMatch();
    }

Ouputs:

  • cat
  • dog
  • "0 = OFF, 1 = ON"
  • lion
  • tiger
  • 'R = red, G = green, B = blue'
  • bear

Note: This regex solution will work for your case, however I recommend you to use a specialized library like FileHelpers.

CMS
I tried this ---- new Regex ( "[^\"\r\n]*\"|[^,\r\n]*" ) ---- but output was:0 cat,dog,"1 0 = OFF, 1 = ON"23 lion45 tiger67 'R = red89 G = green1011 B = blue'1213 bear14
JaysonFix
edited answer...
CMS
For me, your latest edited answer yields the following output: 0 cat12 dog34 "0 = OFF, 1 = ON"56 lion78 tiger910 'R = red, G = green, B = blue'1112 bear13
JaysonFix
this is the correct regex. it works for me.
Stan R.
@CMS: I was able to get your idea to work for me, but I had to make a small modification (replacing * by + at the end). My Regex is ... new Regex ( @"""[^""\r\n]*""|'[^'\r\n]*'|[^,\r\n]+" )
JaysonFix
Almost forgot: Thank you, CMS, and others for your valuable input.
JaysonFix
Whoops ... spoke too soon ... my "solution" does not work when some fields are "empty" ... see, for example, the two consecutive commas following cat: "cat,,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear"
JaysonFix
CSV isn't regular - regexes will *never* be able to parse all valid CVS files. like ever. This is madness except as a dirty hack...
ShuggyCoUk
Here is what I have finally settled on ... string regexString = "" + "((?<=\")[^\"]*(?=\"))" // " ... to ... " + "|((?<=\')[^\']*(?=\'))" // ' ... to ... ' + "|(^[^,]*(?=,))" // ^ ... to ... , + "|((?<=,)[^,]*$)" // , ... to ... $ + "|(?<=,).{0}(?=,)" // ,, + "|(?<=,)([^\"\'].*?)(?=,)" // , ... ( excluding " ' ) to ... , ; Regex regex = new Regex ( regexString ) ;
JaysonFix
+3  A: 

it's not a regex, but I've used Microsoft.VisualBasic.FileIO.TextFieldParser to accomplish this for csv files. yes, it might feel a little strange adding a reference to Microsoft.VisualBasic in a C# app, maybe even a little dirty, but hey it works.

qntmfred
There sample how to use it http://vbnotebookfor.net/2007/05/12/using-textfieldparser/
volody
+10  A: 

Why not heed the advice from the experts and Don't roll your own CSV parser.

Your first thought is, "I need to handle commas inside of quotes."

Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."

It's a road to madness. Don't write your own. Find a library with an extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free and open source FileHelpers library.

Judah Himango
This was my very first thought, too.
codekaizen
Another reason: the CSV spec is actually surprisingly complex, considering all the different ways of quoting and delimiting values. Not to mention the fact that there *isn't* actually a spec for CSV … (There's RfC4180, but that explicitly states that it is only an "attempt" at a common spec, and that actual implementations vary.) So, chances are, if you roll your own, you'll do it wrong.
Jörg W Mittag
+1  A: 

Ah, RegEx. Now you have two problems. ;)

I'd use a tokenizer/parser, since it is quite straightforward, and more importantly, much easier to read for later maintenance.

This works, for example:

using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green,     B = blue',bear"; 
        Console.WriteLine("\nmyString is ...\n\t" + myString + "\n");
        CsvParser parser = new CsvParser(myString);

        Int32 lineNumber = 0;
        foreach (string s in parser)
        {
            Console.WriteLine(lineNumber + ": " + s);
        }

        Console.ReadKey();
    }
}

internal enum TokenType
{
    Comma,
    Quote,
    Value
}

internal class Token
{
    public Token(TokenType type, string value)
    {
        Value = value;
        Type = type;
    }

    public String Value { get; private set; }
    public TokenType Type { get; private set; }
}

internal class StreamTokenizer : IEnumerable<Token>
{
    private TextReader _reader;

    public StreamTokenizer(TextReader reader)
    {
        _reader = reader;    
    }

    public IEnumerator<Token> GetEnumerator()
    {
        String line;
        StringBuilder value = new StringBuilder();

        while ((line = _reader.ReadLine()) != null)
        {
            foreach (Char c in line)
            {
                switch (c)
                {
                    case '\'':
                    case '"':
                        if (value.Length > 0)
                        {
                            yield return new Token(TokenType.Value, value.ToString());
                            value.Length = 0;
                        }
                        yield return new Token(TokenType.Quote, c.ToString());
                        break;
                    case ',':
                       if (value.Length > 0)
                        {
                            yield return new Token(TokenType.Value, value.ToString());
                            value.Length = 0;
                        }
                        yield return new Token(TokenType.Comma, c.ToString());
                        break;
                    default:
                        value.Append(c);
                        break;
                }
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

internal class CsvParser : IEnumerable<String>
{
    private StreamTokenizer _tokenizer;

    public CsvParser(Stream data)
    {
        _tokenizer = new StreamTokenizer(new StreamReader(data));
    }

    public CsvParser(String data)
    {
        _tokenizer = new StreamTokenizer(new StringReader(data));
    }

    public IEnumerator<string> GetEnumerator()
    {
        Boolean inQuote = false;
        StringBuilder result = new StringBuilder();

        foreach (Token token in _tokenizer)
        {
            switch (token.Type)
            {
                case TokenType.Comma:
                    if (inQuote)
                    {
                        result.Append(token.Value);
                    }
                    else
                    {
                        yield return result.ToString();
                        result.Length = 0;
                    }
                    break;
                case TokenType.Quote:
                    // Toggle quote state
                    inQuote = !inQuote;
                    break;
                case TokenType.Value:
                    result.Append(token.Value);
                    break;
                default:
                    throw new InvalidOperationException("Unknown token type: " +    token.Type);
            }
        }

        if (result.Length > 0)
        {
            yield return result.ToString();
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
codekaizen
A: 
         string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear";

         bool inQuotes = false;
         char delim = ',';
         List strings = new List();

         StringBuilder sb = new StringBuilder();
         foreach( char c in myString )
         {
            if(c == '\'' || c == '"')
            {
               if(!inQuotes)
                  inQuotes = true;
               else
                  inQuotes = false;
            }
            if( c == delim )
            {
               if( !inQuotes )
               {
                  strings.Add( sb.Replace("'", string.Empty).Replace("\"", string.Empty ).ToString());
                  sb.Remove( 0, sb.Length );
               }
               else
               {
                  sb.Append( c );
               }
            }
            else
            {
               sb.Append( c );
            }
         }

         strings.Add( sb.Replace( "'", string.Empty ).Replace( "\"", string.Empty ).ToString() );

         foreach( string s in strings )
            Console.WriteLine( s );


Output:

cat
dog
0 = OFF, 1 = ON
lion
tiger
R = red, G = green, B = blue
bear
Partha Choudhury
+2  A: 

CSV is not regular. Unless your regex language has sufficient power to handle the stateful nature of csv parsing (unlikely, the MS one does not) then any pure regex solution is a list of bugs waiting to happen as you hit a new input source that isn't quite handled by the last regex.

CSV reading is not that complex to write as a state machine since the grammar is simple but even so you must consider: quoted quotes, commas within quotes, new lines within quotes, empty fields.

As such you should probably just use someone else's CSV parser. I recommend CSVReader for .Net

ShuggyCoUk