I need a regex that will parse a CSV-style file, something like 57 fields wide, with most fields enclosed in double quotes (but maybe not all) and separated by commas. Quoted fields may contain embedded doubled quotes ("") that each represent a single literal double-quote character in the evaluated string.

I'm a regex beginner/intermediate, and I think I can get pretty quickly to the basic expression to do the field parsing, but it's the embedded double-quotes (and commas) I can't get my head around.

Anyone? (Not that it matters but specific language is Matlab.)

+1  A: 

Escape the quote; the ? after it makes it optional.

\"?
Josh
But don't I need to identify the surrounding quotes in the regex? That's the part I'm getting confused over. A string value will be surrounded by quotes and separated from other fields by commas, but such a value can contain embedded double-quote bits, and embedded commas. I just can't see how to write this one.
John Pirie
+4  A: 

I know there is great hype around regular expressions nowadays, but I would really recommend using a library for tasks that have already been implemented by others. It will be easier to implement, easier to read and easier to maintain (want to read CSVs separated by quotes next time? The library can possibly do it, but your regex will need a rewrite). A quick Google search should give you a good start.
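
To illustrate the library route, here is a minimal sketch using Python's standard csv module (Python rather than Matlab only because it is the language the asker later posts code in). The sample line is made up for illustration; it mixes quoted and unquoted fields, an embedded comma, a doubled quote and an empty field.

```python
import csv
import io

# Hypothetical sample line: quoted and unquoted fields, an embedded comma,
# a doubled quote ("") standing for one literal quote, and an empty field.
sample = '"Smith, John",42,"He said ""hi""",,plain\n'

# csv.reader's defaults (delimiter=',', quotechar='"', doublequote=True)
# already match the format described in the question.
rows = list(csv.reader(io.StringIO(sample)))

for row in rows:
    print(row)
```

The reader takes any iterable of lines, so a large file can be processed row by row instead of being loaded all at once.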

soulmerge
Agreed. You can look on MATLAB Central for M-code files, but they tend to be immature. But Matlab can pull many Java libraries in easily. I've had luck using the OpenCSV Java library in Matlab, writing a thin M-code wrapper for it.
Andrew Janke
A: 

If you really have to do it with a regex, I would do it in two passes; firstly separate the fields by splitting on the commas with something such as:

regexp(theString, '(?<!\\),', 'split');

This should split on commas only when there isn't a preceding backslash (I'm assuming this is what you mean by escaped commas). With the 'split' option, I believe Matlab returns a cell array of the substrings between the matches rather than indexes into the original string.

Then you should check each matched field for escaped quotes, and replace them with something like:

regexprep(individualString, '""', '"');

Similarly for commas:

regexprep(individualString, '\\,', ',');

I'm not sure about the doubly escaped \'s in matlab having not had much experience with it.

As others have said, it's probably better to use a csv library for handling the initial file.
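
For comparison, the two passes above could be sketched in Python roughly as follows. The function name split_escaped and the sample strings are made up for illustration, and the second call shows the main limitation: a comma inside a quoted field is still treated as a separator.

```python
import re

def split_escaped(line):
    """Two-pass parse: split on unescaped commas, then unescape each field."""
    # Pass 1: split on commas that are not preceded by a backslash.
    raw_fields = re.split(r'(?<!\\),', line)
    # Pass 2: collapse doubled quotes and unescape backslash-commas.
    return [f.replace('""', '"').replace('\\,', ',') for f in raw_fields]

# Backslash-escaped comma and doubled quotes are handled.
print(split_escaped(r'a\,b,c ""d"",e'))

# A comma inside a quoted field is NOT protected: the field is cut in two.
print(split_escaped('"x,y",z'))
```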

owst
We basically ended up iterating item-by-item on each line, first matching and then substituting; so this gets the check for being closest. But see my posted answer.
John Pirie
does this work for commas inside quotes? i never could get regexps to work w/ quoted strings.
Jason S
The first regexp will split on all commas that aren't preceded by a backslash, (to separate into fields). So unless all commas are properly escaped then you could get some interesting results! It is a very simple regexp (probably too simple :)), it doesn't attempt to check for anything inside quotes
owst
A: 

It took me a while to work this out, since many of the regexps on the net don't handle one part or another. Here is code in F#/.NET. Sorry, but I don't speak Matlab:

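open System.Text.RegularExpressions

let splitCsv (s:string) =
    // One field: optional whitespace, then either a quoted string (with "" escapes)
    // or unquoted text, followed by a comma or the end of the input.
    let re = Regex("\\s*((?:\"(?:(?:\"\")|[^\"])*\")|[^\"]*?)\\s*(?:,|$)")

    re.Matches(s + " ")
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Groups.[1].Value)
    |> Seq.map (fun s -> s.Replace("\"\"", "\""))
    |> Seq.map (fun s -> s.Trim([| '"'; ' ' |]))
    |> List.ofSeq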

This version handles quoted strings, quotes escaped as double-quotes, and trims extra (escaped) quotes and spaces around the whole string (original: "Test", double-quoted: """Test"""). It also properly handles an empty field in the last position (hence the s + " ") and it also properly handles commas inside quoted strings.

James Hugard
A: 

Thanks for replies. Classic case of beginner thinking the problem is easy, experts knowing the problem is hard.

After reading your posts, I browsed for a canned CSV parser library in Matlab and found a couple, neither of which could get the job done (the first tried to do the whole file at once and failed on memory; the second failed on my specific bugaboo, doubled quotes in a quoted string).

So we rolled our own, with the help of a regex I found on the web and modified. Remains to be moved to Matlab but Python code is as follows:

import re

text = ["<omitted>"]

# Regex: empty before comma OR string w/ no quote or comma OR quote-surrounded string w/ optional doubles
p = re.compile(r'(?=,)|[^",]+|"(?:[^"]|"")*"')

for line in text:
    print('Line: %s' % line)
    m = p.search(line)
    fld = 1
    while m:
        val = m.group()
        if val.startswith('"'):
            # Quoted field: drop only the enclosing quotes, then collapse doubled quotes
            val = val[1:-1].replace('""', '"')
        print('Field %d: %s' % (fld, val))
        line = p.sub('', line, count=1)    # remove the field just consumed
        if line and line[0] == ',':        # then the separator that follows it
            line = line[1:]
        fld += 1
        m = p.search(line)
    print()
John Pirie
A: 

Page 271 of Friedl's Mastering Regular Expressions has a regular expression for extracting possibly quoted CSV fields, but it requires a bit of postprocessing:

>>> re.findall('(?:^|,)(?:"((?:[^"]|"")*)"|([^",]*))', '"a,b,c",d,e,f')
[('a,b,c', ''), ('', 'd'), ('', 'e'), ('', 'f')]
>>> re.findall('(?:^|,)(?:"((?:[^"]|"")*)"|([^",]*))', '"a,b,c",d,,f')
[('a,b,c', ''), ('', 'd'), ('', ''), ('', 'f')]

Same pattern with the verbose flag:

csv = re.compile(r"""
    (?:^|,)
    (?: # now match either a double-quoted field
        # (inside, paired double quotes are allowed)...
        " # (double-quoted field's opening quote)
          (    (?: [^"] | "" )*    )
        " # (double-quoted field's closing quote)
    |
      # ...or some non-quote/non-comma text...
        ( [^",]* )
    )""", re.X)
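
The postprocessing mentioned above might look something like this. It is only a sketch: the names FIELD and parse_line are made up here, and the pattern is the one shown above, unchanged. For each match it picks whichever group took part and collapses doubled quotes in quoted fields.

```python
import re

# Same pattern as above: group 1 is the inside of a quoted field,
# group 2 is an unquoted field.
FIELD = re.compile(r'(?:^|,)(?:"((?:[^"]|"")*)"|([^",]*))')

def parse_line(line):
    """Return the fields of one CSV line as plain strings."""
    fields = []
    for m in FIELD.finditer(line):
        quoted, bare = m.group(1), m.group(2)
        if quoted is not None:
            # Quoted field: a doubled quote stands for one literal quote.
            fields.append(quoted.replace('""', '"'))
        else:
            fields.append(bare)
    return fields

print(parse_line('"a,b,c",d,,f'))
print(parse_line('"say ""hi""",x'))
```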
Greg Bacon
A: 

It's possible to do this with a single regex that uses lookahead. Illustrated here in Perl:

my @rows;

foreach my $line (@lines) {

    my @cells;
    while ($line =~ /( ("|').+?\2 | [^,]+? ) (?=(,|$))/gx) {
        push @cells, $1;
    }

    push @rows, \@cells;
}
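
A rough Python rendering of the same idea, for anyone who wants to experiment with it. The names CELL and cells are made up for this sketch, and the lookahead is written without the extra capture group. Two things to be aware of: the surrounding quotes stay in the captured text, and because the unquoted branch needs at least one character, an empty field produces no match at all.

```python
import re

# Quoted cell (either quote character, closed by the same one) or a run of
# non-comma characters, in both cases followed by a comma or the end of line.
CELL = re.compile(r'''( (["']).+?\2 | [^,]+? ) (?=,|$)''', re.X)

def cells(line):
    """Return the raw cells of one line; quotes are not stripped."""
    return [m.group(1) for m in CELL.finditer(line)]

# A comma inside quotes is kept within its cell.
print(cells('"a,b",c,d'))

# The empty middle field is silently skipped.
print(cells('a,,b'))
```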