
How do you parse a CSV file using gawk? Simply setting FS="," is not enough, as a quoted field containing a comma will be split into multiple fields.

Example using FS="," which does not work:

file contents:

one,two,"three, four",five
"six, seven",eight,"nine"

gawk script:

BEGIN { FS="," }
{
  for (i=1; i<=NF; i++) printf "field #%d: %s\n", i, $(i)
  printf "---------------------------\n"
}

bad output:

field #1: one
field #2: two
field #3: "three
field #4:  four"
field #5: five
---------------------------
field #1: "six
field #2:  seven"
field #3: eight
field #4: "nine"
---------------------------

desired output:

field #1: one
field #2: two
field #3: "three, four"
field #4: five
---------------------------
field #1: "six, seven"
field #2: eight
field #3: "nine"
---------------------------
A: 

Here's what I came up with. Any comments and/or better solutions would be appreciated.

BEGIN { FS="," }
{
  n = 0    # reset the output field count for each record
  for (i=1; i<=NF; i++) {
    f[++n] = $i
    if (substr(f[n],1,1)=="\"") {
      while (substr(f[n], length(f[n]))!="\"" || substr(f[n], length(f[n])-1, 1)=="\\") {
        f[n] = sprintf("%s,%s", f[n], $(++i))
      }
    }
  }
  for (i=1; i<=n; i++) printf "field #%d: %s\n", i, f[i]
  print "----------------------------------"
}

The basic idea is that I loop through the fields, and any field which starts with a quote but does not end with a quote gets the next field appended to it.

MCS
This looks more C-like. Are we using the right tool for the job? I am a novice in awk and can't think of any straightforward solutions, though.
Vijay Dev
@Vijay Dev, novice means beginner, not expert.
Robert Gamble
Ah, my English! I wanted to say: 'I am a novice and *so* I can't think of any straightforward solutions'.
Vijay Dev
FYI, this only works properly on a multi-line file if n is reset to 0 before each record is processed.
GoldenBoy
+1  A: 

I am not exactly sure whether this is the right way to do things. I would rather work with a CSV file in which either all values are quoted or none are. By the way, awk allows regexes to be field separators; check whether that is useful.

Vijay Dev
Unfortunately exporting Excel files as CSV does stupid shit like this.
Paul Tomblin
I would also go for the regexp approach and try to match something like this: ^"|","|"$ (this is a quick shot; you of course have to escape the ", but I want to keep it simple).
flolo
+5  A: 

The short answer is "I wouldn't use gawk to parse CSV if the CSV contains awkward data", where 'awkward' means things like commas in the CSV field data.

The next question is "What other processing are you going to be doing", since that will influence what alternatives you use.

I'd probably use Perl and the Text::CSV or Text::CSV_XS modules to read and process the data. Remember, Perl was originally written in part as an awk and sed killer - hence the a2p and s2p programs still distributed with Perl which convert awk and sed scripts (respectively) into Perl.

Jonathan Leffler
+2  A: 

If permissible, I would use the Python csv module, paying special attention to the dialect used and formatting parameters required, to parse the CSV file you have.

ayaz
+1  A: 
# csv2delim.awk converts comma-delimited files with optional quotes to a delim-separated file
#     delim can be any character, defaults to tab
# assumes no repl characters in the text; any delim already in a line is converted to repl
#     repl can be any character, defaults to ~
# changes two consecutive quotes within quotes to '

# usage: gawk -f csv2delim.awk [-v delim=d] [-v repl=r] input-file > output-file
#    -v delim delimiter, defaults to tab
#    -v repl  replacement char, defaults to ~

# e.g. gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > test.txt

# abe 2-28-7
# abe 8-8-8 1.0 fixed empty fields, added replacement option
# abe 8-27-8 1.1 used split
# abe 8-27-8 1.2 inline rpl and "" = '
# abe 8-27-8 1.3 revert to 1.0 as it is much faster, split most of the time
# abe 8-29-8 1.4 better message if delim present

BEGIN {
    if (delim == "") delim = "\t"
    if (repl == "") repl = "~"
    print "csv2delim.awk v.m 1.4 run at " strftime() > "/dev/stderr"
}

{
    #if ($0 ~ repl) {
    #    print "Replacement character " repl " is on line " FNR ":" lineIn ";" > "/dev/stderr"
    #}
    if ($0 ~ delim) {
        print "Temp delimiter character " delim " is on line " FNR ":" lineIn ";" > "/dev/stderr"
        print "    replaced by " repl > "/dev/stderr"
    }
    gsub(delim, repl)

    $0 = gensub(/([^,])\"\"/, "\\1'", "g")
#   $0 = gensub(/\"\"([^,])/, "'\\1", "g") # not needed, the line above covers all cases

    out = ""
    n = length($0)
    for (i = 1;  i <= n;  i++)
        if ((ch = substr($0, i, 1)) == "\"")
            inString = (inString) ? 0 : 1 # toggle inString
        else
            out = out ((ch == "," && ! inString) ? delim : ch)
    print out
}

END {
    print NR " records processed from " FILENAME " at " strftime() > "/dev/stderr"
}

test.csv--------------------------------

"first","second","third"
"fir,st","second","third"
"first","sec""ond","third"
" first ",sec   ond,"third"
"first" , "second","th  ird"
"first","sec;ond","third"
"first","second","th;ird"
1,2,3
,2,3
1,2,
,2,
1,,2
1,"2",3
"1",2,"3"
"1",,"3"
1,"",3
"","",""
"","""aiyn","oh"""
"""","""",""""
11,2~2,3

test.bat--------------------------------

rem test csv2delim
rem default is: -v delim={tab} -v repl=~
gawk                      -f csv2delim.awk test.csv > test.txt
gawk -v delim=;           -f csv2delim.awk test.csv > testd.txt
gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > testdr.txt
gawk            -v repl=` -f csv2delim.awk test.csv > testr.txt



A: 

The above csv2delim.awk is incredibly cool. It worked like a charm for me and it went fast too! Thanks a million!

A: 

I wrote one. It parses CSV into an array and works well for most cases. I have a new version in the works with many changes and improvements that works even better.

http://lorance.freeshell.org/csv/

LoranceStinson