tags:

views:

191

answers:

4

I am trying to extract 4th column from csv file (comma separated, and skipping first 2 header lines) using this command,

 awk 'NR <2 {next}{FS =","}{print $4}' filename.csv | more

However, it doesn't work because the first column cantains comma, thus 4th column is not really 4th. Below is an example of a row:

"sdfsdfsd, sfsdf", 454,fgdfg, I_want_this_column,sdfgdg,34546, 456465, etc

A: 

You shouldn't use awk here. Use Python csv module or Perl Text::CSV or Text::CSV_XS modules or another real csv parser.

Related question - http://stackoverflow.com/questions/314384/parse-csv-file-using-gawk

Leonid Shvechikov
+2  A: 

Unless you have specific reasons for using awk, I would recommend using a CSV parsing library. Many scripting languages have one built-in (or at least available) and they'll save you from these headaches.

Benjamin Oakes
+1  A: 

if your first column has quotes always,

 $ awk 'BEGIN{ FS="\042[ ]*," } { m=split($2,a,","); print a[3] } ' file
 I_want_this_column

if the column you want is always the last 2nd,

$ awk -F"," '{print $(NF-1)}' file
 I_want_this_column

You can try this demo script to break down the columns

awk 'BEGIN{ FS="," }
{
   for(i=1;i<=NF;i++){
      # save normal
      if($i !~ /^[ ]*\042|[ ]*\042[ ]*$/){
        a[++j]=$i
      }
      # if quotes at the end
      if(f==1 && $i ~ /[ ]*\042[ ]*$/){
        s=s","$i
        a[++j]=s
        #reset
        s="";f=0
      }
      # if quotes in front
      if($i ~ /^[ ]*\042/){
        s=s $i
        f=1
      }
      if(f==1 && ( $i !~/\042/ ) ){
         s=s","$i
      }
   }
}
END{
  # print columns
  for(p=1;p<=j;p++){
     print "Field "p,": "a[p]
  }
} ' file

output

$ cat file
"sdfsdfsd, sfsdf", "454,fgdfg blah , words ", I_want_this_column,sdfgdg

$ ./shell.sh
Field 1 : "sdfsdfsd, sfsdf"
Field 2 : fgdfg blah
Field 3 :  "454,fgdfg blah , words "
Field 4 :  I_want_this_column
Field 5 : sdfgdg
ghostdog74
It is NOT the case, as it may NOT have the comma in the first column
vehomzzz
A: 

If you can't avoid awk, this piece of code does the job you need:

BEGIN {FS=",";}

{
        f=0;
        j=0;
        for (i = 1; i <=NF ; ++i) {
                if (f) {
                        a[j] = a[j] "," $(i);
                        if ($(i) ~ "\"$") {
                                f = 0;
                        }
                }
                else {
                        ++j;
                        a[j] = $(i);
                        if ((a[j] ~ "^\"[^\"]*$")) {
                                f = 1;
                        }
                }
        }
        for (i = 1; i <= j; ++i) {
                gsub("^\"","",a[i]);
                gsub("\"$","",a[i]);
                gsub("\"\"","\"",a[i]);
print "i = \"" a[i] "\"";
        }
}
Giuseppe Guerrini
it breaks when the comma has spaces after/before it. eg try on this data: `"sdfsdfsd, sfsdf" , "454,fgdfg", I_want_this_column`
ghostdog74
The original question stated 'FS =","', so I guess spaces are not an issue.
Giuseppe Guerrini