I am trying to parse a CSV file containing potentially 100k+ lines. Here are the criteria I have:

  1. The index of the identifier
  2. The identifier value

I would like to retrieve all lines in the CSV that have the given value in the given index (delimited by commas).

Any ideas, with special consideration for performance?

+3  A: 

First prototype using plain old grep and cut:

grep "${VALUE}" inputfile.csv | cut -d, -f"${INDEX}"

If that's fast enough and gives the proper output, you're done. :)
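One possible refinement of this prototype, offered as a sketch rather than as the answer's own method: grep alone matches the value anywhere on the line, and cut then keeps only one field. If whole matching lines are wanted, grep -F can stay as a cheap prefilter while a second stage checks the exact field. The function name rows_with_value is hypothetical, and the second stage still assumes plain comma-separated fields with no quoting.

```shell
#!/bin/sh
# Hypothetical helper: print whole lines whose INDEX-th (1-based) field
# equals VALUE. Assumes simple CSV: no quoted fields, no embedded commas.
rows_with_value() {
    file=$1
    index=$2
    value=$3
    # Stage 1: fixed-string prefilter drops lines that cannot match.
    # Stage 2: compare the requested field; appending "" to val forces
    # a string comparison even when both sides look numeric.
    grep -F -- "$value" "$file" |
        awk -F, -v col="$index" -v val="$value" '$col == (val "")'
}
```

It could be called as rows_with_value inputfile.csv "$INDEX" "$VALUE". Whether the prefilter actually helps depends on how selective the value is, so timing both variants on real data is the only reliable check.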

unwind
+1. This pipeline doesn't allow comma escaping (`\,`) or string quoting (`"foo, bar"`). But it is a good and simple way of solving the problem.
Andrey Vlasovskikh
There's no need to use two tools across a pipe. I would recommend using awk.
ghostdog74
@ghostdog: I don't know awk, and looking at e.g. Nate Kohl's awk reply, I think this qualifies as being simpler, at least.
unwind
`$(VALUE)` gives the output of a command named "VALUE". I think you meant to use curly braces for a variable (which should be quoted): `grep "${VALUE}"...`
Dennis Williamson
@Dennis: thanks! I always mess those up. :/
unwind
+1  A: 

A sed or awk solution would probably be shorter, but here's one for Perl:

perl -F/,/ -lane 'print if $F[<INDEX>] eq "<VALUE>"' inputfile.csv

where <INDEX> is 0-based (0 for first column, 1 for 2nd column, etc.)

mobrule
+1  A: 

Using awk:

export INDEX=2
export VALUE=bar

awk -F, '$'$INDEX' ~ /^'$VALUE'$/ {print}' inputfile.csv

Edit: As per Dennis Williamson's excellent comment, this could be much more cleanly (and safely) written by defining awk variables using the -v switch:

awk -F, -v col="$INDEX" -v value="$VALUE" '$col == value {print}' inputfile.csv

Jeez...with variables, and everything, awk is almost a real programming language...
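To illustrate why the -v form is safer, here is a small sketch on made-up data. The regex variant treats characters such as "." in the value as metacharacters, while the comparison variant does not. The sample file and the variable name col are assumptions for the demonstration; col is used because index is also the name of a built-in awk function.

```shell
#!/bin/sh
# Made-up sample data: three rows, second column holds the identifier.
sample=$(mktemp)
printf '1,abc,x\n2,a.c,y\n3,abcd,z\n' > "$sample"

INDEX=2
VALUE='a.c'

# Regex variant: the value is spliced into the awk program text,
# so "." matches any character and both "abc" and "a.c" rows are printed.
regex_hits=$(awk -F, '$'"$INDEX"' ~ /^'"$VALUE"'$/' "$sample")

# Variable variant: the value is passed as data and compared as a string,
# so only the row whose field is literally "a.c" is printed.
exact_hits=$(awk -F, -v col="$INDEX" -v value="$VALUE" '$col == (value "")' "$sample")

printf 'regex:\n%s\nexact:\n%s\n' "$regex_hits" "$exact_hits"
rm -f "$sample"
```

One caveat: -v assignments still interpret backslash escape sequences, so a value containing backslashes may be better passed through the environment and read with ENVIRON.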

Nate Kohl
The exports are likely unnecessary. And you should use awk's variable-passing feature, otherwise the quoting can get hairy: `awk -F, -v col="$INDEX" -v value="$VALUE" '$col == value {print}' inputfile.csv`
Dennis Williamson
+1  A: 

index=1
value=2
awk -F"," -v i="$index" -v v="$value" '$(i)==v' file

ghostdog74
+2  A: 

CSV isn't quite that simple. Depending on the limits of the data you have, you might have to worry about quoted values (which may contain commas and newlines) and escaping quotes.

So if your data are restricted enough that you can get away with simple comma-splitting, a shell script can do that easily. If, on the other hand, you need to parse CSV ‘properly’, bash would not be my first choice. Instead I'd look at a higher-level scripting language, for example Python with a csv.reader.
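As a rough illustration of that distinction (a sketch with hypothetical names, not a full validity check): assuming the file uses the usual double-quote convention, a file with no double-quote character at all has no quoted fields, so plain comma-splitting sees the same fields a CSV parser would. The function name has_quoted_fields and the sample rows are made up.

```shell
#!/bin/sh
# Hypothetical check: succeeds (exit 0) if the file contains any double
# quote, i.e. it may have quoted fields and naive splitting is not safe.
has_quoted_fields() {
    grep -q '"' "$1"
}

# Made-up sample: the data row has a comma inside a quoted field.
sample=$(mktemp)
printf 'Name,Phone\n"Woo, John",425-555-1212\n' > "$sample"

# Naive splitting counts the comma inside the quotes as a separator.
naive_count=$(awk -F, 'NR == 2 { print NF }' "$sample")
echo "fields seen by awk -F, on line 2: $naive_count"

if has_quoted_fields "$sample"; then
    verdict="quoted fields present: use a real CSV parser"
else
    verdict="no quotes found: comma-splitting should be fine"
fi
echo "$verdict"
rm -f "$sample"
```

A check like this is cheap even on 100k+ lines, since grep -q stops at the first match.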

bobince
+2  A: 

As an alternative to cut- or awk-based one-liners, you could use the specialized csvtool aka ocaml-csv:

$ cat yourfile | csvtool -t ',' col "$index" - | grep "$value"

According to the docs, it handles escaping, quoting, etc.

Andrey Vlasovskikh
A: 

In a CSV file, each field is separated by a comma. The problem is, a field itself might have an embedded comma:

Name,Phone
"Woo, John",425-555-1212

You really need a library package that offers robust CSV support instead of relying on the comma as a field separator. I know that scripting languages such as Python have such support. However, I am comfortable with the Tcl scripting language, so that is what I use. Here is a simple Tcl script which does what you are asking for:

#!/usr/bin/env tclsh

package require csv 
package require Tclx

# Parse the command line parameters
lassign $argv fileName columnNumber expectedValue

# Subtract 1 from columnNumber because Tcl's list index starts with a
# zero instead of a one
incr columnNumber -1

for_file line $fileName {
    set columns [csv::split $line]
    set columnValue [lindex $columns $columnNumber]
    if {$columnValue eq $expectedValue} {
        puts $line
    }   
}

Save this script to a file called csv.tcl and invoke it as:

$ tclsh csv.tcl filename indexNumber expectedValue

Explanation

The script reads the CSV file line by line and stores each line in the variable $line, then splits the line into a list of columns (variable $columns). Next, it picks out the specified column and assigns it to the $columnValue variable. If there is a match, it prints the original line.

Hai Vu