tags:

views:

123

answers:

3

I have a CSV file that I'd like to split up based on a field in the file. Essentially, there can be two brands, GVA and HBVL. I'd like to split the file into a file for each brand before I import it into a database.

Sample of the CSV file

"D509379D5055821451C3695A3752DCCD",'1900-01-01 01:00:00',"M","1740","GVA",'2009-07-01 13:25:00',0
"159A58BE41012787D531C7157F688D86",'1900-01-01 00:00:00',"V","1880","GVA",'2008-06-06 11:21:00',0
"D0BB5C058794BBE4478DDA536D1E4872",'1900-01-01 00:00:00',"M","9270","GVA",'2007-09-18 13:21:00',0
"BCC7096803E5E60E05DC12FB9951E0CF",'1900-01-01 00:00:00',"M","3500","HBVL",'2007-09-18 13:21:00',1
"7F85FCE6F13775A8A3054E3438B81599",'1900-01-01 00:00:00',"M","3970","HBVL",'2007-09-18 13:20:00',0

Part of the problem is the size of the file. It's about 39mb. My original attempt at this looked like this:

while read line ; do

    name=`echo $line | sed -n 's/\(.*\)"\(GVA\|HBVL\)",\(.*\)$/\2/ p' | tr [:upper:] [:lower:] `
    info=`echo $line | sed -n 's/\(.*\)"\(GVA\|HBVL\)",\(.*\)$/\1\3/ p'`

    echo "${info}" >> ${BASEDIR}/${today}/${name}.txt

done < ${file}

After about 2.5 hours, only about 1/2 of the file had been processed. I have another file that could potentially be up to 250 mb in size and I can't imagine how long that would take.

What I'd like to do is pull out the brand out of the line and write the line to a file named after the brand. I can remove the brand, but I don't now how to use it to create a file. I've started in sed, but I'm not above using another language if it's more appropriate.

+3  A: 

The original while loop with multiple commands per line is DIRE!

sed -e '/"GVA"/w gva.file' -e '/"HBVL"/w hbvl.file' -n $file

The sed script says:

  • write lines that match the GVA tag to gva.file
  • write lines that match the HBVL tag to hbvl.file
  • and don't print anything else ('-n')

Note that different versions of sed can handle different numbers of auxilliary files. If you need more than, say, twenty output files at once, you may need to look at other technology (but test what the limit is on your machine). If the file is sorted so that all the GVA records appear together followed by all the HBVL records, you could consider using csplit. Alternatively, a scripting language like Perl could handle more. If you exceed the number of file descriptors allowed to your process, it becomes hard to do the splitting in a single pass over the data file.

Jonathan Leffler
Note: You can still strip out the brand field using this technique: `sed -n -e '/"GVA"/ s/\(.*\)"\(GVA\)",\(.*\)$/\1\3/ w gva.file' -e '...`
Dennis Williamson
I knew I was over-thinking things. I really shouldn't try solving problems like this so late in the day.
cabuki
+1  A: 
grep '"GVA"' $file >GVA.txt
grep '"HVBL"' $file >HVBL.txt
Mark Ransom
Two commands per file - better than several per line, but still slower...
Jonathan Leffler
It might not be the fastest solution, but it's probably close enough. And certainly easy enough to understand.
Mark Ransom
if the file is very big, then its no point going through 2 iterations of the file using grep. Its better to go through the file once and get the things you need.
ghostdog74
+1  A: 
# awk -F"," '{o=$5;gsub(/\"/,"",o);print $0 > o}' OFS="," file
# more GVA
"D509379D5055821451C3695A3752DCCD",'1900-01-01 01:00:00',"M","1740","GVA",'2009-07-01 13:25:00',0
"159A58BE41012787D531C7157F688D86",'1900-01-01 00:00:00',"V","1880","GVA",'2008-06-06 11:21:00',0
"D0BB5C058794BBE4478DDA536D1E4872",'1900-01-01 00:00:00',"M","9270","GVA",'2007-09-18 13:21:00',0
# more HBVL
"BCC7096803E5E60E05DC12FB9951E0CF",'1900-01-01 00:00:00',"M","3500","HBVL",'2007-09-18 13:21:00',1
"7F85FCE6F13775A8A3054E3438B81599",'1900-01-01 00:00:00',"M","3970","HBVL",'2007-09-18 13:20:00',0
ghostdog74
...hope there are no commas in those quoted literals...
Steven Huwig