I have a CSV file (foo.csv) with 200,000 rows. I need to break it into four files (foo1.csv, foo2.csv... etc.) with 50,000 rows each.

I already tried simple copy/paste (ctrl-c/ctrl-v) in GUI text editors, but my computer slows to a halt.

What unix command(s) could I use to accomplish this task?

+1  A: 
sed -n 2000,4000p somefile.txt

will print lines 2000 through 4000 to stdout; redirect each 50,000-line range of foo.csv to its own file to get the four pieces.

deinst
+2  A: 

I don't have a terminal handy to try it out, but it should be just split -d -l 50000 foo.csv.

Hopefully the naming isn't terribly important. By default the pieces come out as x00 .. x03; pass foo.csv again as the output prefix (split -d -l 50000 foo.csv foo.csv) to get foo.csv00 .. foo.csv03. You can add the -a 1 option so that the suffixes are just 0-3, but there's no simple way to get the suffix injected into the middle of the filename.
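
If you do want the question's exact names (foo1.csv .. foo4.csv), a small rename loop afterwards should do it, assuming the foo.csv00 .. foo.csv03 output described above:

# rename foo.csv00 .. foo.csv03 to foo1.csv .. foo4.csv
i=1
for f in foo.csv0[0-3]; do
    mv "$f" "foo$i.csv"
    i=$((i+1))
done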

Mark Rushakoff
A: 

You can use head and tail.

head -n 50000 myfile > part1.csv
head -n 100000 myfile | tail -n 50000 > part2.csv 
head -n 150000 myfile | tail -n 50000 > part3.csv 

etc ...
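
The last chunk is just the tail of the file, so presumably:

tail -n 50000 myfile > part4.csv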

Otherwise, if you don't need control over the file names, you can use the unix split command.

Guillaume Lebourgeois
+1  A: 

split -l50000 foo.csv
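
By default that leaves the pieces named xaa, xab, xac and xad. If you want something closer to the question's names, split also takes an output prefix as its second argument, e.g.:

split -l 50000 foo.csv foo_

which should give foo_aa .. foo_ad.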

Jeremy
A: 

You can use sed
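
For instance, something along these lines should produce the four pieces for the question's file (untested):

sed -n '1,50000p'       foo.csv > foo1.csv
sed -n '50001,100000p'  foo.csv > foo2.csv
sed -n '100001,150000p' foo.csv > foo3.csv
sed -n '150001,200000p' foo.csv > foo4.csv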

Jon Freedman
A: 

I wrote this little shell script for a question very similar to yours.

This shell script + awk works fine for me:

#!/bin/bash
# Usage: ./script.sh <first_line> <last_line> <file>
# Print the lines of <file> from <first_line> to <last_line>, inclusive.
awk -v initial_line="$1" -v end_line="$2" '{
    if (NR >= initial_line && NR <= end_line)
        print $0
}' "$3"

Used with this sample file (file.txt):

one
two
three
four
five
six

The command (it extracts lines two through four of the file):

edu@debian5:~$ ./script.sh 2 4 file.txt

Output of this command:

two
three
four

Of course, you can improve it, for example by checking that all the arguments have the expected values :-)
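
A minimal version of that check might look something like this (just a sketch, using the same script name and arguments as above):

#!/bin/bash
# Refuse to run unless we got exactly three arguments,
# the line numbers are numeric, and the file is readable.
if [ "$#" -ne 3 ]; then
    echo "Usage: $0 <first_line> <last_line> <file>" >&2
    exit 1
fi
case "$1$2" in
    *[!0-9]*) echo "Line numbers must be positive integers" >&2; exit 1 ;;
esac
if [ ! -r "$3" ]; then
    echo "Cannot read file: $3" >&2
    exit 1
fi
awk -v initial_line="$1" -v end_line="$2" 'NR >= initial_line && NR <= end_line' "$3"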

SourceRebels