views:

253

answers:

5

Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).

Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.

I am guessing some concoction of split and head will do the trick?

+1  A: 

I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.

$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done

This is assuming your input file is file.txt, you're not using the prefix argument to split, and you're working in a directory that doesn't have any other files that start with split's default xa* output format. Also, replace the '4' with your desired split line size.

Rob Hruska
+1  A: 

I'm never sure about the rules of copying scripts straight from other people's sites, but Geekology has a nice script to do what you want, with a few comments confirming that it works. Be sure to do tail -n +2 as noted in a comment near the bottom.

Mark Rushakoff
+1  A: 

This is robhruska's script cleaned up a bit:

tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat $file >> tmp_file
    mv -f tmp_file $file
done

I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.

If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard coded one.

Dennis Williamson
This will certainly work. I was just hoping for some slick one-liner like `for $part in (split -l 1000 myfile); cat <(head -n1 myfile) $part > myfile.$part; done`
scrible
That can't work because `split`, of necessity, doesn't output on `stdout`.
Dennis Williamson
`split` *could* output the *names* of the files to stdout, though (as long as we are discussing what `split` *ought* to do :-)
scrible
You're right. That could be *handy*. Sorry I misread your one-liner.
Dennis Williamson
+1  A: 

This is a more robust version of Denis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around if the run was incomplete. So, let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyways.

trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT 
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat $file >> tmp_file
    mv -f tmp_file $file
done

Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyways (as some have already suggested), so go ahead and remove 'tmp_file" from the rm in the trap line. See the signal man page for more signals to catch.

Sam Bisbee
+1  A: 

You can use [mg]awk:

awk 'NR==1{
        header=$0; 
        count=1; 
        print header > "x_" count; 
        next 
     } 

     !( (NR-1) % 100){
        count++; 
        print header > "x_" count;
     } 
     {
        print $0 > "x_" count
     }' file

100 is the number of lines of each slice. It doesn't require temp files and can be put on a single line.

marco
Upvoting for teaching me something new, but if I am going to write a small script, I might as well do it in Perl or Python :-)
scrible