views: 1326
answers: 6

I want to copy the first 1000 lines of a text file containing more than 50 million lines to a new file, and also delete those lines from the original file.

Is there a way to do both with a single shell command in Unix?

+1  A: 
head -1000 file.txt > first1000lines.txt
tail --lines=+1001 file.txt > restoffile.txt
cletus
Upvoted, until I noticed the "and also delete these lines from the original file" requirement.
Brian Campbell
This does not delete lines from the original file.
Alex Reynolds
Have patience. Removing the first 1000 lines and writing the rest back out takes a long, long time.
le dorfier
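To also drop those lines from the original, one option is to write the remainder to a temporary file and move it back over the original (a sketch; the file names are illustrative):

head -1000 file.txt > first1000lines.txt
tail -n +1001 file.txt > file.txt.tmp && mv file.txt.tmp file.txt

Note this still rewrites the remaining ~50 million lines, so it is no faster than the other temp-file approaches below.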
A: 

Looks like a job for awk.

starblue
Some people might actually prefer a solution rather than a vague pointer. This seems helpful only in the very broadest sense of that word.
paxdiablo
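For the curious, a concrete sketch of the awk idea (file names are illustrative): split the stream on the line number, then replace the original with the remainder:

awk 'NR <= 1000 { print > "first1000.txt"; next } { print > "rest.txt" }' file.txt && mv rest.txt file.txt

Like the other approaches here, this reads and rewrites the entire file, so it is unlikely to beat head/tail.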
+9  A: 
head -1000 input > output && sed -i '1,+999d' input

For example:

$ cat input 
1
2
3
4
5
6
$ head -3 input > output && sed -i '1,+2d' input
$ cat input 
4
5
6
$ cat output 
1
2
3
marcog
sed: 1: "input": command i expects \ followed by text
Alex Reynolds
See example -- it works for me.
marcog
This still gives the same error message.
Alex Reynolds
You tried the example I pasted? :-/
marcog
@Alex, do you have a file named 'input'?
Journeyman Programmer
This does not work. Or if it does, it works with a specific version of sed.
Alex Reynolds
I'm using sed 4.1.5
marcog
Okay, I'm using FreeBSD, which does not have the GNU version of sed. I've added an answer with a test run of sed vs. tail that suggests tail is faster. It is only one test, but head/tail/cp/rm have standard implementations across UNIXes and, if faster, may be preferable to sed.
Alex Reynolds
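For reference, BSD sed (as on FreeBSD and macOS) requires an argument to -i and does not support GNU's addr,+N address form, which explains the error above. A rough BSD equivalent of the command would be:

head -1000 input > output && sed -i '' '1,1000d' input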
+2  A: 

This is a one-liner but uses four atomic commands:

head -1000 file.txt > newfile.txt; tail -n +1001 file.txt > file.txt.tmp; cp file.txt.tmp file.txt; rm file.txt.tmp
Alex Reynolds
He wants to *move* the first 1000 lines from one file to another. This deletes all but the first 1000 lines, i.e. is wrong.
marcog
You're right. I'll edit this to fix it.
Alex Reynolds
With "more than 50 million entries" that tail will be quite slow.
marcog
Why are you doing "cp file.txt.tmp file.txt; rm file.txt.tmp" instead of "mv file.txt.tmp file.txt"?
Espo
cp and rm are atomic filesystem operations. mv is not.
Alex Reynolds
Please see my answer below for one uncached trial each of the tail and sed approaches.
Alex Reynolds
+3  A: 

Out of curiosity, I found a box with a GNU version of sed (v4.1.5) and tested the (uncached) performance of two approaches suggested so far, using an 11M line text file:

$ wc -l input
11771722 input

$ time head -1000 input > output; time tail -n +1001 input > input.tmp; time cp input.tmp input; time rm input.tmp

real    0m1.165s
user    0m0.030s
sys     0m1.130s

real    0m1.256s
user    0m0.062s
sys     0m1.162s

real    0m4.433s
user    0m0.033s
sys     0m1.282s

real    0m6.897s
user    0m0.000s
sys     0m0.159s

$ time head -1000 input > output && time sed -i '1,+999d' input

real    0m0.121s
user    0m0.000s
sys     0m0.121s

real    0m26.944s
user    0m0.227s
sys     0m26.624s

This is the Linux box I was working with:

$ uname -a
Linux hostname 2.6.18-128.1.1.el5 #1 SMP Mon Jan 26 13:58:24 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

For this test, at least, it looks like the sed approach is slower than the tail approach (about 27 seconds vs. about 14 seconds in total).

Alex Reynolds
+2  A: 

Perl approach:

perl -ne 'if($i<1000) { print; } else { print STDERR;}; $i++;' in 1> out 2> in.new && mv in.new in
piotr
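Spelled out, the idea is to split a single pass over the file across the two output streams: the first 1000 lines go to one file and the remainder to the other, after which a single mv replaces the original with the remainder. A slightly more readable sketch using Perl's built-in line counter $. (same hypothetical file names):

perl -ne 'if ($. <= 1000) { print STDOUT $_ } else { print STDERR $_ }' in > out 2> in.new && mv in.new in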