I need to repeatedly remove the first line from a huge text file using a bash script.

Right now I am using sed -i -e "1d" $FILE, but it takes around a minute to do the deletion.

Is there a more efficient way to accomplish this?

A: 

There's no way to delete the first line of a file in place using these command-line tools, since they read the file as input and write a new file (a stream, actually) as output. Perl, AWK, or some weird trick with head and tail will have the same issue.

FWIW, my day job is working on PVRs, and we actually built a special file system (on top of a FAT32-like architecture) that natively supports chopping files from the beginning in linear time (OK, it does a lot more, but that's one of the main requirements).

Mikeage
+3  A: 

No, that's about as efficient as you're going to get. You could write a C program to do the job a little faster (less startup time and argument processing), but it will probably tend towards the same speed as sed as files get large (and I assume they're large if it's taking a minute).

But your question suffers from the same problem as so many others in that it presupposes the solution. If you were to tell us in detail what you're trying to do rather than how, we may be able to suggest a better option.

For example, if this is a file A that some other program B processes, one solution would be to not strip off the first line, but modify program B to process it differently.

Let's say all your programs append to this file A and program B currently reads and processes the first line before deleting it.

You could re-engineer program B so that it didn't try to delete the first line but maintains a persistent (probably file-based) offset into the file A so that, next time it runs, it could seek to that offset, process the line there, and update the offset.

Then, at a quiet time (midnight?), it could do special processing of file A to delete all lines currently processed and set the offset back to 0.

It will certainly be faster for a program to open a file and seek within it than to open and rewrite it. This discussion assumes you have control over program B, of course. I don't know if that's the case, but there may be other possible solutions if you provide further information.
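As a rough illustration, here is a minimal sketch of that offset idea in shell, assuming program B is a bash script and the offset lives in a sidecar file (all names here are illustrative, and the byte arithmetic assumes single-byte characters):

OFFSET_FILE="fileA.offset"
offset=$(cat "$OFFSET_FILE" 2>/dev/null || echo 0)
# Skip the bytes already processed and read the next unprocessed line
line=$(tail -c +$((offset + 1)) fileA | head -n 1)
if [ -n "$line" ]; then
    process_line "$line"                  # hypothetical processing step
    offset=$((offset + ${#line} + 1))     # advance past the line and its newline
    echo "$offset" > "$OFFSET_FILE"       # persist the offset for the next run
fi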

paxdiablo
A: 

Since it sounds like I can't speed up the deletion, I think a good approach might be to process the file in batches like this:

while [ -s file1 ]; do           # keep going until file1 is empty
  head -n 1000 file1 > file2     # copy the next batch of 1000 lines
  process file2                  # placeholder for the real processing step
  sed -i -e '1,1000d' file1      # delete the whole batch just processed (not only line 1000)
done

The drawback of this is that if the program gets killed in the middle (or if there's some bad SQL in there, causing the "process" part to die or lock up), there will be lines that are either skipped or processed twice.

(file1 contains lines of SQL code)

Brent
What does the first line contain? Can you just overwrite it with an SQL comment, as I suggested in my post?
Robert Gamble
A: 

As Pax said, you probably aren't going to get any faster than this. The reason is that almost no filesystems support truncating from the beginning of a file, so this is going to be an O(n) operation, where n is the size of the file. What you can do much faster, though, is overwrite the first line with the same number of bytes (maybe with spaces or a comment), which might work for you depending on exactly what you are trying to do (what is that, by the way?).
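As a minimal sketch of that overwrite idea (assuming blanking the line with spaces is acceptable; dd's conv=notrunc rewrites the leading bytes in place without copying the rest of the file):

len=$(head -n 1 "$FILE" | wc -c)    # bytes in the first line, including its newline
# Overwrite everything before the newline with spaces, in place
printf '%*s' "$((len - 1))" '' | dd of="$FILE" conv=notrunc 2>/dev/null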

Robert Gamble
A: 

Would using tail on the last N-1 lines, directing that into a new file, then removing the old file and renaming the new file to the old name do the job?

If I were doing this programmatically, I would read through the file and remember the file offset after reading each line, so I could seek back to that position and read the file with one less line in it.

EvilTeach
The first solution is essentially identical to what Brent is doing now. I don't understand your programmatic approach; only the first line needs to be deleted, so you would just read and discard the first line and copy the rest to another file, which is again the same as the sed and tail approaches.
Robert Gamble
The second solution implies that the file is not actually shrunk by the first line each time. The program simply processes it as if it had been shrunk, starting at the next line each time.
EvilTeach
I still don't understand what your second solution is.
Robert Gamble
+7  A: 

Try GNU tail:

tail -n +2 "$FILE"

tail is much faster than sed.
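Note that tail writes the result to standard output rather than editing the file in place; one common pattern is to redirect to a temporary file and rename it over the original (the .tmp name is just illustrative):

tail -n +2 "$FILE" > "$FILE.tmp" && mv "$FILE.tmp" "$FILE"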

Aaron Digulla
Spot on. Basic tool, guys.
dmckee
A: 

How about using csplit?

man csplit

csplit -k file 1 '{1}'
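For reference, here is a sketch of one way to strip the first line with csplit (xx00 and xx01 are csplit's default output names; -s suppresses the byte-count output):

csplit -s "$FILE" 2    # xx00 = the first line, xx01 = everything else
mv xx01 "$FILE"        # replace the original with the remainder
rm -f xx00             # discard the removed first line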

A: 

If what you are looking to do is recover after a failure, you could just build up a file that records what you've done so far.

if [[ -f $tmpf ]]; then
    rm -f "$tmpf"
fi
while read -r line; do
    # process line here
    echo "$line" >> "$tmpf"    # record the line as processed
done < "$srcf"
Tim