Our logging class, when initialised, truncates the log file to 500,000 bytes. From then on, log statements are appended to the file.

We do this to keep disk usage low; we're a commodity end-user product.

Obviously keeping the first 500,000 bytes is not useful, so we keep the last 500,000 bytes.

Our solution has a serious performance problem. What is an efficient way to do this?

A: 

I don't think this is anything computer related, but rather how you guys have written your logging class. It sounds strange to me that you read the last 500k into a string; why would you do that?

Just append to the logfile.

  #include <fstream>

  std::fstream myfile;
  myfile.open("test.txt", std::ios::app);  // ios::app: every write goes to the end of the file
gbjbaanb
We're truncating the file to stop it getting too big.
Max Howell
So when the app starts up, we truncate the log and then append from then onwards.
Max Howell
Here's an idea: start a new logfile each time. Then you have historical logs that won't be overwritten. You can also delete the oldest log (to keep the last 3, say) and you will be in log heaven.
gbjbaanb
Yeah, a good suggestion, and I've toyed with this. We may do it. But I have reservations, e.g. describing the steps for people to mail more than one file is hard. And we'd need to keep x days of logs, not x logs, as some people restart the app constantly.
Max Howell
The alternative is to lose data that you might want, even if you truncate the logfile. Much easier to restart the log daily, or on app restart. You can keep as much log as you need (days or lines), and mailing the log is easy: mail all files called "mylog_yyyy_mm_dd_n.log". Simple and highly effective.
gbjbaanb
A: 

Widefinder 2 has a lot of talk about efficient IO (or, more accurately, the links under the "Notes" column have a lot of information about efficient IO).

Answering your question:

  1. (Title) Remove first 500,000 bytes from a file with the STL

The STL is very limited when it comes to filesystem operations. If you're not limited to the STL you can end a file prematurely very easily (that is, say "everything after this point is no longer part of this file"), but it's very hard to start a file late ("everything before this point is no longer part of this file").

It would be efficient to simply seek 500,000 bytes into the file and then start a buffered copy to a new file. But once you've done that, the STL doesn't have a ready-made "rename this file" function. Native OS functions can rename files efficiently, as can Boost.Filesystem or STLSoft.
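
For what it's worth, <cstdio> (the C standard library rather than the STL proper) does provide std::rename; a minimal sketch of that final step, assuming the truncated copy has already been written to "logfile.new":

#include <cstdio>

// Swap the truncated copy into place. On POSIX this atomically replaces
// an existing "logfile"; on Windows rename() fails if "logfile" still
// exists, so remove the original first there.
if (std::rename("logfile.new", "logfile") != 0)
    std::perror("rename");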

  2. (Actual question) Our logging class, on initialisation, seeks to 500,000 bytes before the end of the file, copies the rest to a std::string and then writes that back to the file.

In this case you're dropping the tail of the file: after the last 500,000 bytes have been written back to the start, everything after that point needs to go, and that's very easy to do outside the STL. Simply use the native filesystem operations to set the file size to 500,000 bytes; anything after that will be ignored.
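
As a sketch of that last step, assuming a POSIX system (Boost.Filesystem's resize_file does the same job portably; on Windows the equivalent is SetFilePointer followed by SetEndOfFile):

#include <unistd.h>  // POSIX, not part of the STL

// After writing the last 500,000 bytes back to the front of the file,
// cut the file off at that point; the old tail is discarded.
truncate("logfile", 500000);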

Max Lybbert
Good news: TR2 will finally have features along the lines of Boost.Filesystem.
Jasper Bekkers
A: 

So you want the end of the file: you are copying that to some sort of buffer to do what with it? What do you mean by 'writes that back' to the file? Do you mean that it overwrites the file, truncating on init to the last 500k bytes of the original, plus whatever it adds afterwards?

Suggestions:

  • Rethink what you are doing. If this works and is what is desired, what is wrong with it? Why change? Is there a performance problem? Are you starting to wonder where all your log entries went? For this type of question it helps most to describe the actual problem rather than just the existing behaviour; no one can fully comment unless they know the complete problem, because it is subjective.

  • If it were me and I were tasked with reworking your logging mechanism, I'd build in a mechanism to roll the log files over based on either age or size.

Klathzazt
We managed to create a multi-gigabyte log file. The program never successfully started up because our truncation code was too slow. GDB suggested it was stuck in seekg(), seeking to 500,000 bytes before the end of the file.
Max Howell
And yes, what I am trying to say is we take the last 500,000 bytes of the file and make that the whole file.
Max Howell
And while you are right that context would be interesting or even helpful, I deliberately gave none because really I am interested in the purely academic exploration of the solution.
Max Howell
+3  A: 

I would probably:

  • create a new file.
  • seek in the old file.
  • do a buffered read/write from old file to new file.
  • rename the new file over the old one.

To do the first three steps (error-checking omitted; for example, I can't remember what seekg does if the file is less than 500k big):

#include <fstream>

std::ifstream ifs("logfile");
ifs.seekg(-500*1000, std::ios_base::end);  // position 500,000 bytes before the end
std::ofstream ofs("logfile.new");
ofs << ifs.rdbuf();                        // buffered copy from there to EOF

Then I think you have to do something non-standard to rename the file.

Obviously you need 500k disk space free for this to work, though, so if the reason you're truncating the log file is because it has just filled the disk, this is no good.

I'm not sure why the seek is slow, so I may be missing something. I would not expect seek time to depend on the size of the file. What may depend on the file is that I'm not sure whether these functions handle 2GB+ files on 32-bit systems.

If the copy itself is slow, then depending on platform you might be able to speed it up by using a bigger buffer, since this reduces the number of system calls and perhaps more importantly the number of times the disk head has to seek between the read point and the write point. To do this:

const int bufsize = 64*1024; // or whatever
std::vector<char> buf(bufsize);
...
// note: on many implementations pubsetbuf only takes effect if it is
// called before the file is opened
ifs.rdbuf()->pubsetbuf(&buf[0], bufsize);

Test it with different values and see. You could also try increasing the buffer for the ofstream; I'm not sure whether that will make a difference.

Note that using my approach on a "live" logging file is hairy. For example, if a log entry is appended between the copy and the rename, then you lose it forever; and any open handles on the file you're trying to replace could cause problems (the rename will fail on Windows, while on Linux it will replace the file, but the old one will still occupy space, and still be written to, until the handle is closed).

If the truncation is done from the same thread which is doing all the logging, then there's no problem and you can keep it simple. Otherwise you'll need to use a lock, or a different approach.

Whether this is entirely robust depends on platform and filesystem: move-and-replace may or may not be an atomic operation, but usually isn't, so you may have to rename the old file out of the way, then rename the new file, then delete the old one, and have error-recovery which on startup detects a leftover renamed old file and, if so, puts it back and restarts the truncate. The STL can't help you deal with platform differences, but there is boost::filesystem.
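
A hypothetical sketch of that recovery dance, using std::rename and std::remove from <cstdio> (the filenames are illustrative):

#include <cstdio>

std::rename("logfile", "logfile.old");   // 1. move the old log out of the way
std::rename("logfile.new", "logfile");   // 2. move the truncated copy into place
std::remove("logfile.old");              // 3. delete the old log
// On startup, if "logfile.old" exists the sequence was interrupted:
// rename it back to "logfile" and restart the truncate.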

Sorry there are so many caveats here, but a lot depends on platform. If you're on a PC, then I'm mystified why copying a measly half meg of data takes any time at all.

Steve Jessop
+5  A: 

"I would probably create a new file, seek in the old file, do a buffered read/write from old file to new file, rename the new file over the old one."

I think you'd be better off simply:

#include <fstream>
std::ifstream ifs("logfile");  //One call to start it all. . .
ifs.seekg(-512000, std::ios_base::end);  // One call to find it. . .
char tmpBuffer[512000];
ifs.read(tmpBuffer, 512000);  //One call to read it all. . .
ifs.close();
std::ofstream ofs("logfile", std::ios::trunc);
ofs.write(tmpBuffer, 512000); //And to the FS bind it.

This avoids the file rename stuff by simply copying the last 512K to a buffer, opening your logfile in truncate mode (clears the contents of the logfile), and spitting that same 512K back into the beginning of the file.

Note that the above code hasn't been tested, but I think the idea should be sound.

You could load the 512K into a buffer in memory, close the input stream, then open the output stream; in this way, you wouldn't need two files, since you'd input, close, open, output the 512K, then go. You avoid the rename / file relocation magic this way.

If you don't have an aversion to mixing C with C++ to some extent, you could also perhaps:

(Note: needs <fcntl.h>, <sys/mman.h>, <sys/stat.h> and <unistd.h>; mmap offsets must be page-aligned, hence the rounding)

int myfd = open("mylog", O_RDONLY);                  // Grab a file descriptor
struct stat st;
fstat(myfd, &st);                                    // Find the file size
off_t start = st.st_size - 512000;                   // Start of the last 512K
off_t slop = start % sysconf(_SC_PAGESIZE);          // Round down to a page boundary
char *myptr = (char *) mmap(NULL, 512000 + slop, PROT_READ,
                            MAP_PRIVATE, myfd, start - slop); // mmap the last 512K
std::string mystr(myptr + slop, 512000); // pull 512K from our mmap'd buffer and load it directly into the std::string
munmap(myptr, 512000 + slop);                        // Unmap the file
close(myfd);                                         // Close the file descriptor

Depending on many things, mmap could be faster than seeking. Googling 'fseek vs mmap' yields some interesting reading about it, if you're curious.

HTH

I have no objection to this other than that it is hard to make it robust against failure. I mean, it's not exactly trivial to make my version robust against failure either, but only because it depends on characteristics of the platform and filesystem. Once you know those, you can do the necessary work.
Steve Jessop
Oh yes, and because of the amount of embedded and mobile programming I've done in the past, putting a 500k buffer on the stack makes my teeth squirm ;-)
Steve Jessop
I have a feeling that a 500K buffer on the stack will break a couple of compilers (stack frame size is compiler controlled). Use a vector<char>.
Martin York
Re: onebyone: Yes, using the heap is probably a better idea :P Use new char[512000] instead. Re: Martin: I don't like the idea of a vector<char>. There's no reason to use a std::vector where a standard array will do; vectors have overhead, and you don't need to ever resize this array. . .
In this case, the only "overhead" of the vector is that the size, pointer, and capacity might be written to the stack. I suppose if you prefer exception-non-safety, a vector might be considered unnecessary overhead, but you get the RAII more-or-less free on most compilers.
Steve Jessop
So I'd say the exact opposite: that there's no need to create an array with new when a vector will do.
Steve Jessop
+1  A: 

If you can generate a logfile of several GB between reinitializations, it seems that truncating the file only at initialization will not really help.

I think that I would try to come up with a specialized text file format in order to always replace contents in place, with a pointer to the "current" line wrapping around. You would need a constant line width so that the disk space is allocated just once, and you would store the pointer in either the first or the last line of the file.

This way, the file would never grow or shrink, and you would always have the last N entries.

Illustration with N=6 ("|" indicates space padding up to the fixed line width):

#myapp logfile, lines = 6, width = 80, pointer = 4                              |
[2008-12-01 15:23] foo bakes a cake                                             |
[2008-12-01 16:15] foo has completed baking a cake                              |
[2008-12-01 16:16] foo eats the cake                                            |
[2008-12-01 16:17] foo tells bar: I have made you a cake, but I have eaten it   |
[2008-12-01 13:53] bar would like some cake                                     |
[2008-12-01 14:42] bar tells foo: sudo bake me a cake                           |
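
A minimal sketch of appending to such a file, assuming the layout illustrated above (80-column lines, header line 0 holding the pointer); the constants and the function name are mine, not an established API:

#include <fstream>
#include <string>

const int LINES = 6, WIDTH = 80;  // N entries of constant width

// Overwrite the oldest entry in place and advance the wrapped pointer.
void append_entry(const std::string& msg, int& pointer) {
    std::fstream f("mylog", std::ios::in | std::ios::out);
    std::string line = msg.substr(0, WIDTH - 1);
    line.resize(WIDTH - 1, ' ');            // pad with spaces to the fixed width
    f.seekp((1 + pointer) * WIDTH);         // skip the header line
    f << line << '\n';
    pointer = (pointer + 1) % LINES;        // wrap around
    f.seekp(0);                             // update the pointer field in the header
    f << "#myapp logfile, lines = 6, width = 80, pointer = " << pointer;
}

Reading the header back on startup restores the pointer, so the file never grows and always holds the last N entries.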
Svante
+1  A: 

An alternative solution would be to have the logging class detect when the log file size exceeds 500k, open a new log file, and close the old one.

Then the logging class would look at the old files and delete the oldest one.

The logger would have two configuration parameters.

  1. 500k for the threshold of when to start a new log
  2. the number of old logs to keep around.

That way, the logging file management would be a self-maintaining thing.
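
A rough sketch of how those two parameters might drive the rollover (the numbered-file scheme and names are illustrative):

#include <cstdio>
#include <fstream>
#include <string>

const long MAX_BYTES = 500000;  // parameter 1: threshold for starting a new log
const int  KEEP      = 3;       // parameter 2: number of old logs to keep

// Call before each write: if the current log is full, shift
// log.0 -> log.1 -> ... and delete the oldest.
void rollover(std::ofstream& out) {
    if (static_cast<long>(out.tellp()) < MAX_BYTES) return;
    out.close();
    std::remove(("log." + std::to_string(KEEP - 1)).c_str());
    for (int i = KEEP - 2; i >= 0; --i)
        std::rename(("log." + std::to_string(i)).c_str(),
                    ("log." + std::to_string(i + 1)).c_str());
    out.open("log.0", std::ios::trunc);  // fresh current log
}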

EvilTeach
+3  A: 

If you happen to use Windows, don't bother copying parts around. Simply tell Windows you don't need the first bytes any more, by marking the file sparse (FSCTL_SET_SPARSE) and then zeroing the unwanted range (FSCTL_SET_ZERO_DATA), both issued through DeviceIoControl.
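
A minimal sketch of that, assuming the log lives on NTFS (which supports sparse files); the zeroed range reads back as zeroes but stops occupying disk space:

#include <windows.h>

HANDLE h = CreateFileA("logfile", GENERIC_READ | GENERIC_WRITE,
                       0, NULL, OPEN_EXISTING, 0, NULL);
DWORD bytes;
// Mark the file sparse, then deallocate its first 500,000 bytes.
DeviceIoControl(h, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, &bytes, NULL);
FILE_ZERO_DATA_INFORMATION zero;
zero.FileOffset.QuadPart = 0;            // start of the range to zero
zero.BeyondFinalZero.QuadPart = 500000;  // first byte NOT zeroed
DeviceIoControl(h, FSCTL_SET_ZERO_DATA, &zero, sizeof(zero),
                NULL, 0, &bytes, NULL);
CloseHandle(h);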

MSalters