I have a situation in some code where a huge function parses records line by line, validates them, and writes them to another file.

If there are errors in a record, it calls another function that rejects the record and writes out the reject reason.

Due to a memory leak in the program, it crashes with SIGSEGV. One suggested way to "restart" processing of the file from where it crashed was to write the number of the last processed record to a simple file.

To achieve this, the current record number needs to be written to that file inside the processing loop. How do I make sure the data in the file is overwritten on each iteration of the loop?

Does using fseek to the first position / rewind inside the loop degrade performance?

The number of records can be large at times (up to 500K).

Thanks.

EDIT: The memory leak has already been fixed. The restart solution was suggested as an additional safety measure and a means to provide a restart mechanism, along with a SKIP n records option. Sorry for not mentioning this earlier.

+2  A: 

If you can change the code to have it write the last processed record to a file, why can't you change it to fix the memory leak?

It seems to me to be a better solution to fix the root cause of the problem rather than treat the symptoms.

fseek() and fwrite() will degrade performance, but nowhere near as much as an open/write/close-type operation.

I'm assuming you'll be storing the ftell() value in the second file (so you can pick up where you left off). You should always fflush() the file as well, to ensure the data is written from the C runtime library down to the OS buffers. Otherwise, when the SEGV hits, the value on disk won't be up to date.
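
A minimal sketch of that idea, assuming a line-oriented input file and using hypothetical file names (error handling omitted):

    #include <stdio.h>

    int main(void)
    {
        FILE *input = fopen("records.dat", "r");      /* hypothetical input file   */
        FILE *ckpt  = fopen("checkpoint.txt", "w");   /* truncated once at startup */
        char line[4096];

        while (fgets(line, sizeof line, input) != NULL) {
            /* ... validate and write the record (or the reject) here ... */

            long pos = ftell(input);   /* offset of the next unread record     */
            rewind(ckpt);              /* overwrite the previously saved value */
            fprintf(ckpt, "%ld\n", pos);
            fflush(ckpt);              /* push it from the C library to the OS */
        }

        fclose(ckpt);
        fclose(input);
        return 0;
    }

Note that fflush only pushes the data down to the OS; surviving an OS-level crash as well would additionally require fsync(fileno(ckpt)), at further cost.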

paxdiablo
Sorry, I forgot to mention this. The memory leak has been fixed. This is an additional safety measure that has been requested by the client.
prabhu
You should edit your question to note that.
Dana the Sane
Then just make the customer aware there will be a processing delay. Checkpointing is not a free operation. You may have to do some benchmarking to let them know what the impact will be.
paxdiablo
He does not need to checkpoint at every record, and in some instances (e.g. all rejects, or no rejects) not at all. There is enough state information in the output files themselves. :)
vladr
+2  A: 

Rather than writing out the entire record, it would probably be easier to call ftell() at the beginning of each record and write out the position of the file pointer. When you have to restart the program, fseek() to the last written position in the input file and continue.
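
A sketch of that restart path, with hypothetical file names and no error handling:

    #include <stdio.h>

    int main(void)
    {
        FILE *input = fopen("records.dat", "r");     /* hypothetical input file          */
        FILE *ckpt  = fopen("checkpoint.txt", "r");  /* file holding the saved ftell pos */
        long pos = 0L;

        /* If a checkpoint exists, reposition the input at the saved offset. */
        if (ckpt != NULL) {
            if (fscanf(ckpt, "%ld", &pos) == 1)
                fseek(input, pos, SEEK_SET);
            fclose(ckpt);
        }

        /* ... resume the normal processing loop from here ... */

        fclose(input);
        return 0;
    }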

Of course, fixing the memory leak would be best ;)

Dana the Sane
A: 

If you write the last processed position for every record, this will have a noticeable impact on performance, because you will need to commit the write (typically by closing the file) and then reopen the file again. In other words, the fseek is the least of your worries.

jdigital
fflush (and possibly fsync) may be faster than closing and reopening the file, but not much, I suspect.
paxdiablo
+2  A: 

When faced with this kind of problem, you can adopt one of two methods:

  1. the method you suggested: for each record you read, write out the record number (or the position returned by ftell on the input file) to a separate bookmark file. To ensure that you resume exactly where you left off, so as not to introduce duplicate records, you must fflush after every write (to both the bookmark and the output/reject files.) This, and unbuffered write operations in general, slow down the typical (no-failure) scenario significantly. For completeness' sake, note that you have three ways of writing to your bookmark file:
    • fopen(..., "w") / fwrite / fclose - extremely slow
    • rewind / truncate / fwrite / fflush - marginally faster
    • rewind / fwrite / fflush - somewhat faster; you may skip truncate, since the record number (or ftell position) will always be as long as or longer than the previous record number (or ftell position) and will completely overwrite it, provided you truncate the file once at startup (this answers your original question)
  2. assume everything will go well in most cases; when resuming after failure, simply count the number of records already output (normal output plus rejects) and skip an equivalent number of records in the input file.
    • This keeps the typical (no-failure) scenario very fast, without significantly compromising performance in resume-after-failure scenarios.
    • You do not need to fflush files, or at least not as often. You still need to fflush the main output file before switching to writing to the rejects file, and fflush the rejects file before switching back to writing to the main output file (probably a few hundred or thousand times for a 500k-record input.) Simply remove the last unterminated line from the output/reject files; everything up to that line will be consistent.

I strongly recommend method #2. The writing entailed by method #1 (whichever of the three possibilities) is extremely expensive compared to any additional (buffered) reads required by method #2 (fflush can take several milliseconds; multiply that by 500k and you get minutes, whereas counting the number of lines in a 500k-record file takes mere seconds and, what's more, the filesystem cache is working with you, not against you, on that.)
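
As a rough illustration of method #2's write path (the EDIT below explains why the switch-time flush matters), here is a sketch with a hypothetical record_is_valid() check standing in for the real validation:

    #include <stdio.h>

    /* Hypothetical validity check: here a record is "valid" unless it
     * starts with '#'. Substitute the real validation logic. */
    static int record_is_valid(const char *record)
    {
        return record[0] != '#';
    }

    /* Write one record, flushing only when switching between the output
     * and rejects streams (the rule described in method #2 above). */
    static void write_record(FILE *out, FILE *rej, const char *record,
                             int *last_was_reject)
    {
        int reject = !record_is_valid(record);

        if (reject != *last_was_reject) {
            /* Switching streams: flush the stream written so far, so a
             * crash cannot leave the two files mutually inconsistent. */
            fflush(reject ? out : rej);
            *last_was_reject = reject;
        }

        fputs(record, reject ? rej : out);
    }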


EDIT: Just wanted to clarify the exact steps you need to take to implement method #2:

  • when writing to the output and rejects files respectively, you only need to flush when switching from writing to one file to writing to the other. Consider the following scenario as an illustration of the necessity of these flushes-on-file-switch:

    • suppose you write 1000 records to the main output file, then
    • you have to write 1 line to the rejects file, without manually flushing the main output file first, then
    • you write 200 more lines to the main output file, without manually flushing the rejects file first, then
    • the runtime automatically flushes the main output file for you because you have accumulated a large volume of data in the buffers for the main output file, i.e. 1200 records
      • but the runtime has not yet automatically flushed the rejects file to disk for you, as the file buffer only contains one record, which is not sufficient volume to automatically flush
    • your program crashes at this point
    • you resume and count 1200 records in the main output file (the runtime flushed those out for you), but 0 (!) records in the rejects file (not flushed).
    • you resume processing the input file at record #1201, assuming you only had 1200 records successfully processed to the main output file; the rejected record would be lost, and the 1200th valid record would be repeated
    • you do not want this!
  • now consider manually flushing after switching output/reject files:
    • suppose you write 1000 records to the main output file, then
    • you encounter one invalid record which belongs to the rejects file; the last record was valid; this means you're switching to writing to the rejects file: flush the main output file before writing to the rejects file
    • you now write 1 line to the rejects file, then
    • you encounter one valid record which belongs to the main output file; the last record was invalid; this means you're switching to writing to the main output file: flush the rejects file before writing to the main output file
    • you write 200 more lines to the main output file, with no further manual flushes, then
    • assume that the runtime did not automatically flush anything for you, because 200 records buffered since the last manual flush on the main output file are not enough to trigger an automatic flush
    • your program crashes at this point
    • you resume and count 1000 valid records in the main output file (you manually flushed those before switching to the rejects file), and 1 record in the rejects file (you manually flushed before switching back to the main output file).
    • you correctly resume processing the input file at record #1002, which is the first valid record immediately after the invalid record.
    • you reprocess the next 200 valid records because they were not flushed, but you get no missing records and no duplicates either
  • if you are not happy with the interval between the runtime's automatic flushes, you may also do manual flushes every 100 or every 1000 records. This depends on whether processing a record is more expensive than flushing or not (if processing is more expensive, flush often, maybe after each record; otherwise only flush when switching between output/rejects.)

  • resuming from failure

    • open the output file and the rejects file for both reading and writing, and begin by reading and counting each record (say into records_resume_counter) until you reach the end of file (a sketch of this counting step follows this list)
    • unless you were flushing after each record you output, you will also need to perform a bit of special treatment for the last record in both the output and rejects files:
      • before reading a record from the interrupted output/rejects file, remember the position you are at in that file (use ftell); let's call it last_valid_record_ends_here
      • read the record. Validate that the record is not a partial record (i.e. that the runtime has not flushed the file only up to the middle of a record).
      • if you have one record per line, this is easily verified by checking that the last character in the record is a line feed or carriage return (\n or \r)
        • if the record is complete, increment the records counter and proceed with the next record (or the end of file, whichever comes first.)
        • if the record is partial, fseek back to last_valid_record_ends_here and stop reading from this output/rejects file; do not increment the counter; proceed to the next output or rejects file unless you've gone through all of them
    • open the input file for reading and skip records_resume_counter records from it
      • continue processing and outputting to the output/rejects file; this will automatically append to the output/rejects file where you left off reading/counting already processed records
      • if you had to perform special processing for partial record flushes, the next record you output will overwrite the partial information from the previous run (at last_valid_record_ends_here) - you will have no duplicate, garbage, or missing records.
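
A sketch of the counting step for one of the output/rejects files, assuming one newline-terminated record per line and records shorter than the buffer (the helper name is hypothetical); it returns the number of complete records and leaves the stream positioned just after the last complete one, so the next write overwrites any partial tail:

    #include <stdio.h>
    #include <string.h>

    /* Count complete (newline-terminated) records in a file opened with
     * mode "r+", and reposition the stream just after the last complete
     * record (last_valid_record_ends_here in the text above). */
    static long count_complete_records(FILE *fp)
    {
        char buf[4096];
        long count = 0;
        long last_valid_record_ends_here = ftell(fp);

        while (fgets(buf, sizeof buf, fp) != NULL) {
            size_t len = strlen(buf);
            if (len > 0 && buf[len - 1] == '\n') {
                count++;                                  /* complete record */
                last_valid_record_ends_here = ftell(fp);
            } else {
                break;                                    /* partial record  */
            }
        }

        fseek(fp, last_valid_record_ends_here, SEEK_SET);
        return count;
    }

The input file would then be skipped forward by the sum of the counts from the output and rejects files before normal processing resumes.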

Cheers, V.

vladr
Wow, this one wins by sheer weight! :-)
paxdiablo
Thanks a ton for the eye-openers! You rock!!
prabhu
A: 

I would stop digging a deeper hole and just run the program through Valgrind. Doing so should help you track down the leak, as well as other problems.

Tim Post