views: 185
answers: 2
I'm about to start on a project in which I foresee large files (mostly flat text files, but possibly CSV, fixed-width, XML, ...so far) that need to be edited. I need to develop the pieces to do this editing within the application.

In trying to determine a Good Way to handle editing large amounts of data (possibly into the GB range) without having to load the whole thing, I found that Audacity is able to handle large files quite well. Audacity is open source, so I thought it would make an excellent teaching tool for me in this circumstance. However, I started thinking myself in circles going through the code and now I'm thoroughly confused.

I'm hoping for two results from this question:

  1. A good way to handle this editing without loading the whole file. I thought about loading the data as the user edits it, caching it on demand.

  2. An explanation of how Audacity does it.

I'm using C# and .NET, but the answers don't need to be coupled to that environment.

+2  A: 

Sound files are basically a data stream, right? So you don't actually need to deal with the whole file at once. Audacity users may only work with a small snippet of that large file at any given moment.

Hypothetically, if you are adding a 1-second snippet of sound to a large sound file, you only have to deal with the entire file when you save, at which point you splice together three parts: before, the 1-second snippet, and after. So the only thing that needs to be in memory is the 1-second snippet, and perhaps a small portion of the sound before and after it.

So when you save, you read, say, 64 megabytes of the file at a time (if you are really aggressive) and stream that out to a temporary file until you reach your insertion point. Then you stream out the 1-second snippet, stream the remainder of the original file, close the temporary file, delete the original file, and rename the new file to the original file name.
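A minimal sketch of that streaming splice in Python (the function name and chunk size are illustrative, not anything Audacity actually does):

```python
import os

CHUNK = 64 * 1024 * 1024  # 64 MB reads, if you are really aggressive

def splice_save(path, insert_offset, snippet):
    """Write original[:insert_offset] + snippet + original[insert_offset:]
    to a temporary file, then rename it over the original."""
    tmp_path = path + ".tmp"
    with open(path, "rb") as src, open(tmp_path, "wb") as dst:
        remaining = insert_offset
        while remaining > 0:                 # stream everything before the insertion point
            chunk = src.read(min(CHUNK, remaining))
            if not chunk:
                break
            dst.write(chunk)
            remaining -= len(chunk)
        dst.write(snippet)                   # the edited snippet itself
        while True:                          # stream the remainder of the original
            chunk = src.read(CHUNK)
            if not chunk:
                break
            dst.write(chunk)
    os.replace(tmp_path, path)               # atomically replace the original

```

Only `snippet` ever has to live in memory; everything else passes through a fixed-size buffer.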

Of course, it's a little more complicated than this. There might be multiple edits before save, for example, and an undo buffer. But I can pretty much guarantee you that Audacity is limited in unsaved edit complexity by the amount of available RAM.

Robert Harvey
If I could split the credit for the accepted answer, I'd definitely give you some.
oakskc
Good answer for the editing part of the problem -- you've certainly earned my upvote. One solution to the problem of keeping changes in memory is to keep a changelog (or changelogs) on disk which is append-only. As your changelog exceeds memory or new changes invalidate old ones, you write them to the disk changelog. If you periodically write out all unwritten changes from memory to log file, you can also use it to provide security against a program crash. The end result will resemble a log-structured filesystem.
BobMcGee
+2  A: 

Several tricks can make editing simpler and faster.

  1. INDEX it for faster access. While the user is doing nothing, skim through the file and create an index so you can quickly find a specific spot in the file (see below).
  2. Only store changes the user makes. Don't try to apply them directly to the file until the user saves.
  3. Set a limit on how much to read into memory when the user jumps to a point. Read one or two screens of data initially so you can display it, and then if the user doesn't jump to a new spot immediately, read a bit before and a bit after the current spot.

Indexing:

When the user wants to jump to line X or timestamp T, you don't want to skim the whole file counting line breaks and characters. Instead, skim the data once and build an index as you go: every 50 lines, say, record the byte offset, character count, and line number. The index can be stored in a hashtable, a tree, or just an ordered list. Then when the user jumps within the file, you find the nearest index entry and read from there until you reach the requested point. This technique is especially useful when working with Unicode, where the number of bytes per character may vary. If files are so large that a full index won't fit in memory, you may want to space the index points more widely, or store the index in a temporary file.

Editing and altering big files:

As suggested by Harvey -- only store changes in memory (as diffs), and then apply them to the file when saved by streaming from input to output. A tree or ordered list may be helpful, so you can quickly find the next place where you need to make a change when writing from input to output.

If changes are too large to fit in memory, you may want to track them in a separate temporary file (perhaps in the same folder as the original). You can just keep writing a continuous list of changes, with new ones appended to this change file. When you save, you'll read through the change list and create a final list of alterations to apply, before deleting the temp file. For performance reasons, it may be helpful to avoid rewriting the change log file; instead, just append to the end of it, and remove redundant or cancelling edits when performing a save.

Interesting fact: the same structures you use for the change log can be used to provide Undo/Redo information.
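For instance, if each record also keeps the bytes it replaced, the same records drive undo and redo with two stacks (a sketch; the class and its fields are illustrative):

```python
class UndoStack:
    """Undo/redo built from the same edit records as the change log:
    each record stores enough to invert itself (offset, old bytes, new bytes)."""

    def __init__(self):
        self.undo = []
        self.redo = []

    def record(self, offset, old, new):
        self.undo.append((offset, old, new))
        self.redo.clear()                # a fresh edit invalidates the redo chain

    def undo_one(self):
        offset, old, new = self.undo.pop()
        self.redo.append((offset, old, new))
        return offset, new, old          # apply by replacing `new` with `old`

    def redo_one(self):
        offset, old, new = self.redo.pop()
        self.undo.append((offset, old, new))
        return offset, old, new          # reapply the original edit

```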

BobMcGee
I was picturing the change log possibly acting as an undo list, and then you confirmed it there at the end. Thank you for the very comprehensive answer!
oakskc