In Windows, is it possible through an API to write to the middle of a file without overwriting any data and without having to rewrite everything after that point?

If it's possible, I assume it will fragment the file; how many times can I do it before fragmentation becomes a serious problem?

If it's not possible, what approach or workaround is usually taken? Rewriting everything after the insertion point becomes prohibitive really quickly with big (i.e., multi-gigabyte) files.


Note: I can't avoid having to write to the middle. Think of the application as a text editor for huge files where the user types stuff and then saves. I also can't split the files into several smaller ones.

+6  A: 

I'm unaware of any way to do this if the interim result you need is a flat file that can be used by applications other than the editor. If you want a flat file to be produced, you will have to update it from the change point to the end of the file, since it's really just a sequential file.

But that condition is there for good reason. If you can control the file format, you have some options. Some versions of MS Word had a quick-save feature where they didn't rewrite the entire document; rather, they appended a delta record to the end of the file. Then, when re-reading the file, Word applied all the deltas in order so that what you ended up with was the right file. This obviously won't work if the saved file has to be usable immediately by another application that doesn't understand the file format.

What I'm proposing there is to not store the file as text. Use an intermediate form that you can efficiently edit and save, then have a step which converts that to a usable text file infrequently (e.g., on editor exit). That way, the user can save as much as they want but the time-expensive operation won't have as much of an impact.
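
To make the delta idea concrete, here is a minimal sketch in Python. The container layout (an 8-byte base length, the base text, then records of offset, delete count, insert length and payload) is my own assumption for illustration, not Word's actual format, and the names `create`, `quick_save`, `load` and `export_flat` are hypothetical. It also loads everything into memory, which a real editor for gigabyte files would avoid.

```python
import struct

HEADER = struct.Struct("<Q")    # assumed layout: length of the base text
DELTA = struct.Struct("<QQI")   # assumed layout: offset, bytes to delete, bytes to insert

def create(path, base: bytes) -> None:
    """Write a fresh container holding only the base text."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(base)))
        f.write(base)

def quick_save(path, offset: int, delete: int, insert: bytes) -> None:
    """Append one delta record; nothing already in the file is rewritten."""
    with open(path, "ab") as f:
        f.write(DELTA.pack(offset, delete, len(insert)))
        f.write(insert)

def load(path) -> bytes:
    """Rebuild the current text by replaying every delta in order."""
    with open(path, "rb") as f:
        (base_len,) = HEADER.unpack(f.read(HEADER.size))
        text = bytearray(f.read(base_len))
        while True:
            head = f.read(DELTA.size)
            if len(head) < DELTA.size:
                break               # no more complete records
            offset, delete, insert_len = DELTA.unpack(head)
            text[offset:offset + delete] = f.read(insert_len)
    return bytes(text)

def export_flat(path, flat_path) -> None:
    """The infrequent, expensive step: write a plain file other tools can read."""
    with open(flat_path, "wb") as f:
        f.write(load(path))
```

Each `quick_save` costs only the size of the edit, while `export_flat` is the full rewrite you would run rarely, for example on editor exit.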

Beyond that, there are some other possibilities.

Memory-mapping (rather than loading) the file may provide efficiencies which would speed things up. You'd probably still have to rewrite to the end of the file, but it would be happening at a lower level in the OS.
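
As a rough illustration of the memory-mapped variant, the following Python sketch grows the file, maps it, and shifts the tail with a single in-memory move. The function name `insert_with_mmap` is hypothetical, and this is only one way to do it; every byte after the insertion point still gets moved, the OS just handles the paging.

```python
import mmap
import os

def insert_with_mmap(path, offset: int, data: bytes) -> None:
    """Insert data at offset by growing the file and shifting the tail in place."""
    old_size = os.path.getsize(path)
    if not 0 <= offset <= old_size:
        raise ValueError("offset outside file")
    if not data:
        return
    with open(path, "r+b") as f:
        f.truncate(old_size + len(data))     # grow the file first, then map it
        with mmap.mmap(f.fileno(), 0) as mm:
            # move() copes with overlapping source and destination ranges
            mm.move(offset + len(data), offset, old_size - offset)
            mm[offset:offset + len(data)] = data
            mm.flush()
```

On a 32-bit process a multi-gigabyte file would have to be mapped in windows rather than all at once; the sketch ignores that.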

If the primary reason you want a fast save is to let the user keep working (rather than to make the file available to another application), you could farm the save operation out to a separate thread and return control to the user immediately. You would then need synchronisation between the two threads to prevent the user modifying data that has yet to be saved to disk.
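
A minimal sketch of the background save, in Python. Note that it swaps in a slightly different technique from plain locking: it takes an immutable snapshot under a lock and writes that, so the user can keep editing while the worker runs. The `Document` class and the `.tmp` naming are assumptions for illustration, and copying the whole buffer is itself not free for huge files.

```python
import os
import threading

class Document:
    def __init__(self, text: bytes = b""):
        self._buf = bytearray(text)
        self._lock = threading.Lock()

    def insert(self, offset: int, data: bytes) -> None:
        """An edit from the UI thread; briefly takes the same lock as the snapshot."""
        with self._lock:
            self._buf[offset:offset] = data

    def save_async(self, path) -> threading.Thread:
        """Snapshot under the lock, then write the snapshot on a worker thread."""
        with self._lock:
            snapshot = bytes(self._buf)   # immutable copy; later edits cannot touch it

        def worker():
            tmp = str(path) + ".tmp"      # hypothetical temp-file naming
            with open(tmp, "wb") as f:
                f.write(snapshot)
            os.replace(tmp, path)         # swap in the finished file

        thread = threading.Thread(target=worker)
        thread.start()
        return thread
```

The caller gets the thread back so it can `join()` before exiting or before starting another save to the same path.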

paxdiablo
+1 for memory mapping; still, be careful with formats like quick-saved Word documents: eventually you'll get a huge file filled with old data. This can be a problem since (1) it wastes disk space and (2) data that the user thought deleted will still be there, so an apparently empty file may still contain sensitive information. IIRC, for these reasons Microsoft turned off the quick-save feature by default in one of the later versions of Office (it may have been 2003, but I'm not sure): with disks much faster than before, the disadvantages of this technique outweighed the advantages.
Matteo Italia
I think from memory Word had a threshold beyond which it would write the real file rather than another delta, which would solve the first problem. But you're right about the sensitive data, I've seen stuff in documents that was not meant to be seen :-)
paxdiablo
+2  A: 

I'm not sure about the format of your file but you could make it 'record' based.

  • Write your data in chunks and give each chunk an id.
  • The id could be the chunk's data offset in the file.
  • At the start of the file you could have a header with a list of ids so that you can read the records in order.
  • At the end of the 'list of ids' you could point to another location in the file (an id/offset) that stores another list of ids.

Something similar to a filesystem.

To add new data you append it at the end and update the index (add the id to the list).

You have to figure out how to handle deleting and updating records.

If records are all the same size, then to delete one you can just mark it empty and reuse it next time, with appropriate updates to the index table.
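
The scheme above can be sketched roughly like this in Python. It is a simplified variation, not exactly the chained lists described: chunk ids are file offsets as suggested, but instead of linking index lists I append a fresh copy of the whole index on each insert and repoint an 8-byte header at it. `ChunkFile`, `read_all` and the byte layout are all assumptions for illustration.

```python
import struct

PTR = struct.Struct("<Q")   # header: offset of the newest index block
LEN = struct.Struct("<I")   # length prefix for chunks and for the index

class ChunkFile:
    def __init__(self, path):
        self.path = path
        self.order = []                    # chunk ids (offsets) in logical order
        with open(path, "wb") as f:
            f.write(PTR.pack(0))           # 0 means "no index yet"

    def insert(self, position: int, data: bytes) -> None:
        """Place a new chunk at a logical position without moving existing chunks."""
        with open(self.path, "r+b") as f:
            f.seek(0, 2)                   # append the chunk at end of file
            record_id = f.tell()
            f.write(LEN.pack(len(data)) + data)
            self.order.insert(position, record_id)
            index_at = f.tell()            # append a fresh copy of the index
            f.write(LEN.pack(len(self.order)))
            f.write(struct.pack(f"<{len(self.order)}Q", *self.order))
            f.seek(0)
            f.write(PTR.pack(index_at))    # repoint the header last

def read_all(path) -> bytes:
    """Follow the header to the index, then read the chunks in logical order."""
    with open(path, "rb") as f:
        (index_at,) = PTR.unpack(f.read(PTR.size))
        if index_at == 0:
            return b""
        f.seek(index_at)
        (count,) = LEN.unpack(f.read(LEN.size))
        order = struct.unpack(f"<{count}Q", f.read(8 * count))
        out = bytearray()
        for record_id in order:
            f.seek(record_id)
            (n,) = LEN.unpack(f.read(LEN.size))
            out += f.read(n)
        return bytes(out)
```

Old index copies are left behind as garbage, so a real format would need compaction or the chained lists from the bullet points.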

stefanB
A: 

If you're using .NET 4 and have an editor-like application, try a memory-mapped file; it might be just the ticket. Something like this (I didn't type it into VS, so I'm not sure I got the syntax right):

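// Requires: using System.IO; using System.IO.MemoryMappedFiles;
// The mapping must be large enough to cover every offset written below.
long capacity = 2L * 1024 * 1024 * 1024;
// OpenOrCreate keeps an existing file's contents (FileMode.Create would truncate it).
using (MemoryMappedFile bigFile = MemoryMappedFile.CreateFromFile(
    @"C:\bigfile.dat", FileMode.OpenOrCreate, "BigFileMemMapped",
    capacity, MemoryMappedFileAccess.ReadWrite))
using (MemoryMappedViewAccessor view = bigFile.CreateViewAccessor())
{
    long offset = 1000000000;
    // ObjectType must be a struct; this overwrites the bytes at offset, it does not insert.
    view.Write<ObjectType>(offset, ref MyObject);
}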
Canoehead
+3  A: 

The realistic answer is no. Your only real choices are to rewrite from the point of the modification, or build a more complex format that uses something like an index to tell how to arrange records into their intended order.
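
For reference, the "rewrite from the point of the modification" option can be done in place without loading the whole tail, by shifting it in chunks from the end backwards. A minimal Python sketch, where the name `insert_bytes` and the 1 MiB default chunk size are arbitrary choices of mine:

```python
import os

def insert_bytes(path, offset: int, data: bytes, chunk_size: int = 1 << 20) -> None:
    """Insert data at offset, shifting the tail toward the end chunk by chunk."""
    size = os.path.getsize(path)
    if not 0 <= offset <= size:
        raise ValueError("offset outside file")
    shift = len(data)
    if shift == 0:
        return
    with open(path, "r+b") as f:
        pos = size
        # Copy from the end backwards so no unread byte is overwritten.
        while pos > offset:
            start = max(offset, pos - chunk_size)
            f.seek(start)
            chunk = f.read(pos - start)
            f.seek(start + shift)
            f.write(chunk)
            pos = start
        f.seek(offset)
        f.write(data)
```

The cost is still proportional to the number of bytes after the insertion point, which is exactly why it hurts on gigabyte files.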

From a purely theoretical viewpoint, you could sort of do it under just the right circumstances. Using FAT (for example, but most other file systems have at least some degree of similarity) you could go in and directly manipulate the FAT. The FAT is basically a linked list of clusters that make up a file. You could modify that linked list to add a new cluster in the middle of a file, and then write your new data to that cluster you added.

Please note that I said purely theoretical. Doing this kind of manipulation under a completely unprotected system like MS-DOS would have been difficult but bordering on reasonable. With most newer systems, doing the modification at all would generally be pretty difficult. Most modern file systems are also (considerably) more complex than FAT, which would add further difficulty to the implementation. In theory it's still possible, but in practice it's now thoroughly insane to even contemplate, where it was once almost reasonable.
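
To show what the linked-list splice amounts to, here is a toy in-memory model in Python. It is not real FAT access: `ToyFat`, the `END` marker, and the per-cluster payload list are simplifications I made up, and a real FAT volume uses reserved table values, fixed-size clusters, and a file size stored in the directory entry.

```python
END = -1  # toy end-of-chain marker (real FAT uses reserved table values)

class ToyFat:
    def __init__(self, cluster_count: int):
        self.next = [None] * cluster_count   # None marks a free cluster
        self.data = [b""] * cluster_count

    def allocate(self, payload: bytes) -> int:
        """Claim the first free cluster; raises ValueError if the table is full."""
        cluster = self.next.index(None)
        self.next[cluster] = END
        self.data[cluster] = payload
        return cluster

    def insert_after(self, cluster: int, payload: bytes) -> int:
        """Splice a new cluster into the chain right after `cluster`."""
        new = self.allocate(payload)
        self.next[new] = self.next[cluster]
        self.next[cluster] = new
        return new

    def read_chain(self, first: int) -> bytes:
        """Walk the chain from its first cluster and concatenate the payloads."""
        out, cluster = bytearray(), first
        while cluster != END:
            out += self.data[cluster]
            cluster = self.next[cluster]
        return bytes(out)
```

Only two table entries change per insertion and no existing cluster is rewritten, but the inserted data has to fill whole clusters for the trick to work.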

Jerry Coffin
Direct file system modification in a modern OS is braindead: you'd have to understand how several file systems work (quite a difficult matter) and write a driver with the extended capabilities you'd need, and IFS drivers are black magic even for "normal" driver writers; moreover, you'd tie your application to just a few filesystems. All this for a performance improvement that will often be negligible. And, by the way, if the text inserted in the middle didn't match the cluster size, there would be no performance advantage at all.
Matteo Italia
A: 

Probably the most efficient way to do this (if you really want to do it) is to call ReadFileScatter() to read the chunks before and after the insertion point, insert the new data in the middle of the FILE_SEGMENT_ELEMENT[3] list, and call WriteFileGather(). Yes, this involves moving bytes on disk. But you leave the hard parts to the OS.

MSalters
A: 

I noted both paxdiablo's answer on dealing with other applications, and Matteo Italia's comment on Installable File Systems. That made me realize there's another non-trivial solution.

Using reparse points, you can create a "virtual" file from a base file plus deltas. Any application unaware of this method will see a continuous range of bytes, as the deltas are applied on the fly by a file system filter. For small deltas (total <16 KB), the delta information can be stored in the reparse point itself; larger deltas can be placed in an alternate data stream. Non-trivial, of course.
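
The filter driver itself is far beyond a snippet, but the "base plus deltas, applied on the fly" read path can be sketched in user space. The Python below is only a model of that idea: `VirtualFile` is a hypothetical class, it handles insertions only, and insertion offsets are assumed to refer to positions in the unmodified base.

```python
class VirtualFile:
    """Read-only view of base bytes with insertions applied lazily."""

    def __init__(self, base: bytes, inserts):
        # inserts: (offset_in_base, data) pairs; offsets refer to the unmodified base
        self.pieces = []                       # (source, start, length)
        cursor = 0
        for offset, data in sorted(inserts, key=lambda pair: pair[0]):
            if offset > cursor:
                self.pieces.append((base, cursor, offset - cursor))
            self.pieces.append((data, 0, len(data)))
            cursor = offset
        if cursor < len(base):
            self.pieces.append((base, cursor, len(base) - cursor))

    def read(self, pos: int, n: int) -> bytes:
        """Return up to n bytes of the virtual content starting at pos."""
        out = bytearray()
        for source, start, length in self.pieces:
            if n <= 0:
                break
            if pos >= length:
                pos -= length        # requested range starts after this piece
                continue
            take = min(n, length - pos)
            out += source[start + pos:start + pos + take]
            n -= take
            pos = 0
        return bytes(out)
```

A reader sees one continuous byte range even though the base is never modified, which is the property the filter would have to provide to unaware applications.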

MSalters