I need to hold a representation of a document in memory, and am looking for the most efficient way to do this.

Assumptions

  • The documents can be pretty large, up to 100MB.
  • More often than not the document will remain unchanged (i.e. I don't want to do unnecessary up-front processing).
  • Changes will typically be quite close to each other in the document (i.e. as the user types).
  • It should be possible to apply changes fast (without copying the whole document).
  • Changes will be applied in terms of offsets and new/deleted text (not as line/col).
  • It needs to work in C#.

Current considerations

  • Storing the data as a string. Easy to code, fast to set, very slow to update.
  • Array of Lines, moderately easy to code, slower to set (as we have to parse the string into lines), faster to update (as we can insert/remove lines easily, but finding offsets requires summing line lengths - see the sketch below).
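
For reference, the offset lookup that makes the second option slower to query looks roughly like this (illustrative code only, not something I've settled on):

    // Find the line containing a character offset by summing line lengths
    // from the top - O(number of lines) per lookup.
    static class LineArrayDoc
    {
        public static int FindLineContaining(string[] lines, int offset)
        {
            int pos = 0;
            for (int i = 0; i < lines.Length; i++)
            {
                int lengthWithNewline = lines[i].Length + 1;   // +1 for the stripped '\n'
                if (offset < pos + lengthWithNewline)
                    return i;
                pos += lengthWithNewline;
            }
            return lines.Length - 1;   // offset past the end: clamp to the last line
        }
    }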

There must be a load of standard algorithms for this kind of thing (it's not a million miles from disk allocation and fragmentation).

Thanks for your thoughts.

+1  A: 

I would suggest you take a look at Memory-Mapped Files (MMF).

Some pointers:

http://stackoverflow.com/questions/357170/memory-mapped-files-net

http://msdn.microsoft.com/en-us/library/ms810613.aspx
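
For what it's worth, a minimal sketch of reading and writing through an MMF view, using the System.IO.MemoryMappedFiles API that later versions of .NET ship with (the file name is just an example):

    using System.IO.MemoryMappedFiles;

    // Map the file and touch bytes in place through a view accessor.
    // Note: this lets you overwrite bytes at an offset, but inserting or
    // deleting bytes still means rewriting everything after that point.
    using (var mmf = MemoryMappedFile.CreateFromFile("document.txt"))
    using (var accessor = mmf.CreateViewAccessor())
    {
        byte b = accessor.ReadByte(0);   // read a byte at an offset
        accessor.Write(0, b);            // write it back in place
    }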

Magnus Johansson
The problem with this is that you will have to copy 100 MiB of data every time the user inserts one byte at the beginning of a 100 MiB file.
Daniel Brückner
Well, one of the requirements was "More often than not the document will remain unchanged". Also, you don't have to do the copying yourself. But your argument still holds.
Magnus Johansson
...I'm also a big fan of K.I.S.S.
Magnus Johansson
A: 

From your description it sounds a lot like your document is unformatted text only - so a StringBuilder would do fine.
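
A minimal sketch of that route, assuming edits arrive as an offset, a deleted length and the replacement text (names are mine):

    using System.Text;

    static class SimpleDoc
    {
        // Apply one edit in place. Both calls move the text after the offset,
        // which is why this gets slow for edits near the start of a 100MB document.
        public static void ApplyEdit(StringBuilder doc, int offset, int deleteLength, string newText)
        {
            doc.Remove(offset, deleteLength);
            doc.Insert(offset, newText);
        }
    }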

If it's a formatted document, I would be inclined to use the MS Word APIs or similar and just offload your document processing to them - it will save you an awful lot of time, as document parsing can often be a pain in the a** :-)

I wouldn't get too worried about performance yet - it sounds a lot like you haven't implemented anything yet, so you don't know what performance characteristics the rest of your app has. It may be that you can't actually afford to hold multiple documents in memory at all once you get round to profiling it.

chrisb
A: 

Off the top of my head, I would have thought an indexed linked list would be fairly efficient for this sort of thing unless you have some very long lines.

The linked list would give you an efficient way to store the data and add or remove lines as the user edits. The indexing allows you to quickly jump to a particular point in your file. This sort of idea lends itself well to undo/redo type operations too, as it should be reasonably easy to break edits into small atomic operations.
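
A rough sketch of the idea (all names are mine, not a finished design): a plain linked list of lines plus a coarse index of every Nth node and its running character offset, so an offset lookup can jump close to the target before walking node by node.

    using System.Collections.Generic;

    class LineNode
    {
        public string Text;
        public LineNode Next;
    }

    class IndexEntry
    {
        public LineNode Node;
        public int Offset;                 // character offset of the start of Node
    }

    class IndexedLineList
    {
        public LineNode Head;
        public List<IndexEntry> Index = new List<IndexEntry>();   // rebuilt lazily after edits

        // Find the node containing a character offset.
        public LineNode FindLineAt(int offset)
        {
            LineNode node = Head;
            int pos = 0;
            foreach (var e in Index)       // jump via the coarse index
            {
                if (e.Offset > offset) break;
                node = e.Node;
                pos = e.Offset;
            }
            while (node != null && pos + node.Text.Length <= offset)
            {
                pos += node.Text.Length;   // then walk the few remaining nodes
                node = node.Next;
            }
            return node;
        }
    }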

I'd agree with chrisb's point though - it's probably better to get something simple working first and then see if it really is slow.

Jon Cage
+3  A: 

I would suggest breaking the file into blocks. All blocks have the same length when you load them, but the length of each block may change if the user edits it. This avoids moving 100 megabytes of data if the user inserts one byte at the front.

To manage the blocks, just put them - together with the offset of each block - into a list. If the user changes a block's length, you only have to update the offsets of the blocks after it. To find an offset, you can use binary search.

File size: 100 MiB
Block Size: 16 kiB
Blocks: 6400

Finding an offset using binary search (worst case): 13 steps
Modifying a block (worst case): copy 16384 bytes of data and update 6400 block offsets
Modifying a block (average case): copy 8192 bytes of data and update 3200 block offsets

The 16 kiB block size is just an example - you can balance the costs of the operations by choosing the block size, perhaps based on the file size and the probability of the various operations. Doing some simple math will yield the optimal block size.
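
To make the "simple math" concrete (under the simplifying assumption that copying one byte costs about the same as updating one offset): an average edit costs roughly B/2 bytes copied plus S/(2B) offset updates for a file of size S, and B/2 + S/(2B) is minimised at B = sqrt(S) - about 10 kiB for a 100 MiB file, which is in the same ballpark as the 16 kiB used above.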

Loading will be quite fast, because you load fixed-size blocks, and saving should perform well too, because you only have to write a few thousand blocks and not millions of single lines. You can optimize loading by loading blocks only on demand, and you can optimize saving by only writing the blocks that changed (content or offset).

Finally, the implementation will not be too hard either. You could just use the StringBuilder class to represent a block. But this solution will not work well for very long lines, with lengths comparable to the block size or larger, because you would have to load many blocks and display only a small part of them, with the rest lying to the left or right of the window. I assume you would need a two-dimensional partitioning model in that case.
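
A minimal sketch of the block list, assuming StringBuilder blocks and a parallel list of starting offsets (class and method names are mine):

    using System;
    using System.Collections.Generic;
    using System.Text;

    class BlockedDocument
    {
        const int BlockSize = 16 * 1024;
        readonly List<StringBuilder> blocks = new List<StringBuilder>();
        readonly List<int> offsets = new List<int>();      // starting offset of each block

        public BlockedDocument(string text)
        {
            for (int i = 0; i < text.Length; i += BlockSize)
            {
                offsets.Add(i);
                blocks.Add(new StringBuilder(text.Substring(i, Math.Min(BlockSize, text.Length - i))));
            }
        }

        public void Insert(int offset, string s)
        {
            int b = FindBlock(offset);
            blocks[b].Insert(offset - offsets[b], s);       // touch only one block
            for (int i = b + 1; i < offsets.Count; i++)     // shift the offsets after it
                offsets[i] += s.Length;
        }

        // Binary search over the block start offsets.
        int FindBlock(int offset)
        {
            int i = offsets.BinarySearch(offset);
            return i >= 0 ? i : ~i - 1;
        }
    }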

Daniel Brückner
This makes a lot of sense, and covers pretty much all the requirements. Thanks
Colin
I've coded this up and it works pretty well. I used a linked list rather than a B-tree, as keeping track of the offsets of each block while the B-tree is re-balanced looked like it would take a while to code up. I also cache the original string until it is changed; on the first change it gets broken into blocks, as most of the time I never change the document. The performance with the linked list is fine - it does more work working out the offsets of blocks (as it has to sum each block), but this is acceptable at the moment, and I can always move to a B-tree later if needed, as it's all encapsulated.
Colin
+1  A: 

I'd use a b-tree or skip list of lines, or larger blocks if you aren't going to edit much.

You don't have much extra cost determining line ends on load, since you have to visit each character when loading anyway.

You can move lines within a node without much effort.

The total length of the text in each node is stored in the node, and changes propagated up to parent nodes.

Each line is represented by a data array plus a start index, length and capacity. Line break/carriage return characters aren't put in the data array. Common operations such as breaking lines only require changes to the references into the array; editing a line requires a copy if its capacity is exceeded. A similar structure might be used per line temporarily while editing that line, so you don't perform a copy on each key-press.
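
A rough illustration of that per-line representation (field names are mine): each line is a slice into a character array with some spare capacity, so breaking a line just creates a second slice record and copies nothing.

    static class LineOps
    {
        public class LineSlice
        {
            public char[] Data;      // line text, with no '\r' or '\n' stored
            public int Start;        // index of the line's first character in Data
            public int Length;       // characters currently in the line
            public int Capacity;     // room available before a copy is needed
        }

        // Split a line at column 'col': the second half points into the same
        // array, so no characters move.
        public static LineSlice Split(LineSlice line, int col)
        {
            var second = new LineSlice
            {
                Data = line.Data,
                Start = line.Start + col,
                Length = line.Length - col,
                Capacity = line.Capacity - col
            };
            line.Length = col;
            line.Capacity = col;     // the first half can no longer grow in place
            return second;
        }
    }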

Pete Kirkham
My first cut of this held the data in a B-tree as you described, but the implementation became a bit complex. Re-balancing B-trees is pretty tricky and overkill for what I was after; if I'd had a good base implementation to derive from, this would have been the way to go.
Colin
+4  A: 

Good Math, Bad Math wrote an excellent article about ropes and gap buffers a while ago that details the standard methods for representing text files in a text editor, and even compares them for simplicity of implementation and performance. In a nutshell: a gap buffer - a large character array with an empty section immediately after the current position of the cursor - is your simplest and best bet.
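
A bare-bones gap buffer sketch, in case it helps (my own illustration, not code from the article): the buffer keeps an empty gap at the cursor, so typing fills the gap and moving the cursor shifts characters across it.

    using System;

    class GapBuffer
    {
        char[] buf = new char[1024];
        int gapStart, gapEnd;                  // [gapStart, gapEnd) is the empty gap

        public GapBuffer() { gapEnd = buf.Length; }

        public void MoveCursor(int position)   // O(distance moved)
        {
            while (position < gapStart)        // slide the gap left
                buf[--gapEnd] = buf[--gapStart];
            while (position > gapStart)        // slide the gap right
                buf[gapStart++] = buf[gapEnd++];
        }

        public void Insert(char c)             // O(1) amortised
        {
            if (gapStart == gapEnd) Grow();
            buf[gapStart++] = c;
        }

        public void DeleteBeforeCursor()       // O(1): just widen the gap
        {
            if (gapStart > 0) gapStart--;
        }

        void Grow()                            // double the buffer, keep the tail at the end
        {
            var bigger = new char[buf.Length * 2];
            Array.Copy(buf, 0, bigger, 0, gapStart);
            int tail = buf.Length - gapEnd;
            Array.Copy(buf, gapEnd, bigger, bigger.Length - tail, tail);
            gapEnd = bigger.Length - tail;
            buf = bigger;
        }
    }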

Nick Johnson
+1. Gap buffers are cool: insert/delete/move the cursor 1 character in O(1) time; move the cursor n characters in O(n) time. This is usually a win because when editing or paging through text, most operations are local.
j_random_hacker
I like gap buffers too, but to use them you need one huge contiguous block of address space large enough for the entire file plus all possible insertions. Depending on the system, allocating 100+ MB of contiguous address space could be challenging. I think breaking the file into blocks is going to be essential. A piece table might then fit the bill for the edits. I don't know anything about C# strings or string builders - are they always contiguous?
Adrian McCarthy
100MB of address space, even on a 32-bit architecture, is unlikely to be a problem. For files on the order of a gig it could be an issue - but that will matter less as people switch to 64-bit architectures.
Nick Johnson
A: 

You might find this paper useful --- Data Structures for Text Sequences --- which describes and experimentally analyses a few standard algorithms, and compares [among other things] gap buffers and piece tables.

FWIW, it concludes piece tables are slightly better overall; though net.wisdom seems to prefer gap buffers.
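
For completeness, a skeletal piece table (my own sketch, names are illustrative): the original file is kept read-only, insertions go into an append-only buffer, and the document is a list of pieces pointing into one buffer or the other.

    using System.Collections.Generic;
    using System.Text;

    class PieceTable
    {
        struct Piece
        {
            public bool FromAdd;                // which buffer the piece points into
            public int Start, Length;
        }

        readonly string original;               // never modified after loading
        readonly StringBuilder added = new StringBuilder();
        readonly List<Piece> pieces = new List<Piece>();

        public PieceTable(string text)
        {
            original = text;
            pieces.Add(new Piece { FromAdd = false, Start = 0, Length = text.Length });
        }

        // Insert appends to 'added' and splits the covering piece into
        // (left, new, right); neither text buffer is ever moved.
        public void Insert(int offset, string s)
        {
            var newPiece = new Piece { FromAdd = true, Start = added.Length, Length = s.Length };
            added.Append(s);

            int pos = 0;
            for (int i = 0; i < pieces.Count; i++)
            {
                if (offset <= pos + pieces[i].Length)
                {
                    var p = pieces[i];
                    int split = offset - pos;
                    var left  = new Piece { FromAdd = p.FromAdd, Start = p.Start, Length = split };
                    var right = new Piece { FromAdd = p.FromAdd, Start = p.Start + split, Length = p.Length - split };
                    pieces.RemoveAt(i);
                    if (right.Length > 0) pieces.Insert(i, right);
                    pieces.Insert(i, newPiece);
                    if (left.Length > 0) pieces.Insert(i, left);
                    return;
                }
                pos += pieces[i].Length;
            }
            pieces.Add(newPiece);               // insertion at the very end
        }
    }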

kbs