Which data structure/s is used in implementation of editors like notepad. This data structure should be extensible, and should support various features like edition, deletion, scrolling, selection of range of text etc?
Check out the implementation of Notepad++, you can view the source on SourceForge
The usual thing is to have something like a list or array of arrays of characters. There has been a lot of stuff done on this over the years: you might have a look at this google search.
We wrote an editor for an old machine (keep in mind that this was a while ago, about 1986, so this is from memory, and the state of the art may have advanced somewhat since then) which we managed to get to scream along, performance wise, by using fixed memory blocks from self-managed pools.
It had two pools, each containing a fixed number of specific-sized blocks (one pool was for line structures, the other for line-segment structures). It was basically a linked list of linked lists.
Memory was pre-allocated from a 'malloc()
'-like call, and we used 65,534 (block number 65,535 was the NULL-block) blocks each for lines (384K or 512K for the padded version) and line segments (2G) to handle files up to 65,534 lines and ~1.6G of file, which was pretty big back then). That was the theoretical file size limit - I don't think we ever approached that in reality since we never allocated the full set of line segment structures.
Not having to call malloc()
for every little block of memory gave us a huge speed increase, especially as we could optimize our own memory allocation routines for fixed size blocks (including inlining the calls in the final optimized version).
The structures in the two pools were as follows:
Line structure Line-segment structure
+--------+ +--------+
|NNNNNNNN| |nnnnnnnn|
|NNNNNNNN| |nnnnnnnn|
|PPPPPPPP| |pppppppp|
|PPPPPPPP| |pppppppp|
|bbbbbbbb| |LLLLLLLL|
|bbbbbbbb| |LLLLLLLL|
|........| |xxxxxxxx|
|........| : :
+--------+ |xxxxxxxx|
+--------+
where:
- Lower-case letters point to the line segment pool (except
x
). - Upper-case letters to the line pool.
N
was a block number for the next line (0xffff was NULL for all these).P
the the block number for the previous lineb
was the block number for the first line segment in that line..
was reserved padding (to bump the structure out to 8 bytes).n
was the block number for the next line segment.p
was the block number for the previous line segment.L
was the block number for the segment's line block.x
was the 26 characters in that line segment.
The reason the line structure was padded was to speed up the conversion of block numbers into actual memory locations (shifting left by 3 bits was much faster than multiplying by 6 in that particular architecture and extra memory used was only 128K, minimal compared to the total storage used) although we did provide the slower version for those who cared more about memory.
We also had an array of 100 16-bit values which contained the line segment (and line number so we could quickly go to specific lines) at roughly that percentage (so that array[7] was the line that was roughly 7% into the file) and two free pointers to maintain the free list in each pool (this was a very simple one way list where N
or n
in the structure indicated the next free block and free blocks were allocated from, and put back to, the front of these lists).
There was no need to keep a count of the characters in each line segment since 0-bytes were not valid in files. Each line segment was allowed to have 0-bytes at the end that were totally ignored. Lines were compressed (i.e., line segments were combined) whenever they were modified. This kept block usage low (without infrequent and lengthy garbage collection) and also greatly sped up search-and-replace operations.
The use of these structures allowed very fast editing, insertion, deletion, searching and navigation around the text, which is where you're likely to get most of your performance problems in a simple text editor.
The use of selections (we didn't implement this as it was a text mode editor that used vi-like commands [e.g., 3d to delete 3 lines, 6x to delete 6 characters]) could be implemented by having a (line#,char-pos) tuple to mark positions in the text, and use two of those tuples for a selection range.
Wikipedia says many editors use a Gap Buffer. It is basically an array with an unused space in the middle. The cursor sits just before the gap, so deletion and insertion at the cursor is O(1). It should be pretty easy to implement.
Looking at the source code of Notepad++ (as Chris Ballance suggested in this thread here) shows that they also use a gap buffer. You could get some implementation ideas from that.
There is an excellent article about Piece Chains by James Brown, author of HexEdit.
In a nutshell: Piece chains allow you to record the changes made to the text. After loading, you have a piece chain that spans the entire text. Now you insert somewhere in the middle.
Instead of allocating a new buffer, copying the text around, etc., you create two new pieces and modify the existing one: The existing one now contains the text up to the insertion point (i.e. you just change the length of the piece), then you have a piece with the new text and after that a new piece with all the text after the insertion. The original text is left unchanged.
For undo/redo, you simple remember which pieces you added/removed/changed.
The most complex area when using piece chains is that there is no longer a 1:1 mapping between an offset in the visible text and the memory structure. You either have to search the chain or you must maintain a binary tree structure of some kind.