+7  Q: 

Text editor theory

As I'm always dissatisfied with existing editors, a project I always wanted to start is my own text editor. However doing text editing is serious business.

Besides analyzing the source code of existing text editors, is there any book or other resource (like academic work) about this topic? I'm interested especially in something that teaches how to handle memory and how to manage text insertion (if you have a 100 MB file and want to add a char at x position, you can't just memove the huge text block...).


how to manage text insertion (if you have a 100 MB file and want to add a char at x position, you can't just memove the huge text block...).

Make the text file a linked list, where every line is an entry.

Bart van Heukelom
That's good until you have really long lines :P
@Earlz: Then split those lines and make each its own linked list, tree or other appropriate collection.
Bart van Heukelom
No, that's not good if you have text lines of 2 MBs... or even a single huge text of line without end lines...
@Lorenzo See my previous comment
Bart van Heukelom
Actually, *Software Tools* (or maybe the *In Pascal* version) talks a bit about this idea, noting that they tried but found it unworkable.
Jerry Coffin
+3  A: 

Over the years I've written quite a number of different text editors. Certainly the simplest way is to manage a long sequence of characters, where you copy everything around to insert any character. Other techniques that I've used include:

  • Represent the text file as a doubly linked list of text lines.
  • Construct a tree-like data structure (sometimes called a "rope") which starts off as a solid string of characters, but has the ability to split, insert, and delete blocks of text without having to move all the rest of the text around.

Many of the old Borland example books used a text editor as a tutorial example. You can occasionally still find copies of these at used bookstores for nearly free.

Greg Hewgill
+1  A: 

Well if you know that in general people will have relatively few insert points, you could hold an array of pointers into your original text buffer and when the user tries to insert inside it, you "split" the buffer by spawning another pointer to the rest of the buffer, making the first pointer's length so it stops at the insertion point and adding a 3rd pointer for the text to be inserted inbetween, kinda like:

long original text la la la
^                *^
|                 2nd part
1st part

And the star points into a new text buffer where you start adding the text to be inserted.

When you render (or in general parse) your text file, you loop over the array of pointers and then do your work on each buffer. Of course if the buffer is small enough, you skip the adding a new buffer part, but that's just heuristics, try each and get a feel for which works best.

You could also consider splitting the text file on load into multiple buffers say every 1MB or so, because if you load the file in a single buffer you WILL need to make a new buffer for inserted text due to the size. Again, this is a heuristic optimization.

Dividing text in chunks was actually my first idea. I thought about using memory page sized chunks to avoid memory fragmentation. However this seems too trivial. Aren't available more powerful approaches, to handle situations like huge text moves?
If you do it like this you don't need to MOVE anything. That's the thing you're trying to avoid, moving stuff in memory (unless of course the user dragged text in another position, but that's rare). For normal insertion, you only add text to the end of a buffer.
+3  A: 

Take a look at Rob Pike's description of his Sam text editor. Be sure to browse past the high-level overview and command language. He describes parsing, memory management, and data structures later in the paper.

In addition, take a look at Russ Cox's simple regular expression implementation. It's easy to follow and may open some doors outside existing regular expression libraries.

Corbin March
That seems nice! Thanks
+2  A: 

Promoted to answer by request:

The antique "Software Tools in Pascal" by Kernighan and Plaugher implements the ed editor in a language with neither real strings nor pointers. It contains a great overview of the design considerations that underlie any text editor.

Maybe there have been some improvements in the theory in the meantime, but it is worth reading for sure!
+2  A: 

One old method that still works is called a split buffer. The basic idea is that you put the text into a buffer, but instead of putting it all in one block, you "split" it at the cursor, putting all the text before the cursor at the beginning of the buffer, and all the text after the cursor at the end of the buffer. Most insertions take place at the cursor, which you can do without moving anything (until or unless you overflow the buffer). When the user moves the cursor, you move the appropriate text from one side of the split to the other.

Given typical controls (cursor left, right, up, down, page up, page down), the largest move you typically deal with is a page at a time, which is typically easy to handle quite a bit faster than a keyboard repeats. It can, of course, slow down a bit if you have a truly huge file and a "goto line" command, or something on that order. If you're going to do a lot of that, there are undoubtedly better structures to use...

Jerry Coffin
+1  A: 

The Scintilla component uses a split buffer, along the theory explained in a text linked in their Scintilla and SciTE Related Sites page.
The linked page is Data Structures in a Bit-Mapped Text Editor.
The split buffer proved it works well even with megabyte files. Using secondary structures (eg. a list of line starts) can help too.