views:

157

answers:

5

This question is probably going to make me sound pretty clueless. That's because I am.

I'm just thinking, if I were hypothetically interested in designing my own text editor GUI control, widget, or whatever you want to call it (which I'm not), how would I even do it?

The temptation to a novice such as myself would be to store the content of the text editor in the form of a string, which seems quite costly (not that I'm too familiar with how string implementations differ between one language/platform and the next; but I know that in .NET, for example, they're immutable, so frequent manipulation such as what you'd need to support in a text editor would be magnificently wasteful, constructing one string instance after another in very rapid succession).

Presumably some mutable data structure containing text is used instead; but figuring out what this structure might look like strikes me as a bit of a challenge. Random access would be good (I would think, anyway—after all, don't you want the user to be able to jump around to anywhere in the text?), but then I wonder about the cost of, say, navigating to somewhere in the middle of a huge document and starting to type immediately. Again, the novice approach (say you store the text as a resizeable array of characters) would lead to very poor performance, I'm thinking, as with every character typed by the user there would be a huge amount of data to "shift" over.

So if I had to make a guess, I'd suppose that text editors employ some sort of structure that breaks the text down into smaller pieces (lines, maybe?), which individually comprise character arrays with random access, and which are themselves randomly accessible as discrete chunks. Even that seems like it must be a rather monstrous oversimplification, though, if it is even remotely close to begin with.

Of course I also realize that there may not be a "standard" way that text editors are implemented; maybe it varies dramatically from one editor to another. But I figured, since it's clearly a problem that's been tackled many, many times, perhaps a relatively common approach has surfaced over the years.

Anyway, I'm just interested to know if anyone out there has some knowledge on this topic. Like I said, I'm definitely not looking to write my own text editor; I'm just curious.

+13  A: 

One technique that's common (especially in older editors) is called a split buffer. Basically, you "break" the text into everything before the cursor and everything after the cursor. Everything before goes at the beginning of the buffer. Everything after goes at the end of the buffer.

When the user types in text, it goes into the empty space in between without moving any data. When the user moves the cursor, you move the appropriate amount of text from one side of the "break" to the other. Typically there's a lot of moving around a single area, so you're usually only moving small amounts of text at a time. The biggest exception is if you have a "go to line xxx" kind of capability.

Charles Crowley has written a much more complete discussion of the topic. You might also want to look at The Craft of Text Editing, which covers split buffers (and other possibilities) in much greater depth.

Jerry Coffin
a.k.a. "gap buffer"
Boojum
No question on this topic should be without a link to Crowley's book.
dmckee
Took me a second to get what you're saying (so there's basically a sizable gap between the ends of a big block, and when you insert text it goes in the middle, to be pushed to one side or the other depending on where you click next). That's pretty clever! It does seem like a pretty significant limitation that this design relies on you not going back and forth between faraway points in the text (and inserting), though, doesn't it? Then again, I suppose if it covers the use case that's considerably more common, that's definitely a strength.
Dan Tao
+3  A: 

For some resources, you could read how do text editors work?

jarmod
+1  A: 

Depending on the amount of text that needs to be in the editor at one time, a one string for the entire buffer approach would probably be fine. I think Notepad does this-- ever notice how much slower it gets to insert text in a large file?

Having one string per line in a hash table seems like a good compromise. It would make navigation to a particular line and delete/paste efficient without much complexity.

If you want to implement an undo function, you'll want a representation that allows you to go back to previous versions without storing 30 copies of the entire file for 30 changes, although again that would probably be fine if the file was sufficiently small.

Chris Smith
I doubt a string per line approach is feasible. That is more a rendering problem than a data modelling problem. It would very cumbersome if you want more advanced features like document width formatting, font size etc.
willcodejavaforfood
Yeah, if there were a lot of layout and formatting requirements you would want a more sophisticated layout element than lines. It sounded like a basic plain text editor though.
Chris Smith
@willcodejavaforfood: My own text editor uses one string per line. And by line I mean newlines ("\n" or "\r" or "\r\n") not physical lines on the screen. This takes advantage that almost all known human readable text have frequent newlines. This way, I only need to deal with small strings within a large array (or in my case, a list).
slebetman
+1  A: 

The simplest way would be to use some kind of string buffer class provided by the language. Even a simple array of char objects would do at a pinch.

Appending, replacing and seeking text are then relatively quick. Other operations are potentially more time-consuming, of course, with insertion of a sequence of characters at the start of the buffer being one of the more expensive actions.

However, this may be perfectly acceptable performance-wise for a simple use case.

If the cost of insertions and deletions is particularly significant, I'd be tempted to optimise by creating a buffer wrapper class that internally maintained a list of buffer objects. Any action (except simple replacement) that didn't occur at the tail of an existing buffer would result in the buffer concerned being split at the relevant point, so the buffer could be modified at its tail. However, the outer wrapper would maintain the same interface as a simple buffer, so that I didn't have to rewrite e.g. my search action.

Of course, this simple approach would quickly end up with an extremely fragmented buffer, and I'd consider having some kind of rule to coalesce the buffers when appropriate, or to defer splitting a buffer in the case of e.g. a single character insertion. Maybe the rule would be that I'd only ever have at most 2 internal buffers, and I'd coalesce them before creating a new one – or when something asked me for a view of the whole buffer at once. Not sure.

Point is, I'd start simple but access the mutable buffer through a carefully chosen interface, and play with the internal implementation if profiling showed me I needed to.

However, I definitely wouldn't start with immutable String objects!

Bill Michell
Yeah, I agree that a simple approach like this (well, what you describe in your first few paragraphs) would probably be acceptable for small files, with the advantage of being extremely easy to implement. It almost seems to me that if you built a text editor like this, it would be your duty to make it refuse to open files above a certain size.
Dan Tao
+1  A: 

A while back, I wrote my own text editor in Tcl (actually, I stole the code from somewhere and extended it beyond recognition, ah the wonders of open source).

As you mentioned, doing string operations on very, very large strings can be expensive. So the editor splits the text into smaller strings at each newline ("\n" or "\r" or "\r\n"). So all I'm left with is editing small strings at line level and doing list operations when moving between lines.

The other advantage of this is that it is a simple and natural concept to work with. My mind already considers each line of text to be separate reinforced by years of programming where newlines are stylistically or syntactically significant.

It also helps that the use case for my text editor is as a programmers editor. For example, I implemented syntax hilighting but not word/line wrap. So in my case there is a 1:1 map between newlines in text and the lines drawn on screen.

In case you want to have a look, here's the source code for my editor: http://wiki.tcl.tk/16056

It's not a toy BTW. I use it daily as my standard console text editor unless the file is too large to fit in RAM (Seriously, what text file is? Even novels, which are typically 4 to 5 MB, fit in RAM. I've only seen log files grow to hundreds of MB).

slebetman