A good text editor should be useful for all kinds of work a programmer might do, and that includes opening files that may be several gigabytes in size. I would therefore not recommend a mindset where everything is buffered in RAM.
I would recommend setting up a search tree of slices representing the file, where a single slice may be:
- A reference to a range of bytes in the actual file on disk, or
- A reference to an edited "page".
When you open a file you start by inserting a single item into the tree, which is simply a range representing the whole file, e.g. for a 10-MiB file:
std::map<size_t, slice_info> slices;
slices[0].size = 10*1024*1024;
When the user edits the file, create a "page" of some reasonable size, say 4 KiB, around the edit point, and split the tree at that point. In the example, the edit point is at 5 MiB:
size_t const PAGE_SIZE = 4*1024;
slices[0].size = 5*1024*1024;  // untouched head, still read from the file
slices[5*1024*1024].size = PAGE_SIZE;  // the editable page
slices[5*1024*1024].buffer = create_buffer(file, 5*1024*1024, PAGE_SIZE);
slices[5*1024*1024 + PAGE_SIZE].size = 5*1024*1024 - PAGE_SIZE;  // untouched tail
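The same split can be wrapped in a small helper that works for any edit offset. This is only a minimal sketch under my own assumptions: the fields of `slice_info` and the functions `find_slice` and `make_page` are hypothetical names, and the page is a zero-filled `std::vector<char>` rather than a real copy of the file bytes or a memory-mapped temp file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <memory>
#include <vector>

// Hypothetical slice record: a null buffer means "read these bytes from the file".
struct slice_info {
    std::size_t size = 0;
    std::shared_ptr<std::vector<char>> buffer;  // editable page, or null
};

// Key = offset of the slice's first byte; slices are assumed to be contiguous.
using slice_map = std::map<std::size_t, slice_info>;

// Find the slice containing byte `offset` (assumes offset < total size).
slice_map::iterator find_slice(slice_map& slices, std::size_t offset) {
    auto it = slices.upper_bound(offset);  // first slice that starts after offset
    assert(it != slices.begin());
    return std::prev(it);
}

// Carve an editable page of up to `page_size` bytes, starting at `offset`,
// out of a file-backed slice. Returns the slice that now holds the page.
slice_map::iterator make_page(slice_map& slices, std::size_t offset,
                              std::size_t page_size) {
    auto it = find_slice(slices, offset);
    if (it->second.buffer) return it;  // already an editable page

    const std::size_t start = it->first;
    const std::size_t end = start + it->second.size;
    const std::size_t page_end = std::min(offset + page_size, end);

    if (page_end < end)                // untouched tail stays file-backed
        slices[page_end].size = end - page_end;
    if (offset > start) {              // untouched head stays file-backed
        it->second.size = offset - start;
        it = slices.emplace(offset, slice_info{}).first;
    }
    it->second.size = page_end - offset;
    // A real editor would copy the file bytes here; the sketch only allocates.
    it->second.buffer = std::make_shared<std::vector<char>>(it->second.size);
    return it;
}
```

Calling `make_page(slices, 5*1024*1024, 4*1024)` on the single-slice map from the example should leave the same three slices as the hand-written version. Inserting into a `std::map` does not invalidate existing iterators, which is what lets the helper keep using `it` after adding the tail.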
You can use memory-mapped files both for the read-only buffer (the source file) and for the copied editable buffers (the latter would be placed in a temp directory). This also allows recovery should the editor crash.
Using fixed-size pages will reduce fragmentation of the memory heap a lot since all blocks have the same size, and inserting text will never require moving more than 4 KiB of data ahead of the insertion point.
This is a simplified description to give the general idea without getting into too many gritty details. A real implementation would most likely need to be more sophisticated, e.g. allow for a variable amount of data in a page to cope with pages that overflow, and merge together many small slices so that running a regex substitution across a large file does not create too many small buffers. There probably needs to be a limit for the number of slices you should have in the tree simultaneously, but a key point is that when you start inserting somewhere you should make sure that you are working with a slice that isn't too big.
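To illustrate the "variable amount of data in a page" idea, here is a rough sketch of an insert that splits a page when it overflows. It is not a definitive design: `PAGE_CAPACITY`, `page_list`, and `insert_text` are hypothetical names, and the pages live in a plain `std::list<std::string>` instead of the offset-keyed tree, purely to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>

// Hypothetical page: holds at most PAGE_CAPACITY bytes, but may be partly empty.
constexpr std::size_t PAGE_CAPACITY = 4 * 1024;

using page_list = std::list<std::string>;

// Insert `text` at position `pos` inside the page `it`. While a page is
// over capacity, keep its first half-capacity bytes and move the rest into
// a new page placed right after it, then continue with that new page.
void insert_text(page_list& pages, page_list::iterator it, std::size_t pos,
                 const std::string& text) {
    assert(pos <= it->size());
    it->insert(pos, text);
    while (it->size() > PAGE_CAPACITY) {
        const std::size_t keep = PAGE_CAPACITY / 2;
        auto next = pages.insert(std::next(it), it->substr(keep));
        it->resize(keep);
        it = next;
    }
}
```

Leaving split pages half full is a design choice in this sketch: it leaves headroom so that the next few insertions nearby do not trigger another split immediately. Merging runs of small pages back together would be the complementary operation.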
For regex, I don't think the performance is much of a problem as long as the whole editor doesn't hang while running it. Try Boost.Regex; it will most likely be fast enough for your needs. Its algorithms are templated on bidirectional iterators, so it is also generic enough to plug in any buffering strategy you need.
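As a rough illustration of plugging a buffering strategy into a regex engine, the sketch below searches across chunk boundaries through a custom bidirectional iterator. Note that it uses `std::regex` rather than Boost.Regex so that it builds with only the standard library; the iterator-based interface is similar. `chunk_list`, `chunk_iterator`, and `find_first` are hypothetical names, and the chunks are assumed to be non-empty strings standing in for slices.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Stand-in for the slices; every chunk is assumed to be non-empty.
using chunk_list = std::vector<std::string>;

// Bidirectional iterator that walks the characters of all chunks in order.
class chunk_iterator {
public:
    using iterator_category = std::bidirectional_iterator_tag;
    using value_type = char;
    using difference_type = std::ptrdiff_t;
    using pointer = const char*;
    using reference = const char&;

    chunk_iterator() = default;
    chunk_iterator(const chunk_list* chunks, std::size_t chunk, std::size_t pos)
        : chunks_(chunks), chunk_(chunk), pos_(pos) {}

    reference operator*() const {
        assert(chunks_ && chunk_ < chunks_->size());
        return (*chunks_)[chunk_][pos_];
    }
    chunk_iterator& operator++() {
        // Step to the next chunk when the current one is exhausted.
        if (++pos_ == (*chunks_)[chunk_].size()) { ++chunk_; pos_ = 0; }
        return *this;
    }
    chunk_iterator operator++(int) { auto old = *this; ++*this; return old; }
    chunk_iterator& operator--() {
        // Step back to the end of the previous chunk when at a chunk start.
        if (pos_ == 0) { --chunk_; pos_ = (*chunks_)[chunk_].size(); }
        --pos_;
        return *this;
    }
    chunk_iterator operator--(int) { auto old = *this; --*this; return old; }

    friend bool operator==(const chunk_iterator& a, const chunk_iterator& b) {
        return a.chunks_ == b.chunks_ && a.chunk_ == b.chunk_ && a.pos_ == b.pos_;
    }
    friend bool operator!=(const chunk_iterator& a, const chunk_iterator& b) {
        return !(a == b);
    }

private:
    const chunk_list* chunks_ = nullptr;
    std::size_t chunk_ = 0;  // index of the current chunk; size() means "end"
    std::size_t pos_ = 0;    // position inside the current chunk
};

// Return the text of the first match, or an empty string if there is none.
std::string find_first(const chunk_list& chunks, const std::regex& re) {
    chunk_iterator first(&chunks, 0, 0), last(&chunks, chunks.size(), 0);
    std::match_results<chunk_iterator> m;
    if (!std::regex_search(first, last, m, re)) return {};
    return m.str();
}
```

With chunks `{"hello wo", "rld, 12", "345 end"}`, a search for `[0-9]+` should find a match that spans the second and third chunk, without the text ever being copied into one contiguous buffer.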
The same applies to syntax highlighting: if you run it in the background, it won't disturb the user much while they are typing. You can use the slice approach to your benefit here:
- Each slice can have a mutex that can be locked during an editing operation, allowing syntax highlighting or "intellisense" type analysis to run in a background thread.
- You can store the state of the syntax highlighting engine so that whenever you make edits in a slice you can restart the syntax highlighting from the beginning of that slice, rather than from the beginning of the file.
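The second point can be sketched with a toy lexer whose only state is "inside a `/* ... */` comment or not". Everything here is an assumption made for illustration: `lex_state`, `hl_slice`, `scan`, and `rehighlight` are hypothetical names, the slices hold their text directly, and comment delimiters that straddle a slice boundary are deliberately ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy lexer state: are we inside a /* ... */ comment?
enum class lex_state { code, comment };

struct hl_slice {
    std::string text;
    lex_state entry = lex_state::code;  // lexer state at the slice's first byte
};

// Scan one slice starting in state `s` and return the state at its end.
// Limitation of the sketch: a delimiter split across two slices is missed.
lex_state scan(const std::string& text, lex_state s) {
    for (std::size_t i = 0; i + 1 < text.size(); ++i) {
        if (s == lex_state::code && text[i] == '/' && text[i + 1] == '*') {
            s = lex_state::comment;
            ++i;  // skip the second delimiter character
        } else if (s == lex_state::comment && text[i] == '*' && text[i + 1] == '/') {
            s = lex_state::code;
            ++i;
        }
    }
    return s;
}

// Re-highlight after an edit in slice `edited`, starting from that slice's
// stored entry state. Stops as soon as a following slice's stored entry
// state is unchanged. Returns the number of slices that were rescanned.
std::size_t rehighlight(std::vector<hl_slice>& slices, std::size_t edited) {
    assert(edited < slices.size());
    std::size_t scanned = 0;
    lex_state s = slices[edited].entry;
    for (std::size_t i = edited; i < slices.size(); ++i) {
        s = scan(slices[i].text, s);
        ++scanned;
        if (i + 1 == slices.size() || slices[i + 1].entry == s) break;
        slices[i + 1].entry = s;  // state changed, so keep propagating
    }
    return scanned;
}
```

In the common case an edit does not change the state at the end of its slice, so only that one slice is rescanned. An edit that opens a comment propagates forward only until the stored states agree again.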
I am not aware of any freestanding syntax highlighting engines, but they are usually based on regex substitution (see e.g. the syntax highlighting files in vim).