When writing a custom string class from scratch that stores UTF-8 internally (to save memory) rather than UTF-16, is it feasible to cache, at least to some extent, the relationship between byte offsets and character offsets, to improve performance when applications use the class for random access?
Does Perl do this kind of caching of the character-offset-to-byte-offset relationship? How do Python strings work internally?
What about Objective-C and Java? Do they use UTF-8 internally?
EDIT
Found this reference to Perl 5 using UTF-8 internally:
"$flag = utf8::is_utf8(STRING)
(Since Perl 5.8.1) Test whether STRING is in UTF-8 internally. Functionally the same as Encode::is_utf8()."
On page
http://perldoc.perl.org/utf8.html
EDIT
In the applications I have in mind, the strings contain 1-2 KB XML stanzas from an XMPP stream. I expect about 1% of the messages to have up to 50% (by character count) Unicode values > 127 (this is XML). On the servers, the messages are rule-checked and routed conditionally based on a small subset of fields (small in terms of character volume). The servers are Wintel boxes operating in a farm. On the clients, the data comes from and is fed into UI toolkits.
EDIT
But the app will inevitably evolve to want some random access too. Can the performance hit be minimised when this happens? I was also interested in whether a more general class design exists that, for example, manages B-trees of character offset <-> byte offset relationships for big UTF-8 strings (or some other algorithm found to be efficient in the general case).
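To make the idea concrete, here is a minimal sketch (in Python, just for illustration) of the kind of caching I mean: store a byte-offset checkpoint every N characters at construction time, so a random-access lookup only decodes forward from the nearest checkpoint instead of from the start of the buffer. The class name `Utf8String`, the method `char_to_byte`, and the checkpoint interval are all made up for this sketch, not taken from any existing library.

```python
class Utf8String:
    """UTF-8 buffer with a character-offset -> byte-offset checkpoint cache."""

    CHECKPOINT = 64  # record a byte offset every 64 characters (tunable)

    def __init__(self, text: str):
        self._buf = text.encode("utf-8")
        # _checkpoints[k] = byte offset of character k * CHECKPOINT
        self._checkpoints = [0]
        byte = 0
        for i, ch in enumerate(text):
            if i and i % self.CHECKPOINT == 0:
                self._checkpoints.append(byte)
            byte += len(ch.encode("utf-8"))

    def char_to_byte(self, index: int) -> int:
        """Byte offset of character `index`, in O(CHECKPOINT) worst case."""
        cp = index // self.CHECKPOINT
        byte = self._checkpoints[cp]
        # Decode forward from the checkpoint, skipping UTF-8 continuation
        # bytes (those of the form 10xxxxxx).
        for _ in range(index - cp * self.CHECKPOINT):
            byte += 1
            while byte < len(self._buf) and (self._buf[byte] & 0xC0) == 0x80:
                byte += 1
        return byte
```

A B-tree (or a plain sorted array of checkpoints, as here) makes the structure cheap to update for append-heavy workloads; for strings of only 1-2 KB the flat array is probably already enough.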