views:

250

answers:

3

When writing a custom string class from scratch that stores UTF-8 internally (to save memory) rather than UTF-16, is it feasible to cache, at least to some extent, the relationship between byte offset and character offset, to improve performance when applications use the class for random access?

Does Perl do this kind of caching of the character-offset-to-byte-offset relationship? How do Python strings work internally?

What about Objective-C and Java? Do they use UTF-8 internally?

EDIT

Found this reference to Perl 5 using UTF-8 internally:

"$flag = utf8::is_utf8(STRING)

(Since Perl 5.8.1) Test whether STRING is in UTF-8 internally. Functionally the same as Encode::is_utf8()."

On page

http://perldoc.perl.org/utf8.html

EDIT

In the applications I have in mind, the strings are 1-2K XML stanzas in an XMPP stream. I expect about 1% of the messages to have up to 50% (by character count) of Unicode values > 127 (this is XML). On the servers, the messages are rule-checked and routed conditionally on a small subset of fields (small in terms of character volume). The servers are Wintel boxes operating in a farm. On the clients, the data comes from and is fed into UI toolkits.

EDIT

But the app will inevitably evolve and want to do some random access too. Can the performance hit when this happens be minimised? I was also interested in whether a more general class design exists that, for example, manages B-trees of character-offset <-> byte-offset relationships for big UTF-8 strings (or some other algorithm found to be efficient in the general case).

+1  A: 

I think the answer is: in general, it's not really worth trying to do this. In your specific case, maybe.

If most of your characters are plain ASCII and you rarely have multi-byte UTF-8 sequences, then it might be worth building some kind of sparse data structure with the offsets.

In the general case, every single character might be non-ASCII and you might have many, many offsets to store. Really, the most general approach would be to keep a byte array exactly as long as your string of Unicode characters, with each byte holding the distance in bytes from that character to the start of the next one. But this means one whole byte per character, and thus a net savings of only about one byte per Unicode character; probably not worth the effort. It also means that indexing into your string is now an O(n) operation, as you run through these offsets and sum them to find the actual byte index.
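
A minimal sketch in Java (purely to illustrate; the question isn't tied to one language) of why that per-character byte table makes indexing O(n): you have to sum the lengths of every character before the one you want.

    // Hypothetical sketch: one byte per character, each holding that character's
    // length in UTF-8 bytes. Finding the byte offset of character i means summing
    // the first i lengths, so random access is O(n).
    static int byteOffsetOf(byte[] charByteLengths, int charIndex) {
        int byteOffset = 0;
        for (int i = 0; i < charIndex; i++) {
            byteOffset += charByteLengths[i] & 0xFF; // lengths are 1..4 for valid UTF-8
        }
        return byteOffset;
    }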

If you do want to try the sparse data structure, I suggest an array of pairs of values: the first being the index of a character within the Unicode string, and the second being the index within the byte sequence where that character actually starts. After each multi-byte UTF-8 sequence you would append such a pair, recording where the next character begins. Then, given the index of a Unicode character, your code can binary-search this array for the highest entry whose character index does not exceed the requested one, and use it to locate the byte at which the desired character starts.
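
To make that concrete, here is a rough sketch in Java (the class name Utf8Index and its methods are mine, just for illustration, not from any library) of such a pair array plus binary search; it records a checkpoint after every multi-byte sequence and walks forward from the nearest checkpoint:

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the sparse (charIndex, byteIndex) pair array described above.
    final class Utf8Index {
        private final byte[] utf8;                            // the string, stored as UTF-8
        private final List<int[]> pairs = new ArrayList<>();  // {charIndex, byteIndex} checkpoints

        Utf8Index(String s) {
            this.utf8 = s.getBytes(StandardCharsets.UTF_8);
            int charIndex = 0;
            for (int byteIndex = 0; byteIndex < utf8.length; ) {
                int len = sequenceLength(utf8[byteIndex]);
                if (len > 1) {
                    // Record a checkpoint after each multi-byte sequence:
                    // where the *next* character starts.
                    pairs.add(new int[] { charIndex + 1, byteIndex + len });
                }
                byteIndex += len;
                charIndex++;
            }
        }

        // Byte offset where the charIndex-th character (counted in code points) begins.
        int byteOffsetOf(int charIndex) {
            int lo = 0, hi = pairs.size() - 1, bestChar = 0, bestByte = 0;
            while (lo <= hi) {  // binary search: highest checkpoint whose char index <= charIndex
                int mid = (lo + hi) >>> 1;
                int[] p = pairs.get(mid);
                if (p[0] <= charIndex) { bestChar = p[0]; bestByte = p[1]; lo = mid + 1; }
                else hi = mid - 1;
            }
            int byteIndex = bestByte;
            for (int c = bestChar; c < charIndex; c++) {   // walk the remaining characters
                byteIndex += sequenceLength(utf8[byteIndex]);
            }
            return byteIndex;
        }

        // Length of a UTF-8 sequence, judged from its lead byte (assumes valid UTF-8).
        private static int sequenceLength(byte lead) {
            int b = lead & 0xFF;
            if (b < 0x80) return 1;
            if (b < 0xE0) return 2;   // 0xC0..0xDF
            if (b < 0xF0) return 3;   // 0xE0..0xEF
            return 4;                 // 0xF0..0xF4
        }
    }

With mostly-ASCII data such as typical XML, the checkpoint array stays small, which is exactly the case where this scheme pays off.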

If you need to save memory, you might want to consider using a data compression library. Slurp in the Unicode strings as full Unicode, then compress them; to index into a string, first decompress it. This will really save memory, and it is easy and quick to get the code working correctly; but it may add too much CPU overhead to be reasonable.
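
If you wanted to prototype that in Java, java.util.zip's Deflater and Inflater are one way to try it; a minimal sketch (the helper names are mine, and a real class would cache the decompressed form rather than inflate on every access):

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    final class CompressedString {
        // Store the string compressed, as deflated UTF-16BE bytes.
        static byte[] compress(String s) {
            byte[] input = s.getBytes(StandardCharsets.UTF_16BE);
            Deflater d = new Deflater();
            d.setInput(input);
            d.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!d.finished()) {
                out.write(buf, 0, d.deflate(buf));
            }
            d.end();
            return out.toByteArray();
        }

        static String decompress(byte[] data) throws DataFormatException {
            Inflater inf = new Inflater();
            inf.setInput(data);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inf.finished()) {
                out.write(buf, 0, inf.inflate(buf));
            }
            inf.end();
            return new String(out.toByteArray(), StandardCharsets.UTF_16BE);
        }

        // Indexing costs a full decompression here, which is where the CPU overhead comes in.
        static char charAt(byte[] compressed, int index) throws DataFormatException {
            return decompress(compressed).charAt(index);
        }
    }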

steveha
OTOH it might be worthwhile for a string with a lot of English ASCII metadata, as XML often has, for example.
martinr
+1  A: 

Java's strings are UTF-16 internally:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.

java.lang.String
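
A small demonstration of what that means in practice: length() counts UTF-16 code units, so a character outside the Basic Multilingual Plane takes two positions, and you need the codePoint* methods to count or read actual characters:

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP,
            // so it is stored as a surrogate pair in UTF-16.
            String s = "A\uD834\uDD1E";   // "A" followed by the G clef

            System.out.println(s.length());                        // 3 -- char (UTF-16 code unit) count
            System.out.println(s.codePointCount(0, s.length()));   // 2 -- actual characters
            System.out.println(Integer.toHexString(s.codePointAt(1))); // 1d11e
            System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        }
    }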

Daniel Yankowsky
This is a Perl question, not a Java one.
J-16 SDiZ
The original poster asked: "What about Objective-C and Java? Do they use UTF-8 internally?" This question is tagged with 4 different languages. This is not just a Perl question.
Daniel Yankowsky
+2  A: 

Perl distinguishes between Unicode and non-Unicode strings. Unicode strings are implemented using UTF-8 internally. Non-Unicode does not necessarily mean 7-bit ASCII, though: it could be any character that can be represented as a single byte in the current locale.

Dan
Hmm... Anyone care to elaborate...? What optimisation does Perl have, if any?
martinr
OK, having looked at the Perl C internals, it appears that Perl does have the concept of cursors for accessing UTF-8 strings, and of measuring the number of characters between two cursors by inspection. These cursors are of course C pointers in the Perl C code. There seems to be no mention of any means of access into the UTF-8 other than pointers. So in conclusion, I see no reason to think from the Perl code that a Perl-esque UTF-8 string wrapper class in C# or Java could offer higher performance than a byte array and a byte index. If no one else has anything to add, I'm done here, I think.
martinr
The bit of Perl internals I looked at was just utf8.c actually. That's good enough for me, others may want to hunt further. :)
martinr