String class based on graphemes? | ansaurus

Q:

String class based on graphemes?

I'm wondering why we don't have string classes that represent a string as a sequence of Unicode grapheme clusters instead of code points or characters. It seems to me that in most applications it would be easier for programmers to access the components of a grapheme when necessary than to assemble graphemes from code points, which appears to be necessary (at least in theory) if only to avoid accidentally breaking a string in "mid-grapheme". Internally, such a string class could use a variable-length encoding such as UTF-8 or UTF-16 (in this context even UTF-32 is variable length, since one grapheme can span several code points), or it could implement subclasses for all of them, optionally letting the choice be configured at run time so that different languages could use their optimal encodings. But if programmers could "see" grapheme units when inspecting a string, wouldn't string-handling code in general come closer to being correct, without much extra complexity?
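To make the "mid-grapheme" concern concrete, here is a minimal Python sketch. The helper `simple_clusters` is hypothetical and deliberately simplified: it only attaches combining marks to the preceding base code point, whereas full grapheme segmentation (Unicode UAX #29) also covers Hangul syllables, emoji sequences, and other cases.

```python
import unicodedata

def simple_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Assumption: this is a rough approximation of grapheme clusters, not the
    full UAX #29 algorithm.
    """
    clusters = []
    for ch in text:
        # A nonzero canonical combining class marks a combining character.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "café" written with a decomposed e + COMBINING ACUTE ACCENT (U+0301)
word = "cafe\u0301"

print(len(word))                   # number of code points
print(len(simple_clusters(word)))  # number of (approximate) clusters
print(word[:4])                    # code point slice that drops the accent
```

Slicing by code point index cuts between the "e" and its accent, which is the kind of break a grapheme-based string class would rule out by construction.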

References:
Characters and Combining Marks
Unicode implementer's guide part 4: grapheme breaking
UnicodeString Class Reference
Enumerating a string by grapheme instead of character
Strings and character encoding in C++

A:

I don't think so, because grapheme breaks are not the only measure of correctness. Also, what counts as a user-perceived character differs depending on the language or script being used. If you are concerned about normalization mode, you will also want to look at Normalizer::concatenate. So I would recommend working in code units most of the time and calculating breaks only when needed.
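As a rough illustration of this advice, the Python sketch below keeps an ordinary string and computes a boundary only at the point where a cut is needed. The function `truncate_at_boundary` is a hypothetical helper, and it assumes a simplified notion of a break (never cut directly before a combining mark) rather than the full UAX #29 rules. The second part shows why concatenation matters for normalization: two strings that are each in NFC can produce a result that is not, which is the situation Normalizer::concatenate is meant to handle in ICU.

```python
import unicodedata

def truncate_at_boundary(text, limit):
    """Cut text to at most `limit` code points without orphaning combining marks.

    Assumption: only combining marks extend a cluster; real grapheme
    breaking has more rules than this.
    """
    if limit >= len(text):
        return text
    cut = limit
    # Back up while the code point right after the cut is a combining mark.
    while cut > 0 and unicodedata.combining(text[cut]) != 0:
        cut -= 1
    return text[:cut]

text = "cafe\u0301s"  # decomposed "cafés"
print(truncate_at_boundary(text, 4))  # backs up so "e" + accent stay together

# Normalization and concatenation: each piece is already NFC on its own...
left, right = "e", "\u0301"
assert unicodedata.normalize("NFC", left) == left
assert unicodedata.normalize("NFC", right) == right

# ...but the plain concatenation is not, so it must be renormalized.
joined = left + right
normalized = unicodedata.normalize("NFC", joined)
print(joined == normalized)  # False: NFC composes the pair into U+00E9
```

The string stays a plain sequence of code points throughout; boundary and normalization checks run only at the truncation and concatenation points, which is the trade-off the answer recommends.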

Steven R. Loomis 2010-10-25 17:28:18
