ansaurus

Question

What does .NET's String.Normalize do?

Answer 1

+2 A:

This link has a good explanation:

http://unicode.org/reports/tr15/#Norm_Forms

From what I can surmise, it's so you can compare two Unicode strings for equality.

Adam 2010-07-20 08:22:18

Answer 2

+7 A:

It makes sure that Unicode strings can be compared for equality, even if they use different code point sequences to represent the same characters.

From Unicode Standard Annex #15:

Essentially, the Unicode Normalization Algorithm puts all combining marks in a specified order, and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms. A binary comparison of the transformed strings will then determine equivalence.
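As a minimal sketch of that last sentence, the helper below normalizes both strings to the same form and then compares them ordinally (a binary comparison). The class name `NormalizeEqualityDemo` and the method `CanonicallyEqual` are made up for illustration; only `String.Normalize`, `NormalizationForm` and `StringComparison.Ordinal` come from .NET. Form C is an arbitrary choice here; any form works as long as both sides use the same one.

```csharp
using System;
using System.Text;

public static class NormalizeEqualityDemo
{
    // Hypothetical helper: true when the two strings are equal after
    // both have been normalized to the same form (Form C here).
    public static bool CanonicallyEqual(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }
}
```

With `"\u00E9"` (é as a single code point) and `"e\u0301"` (e followed by a combining acute accent), a plain ordinal comparison returns false, while `CanonicallyEqual` returns true.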

Oded 2010-07-20 08:22:35

Excellent answer. Provided link is great!

GeReV 2010-07-20 08:54:39

Answer 3

+3 A:

One difference between form C and form D is how accented letters are represented: form C uses a single precomposed letter-with-accent code point where one exists, while form D splits it into a base letter followed by a combining accent.

A side-effect is that this makes it possible to easily create a "remove accents" method.

    using System.Globalization;
    using System.Linq;
    using System.Text;

    public static string RemoveAccents(string input)
    {
        // Normalizing to FormD splits each accented letter into a base letter
        // followed by its combining marks; the filter then drops those marks
        // (and any other non-spacing marks).
        return new string(
            input
            .Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray());
    }

Hans Kesting 2010-07-20 08:25:31

+1 for the interesting example.

GeReV 2010-07-20 08:55:01

Answer 4

+3 A:

In Unicode, a (composed) character can be represented either by a single code point or by a sequence of code points consisting of the base character and its accents.

Wikipedia gives the example of Vietnamese ế (U+1EBF) and its decomposed sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent).

string.Normalize() converts a string to one of the four Unicode normalization forms (C, D, KC and KD). Called without an argument, it uses form C.
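To see the forms side by side, here is a small sketch that normalizes a string and lists the resulting UTF-16 code units. The class `NormalizationFormsDemo` and the method `Describe` are hypothetical names used only for this example. For characters in the Basic Multilingual Plane, such as the ones discussed here, each code unit is one code point.

```csharp
using System.Linq;
using System.Text;

public static class NormalizationFormsDemo
{
    // Hypothetical helper: normalizes s to the given form and lists each
    // UTF-16 code unit of the result as "U+XXXX", separated by spaces.
    public static string Describe(string s, NormalizationForm form)
    {
        return string.Join(" ",
            s.Normalize(form).Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

For ế, `Describe("\u1EBF", NormalizationForm.FormD)` gives `U+0065 U+0302 U+0301`, and `Describe("e\u0302\u0301", NormalizationForm.FormC)` gives `U+1EBF`. The compatibility forms go further: `Describe("\uFB01", NormalizationForm.FormKC)` turns the ﬁ ligature into `U+0066 U+0069`, whereas form C leaves it as `U+FB01`.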

devio 2010-07-20 08:33:14
