ansaurus

Question

How do I remove diacritics (accents) from a string in .NET?

Answer 1

+6 A:

I don't really know what your situation is, but I would strongly encourage you not to do this. A good reference is Joel's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Greg Hewgill 2008-10-30 02:16:27

That's a great article! But I think James has a unique situation that warrants this with respect to urls in the address bar...no?

Codewerks 2008-10-31 15:32:13

URLs are a good excuse. I've been thinking that it could even be a good idea to romanize Arabic for the same purpose, replacing every Arabic letter with an ASCII letter that sounds similar.

Osama ALASSIRY 2008-12-30 12:03:53

One general case when this is valid is when searching and matching data across different languages and sources. Then this makes all sources consistent and suitable for matching (but not for presenting).

grigory 2010-08-17 21:02:08

Answer 2

+1 A:

This question seems similar to How to remove accents and tilde in a C++ std::string.

David Dibben 2008-10-30 02:20:50

Answer 3

+44 A:

I've not used this method, but Michael Kaplan describes a method for doing so in his blog post (with a confusing title) that talks about stripping diacritics:

Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)

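using System.Globalization;
using System.Text;

static string RemoveDiacritics(string stIn) {
  // Decompose: each accented character becomes a base character followed by combining marks.
  string stFormD = stIn.Normalize(NormalizationForm.FormD);
  StringBuilder sb = new StringBuilder();

  for(int ich = 0; ich < stFormD.Length; ich++) {
    UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
    // Keep everything except non-spacing marks (the accents themselves).
    if(uc != UnicodeCategory.NonSpacingMark) {
      sb.Append(stFormD[ich]);
    }
  }

  // Recompose whatever is left.
  return(sb.ToString().Normalize(NormalizationForm.FormC));
}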

Note that this is a followup to his earlier post

Stripping diacritics....

The approach uses String.Normalize to decompose the input string into its constituent characters (basically separating the "base" characters from the combining diacritics), then scans the result and keeps only the base characters. It's a little complicated, but really you're looking at a complicated problem.
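To see what the decomposition step actually produces, here is a minimal sketch in Java (the language of the later answers in this thread, rather than C#). The class and method names are made up for illustration; only java.text.Normalizer is a real API.

```java
import java.text.Normalizer;

final class DecompositionDemo {
    // Returns the UTF-16 code units of the NFD form of the input,
    // formatted like "U+0065 U+0301" (hypothetical helper, for inspection only).
    static String describeNfd(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < decomposed.length(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", (int) decomposed.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute) splits into 'e' plus the combining acute accent.
        System.out.println(describeNfd("\u00e9"));
        // U+00F8 (o with stroke) has no canonical decomposition and comes back unchanged.
        System.out.println(describeNfd("\u00f8"));
    }
}
```

The second call shows a limit of the technique: a letter such as ø is not a base letter plus a combining mark, so removing non-spacing marks leaves it as it is.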

Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.
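For comparison, a table-based version might look roughly like the sketch below. It is written in Java, it is not the code from the linked C++ question, and the table is a deliberately small, lowercase-only sample of French letters, so treat it as an assumption-laden illustration rather than a complete mapping.

```java
final class FrenchAccentTable {
    // Illustrative, incomplete table: a-grave, a-circumflex, c-cedilla, e-acute,
    // e-grave, e-circumflex, e-diaeresis, i-circumflex, i-diaeresis,
    // o-circumflex, u-grave, u-circumflex, u-diaeresis.
    private static final String ACCENTED =
        "\u00e0\u00e2\u00e7\u00e9\u00e8\u00ea\u00eb\u00ee\u00ef\u00f4\u00f9\u00fb\u00fc";
    // Replacement at the same index as the accented letter above.
    private static final String PLAIN = "aaceeeeiiouuu";

    // Replaces every character found in the table; everything else passes through.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            int idx = ACCENTED.indexOf(c);
            out.append(idx >= 0 ? PLAIN.charAt(idx) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("cr\u00e8me br\u00fbl\u00e9e"));
    }
}
```

The trade-off is that anything missing from the table passes through untouched, and input that is already decomposed (base letter plus combining mark) is not handled at all.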

Blair Conrad 2008-10-30 02:29:01

Thanks for this, marked as the answer. Basically, my application needed to take the title of a section of a website and turn it into a "viewname" that our Flash application could relate to our navigation bar. This does precisely what I need.

James Hall 2008-10-30 13:52:39

Perfect solution... far better than regular-expression-based cleanup like this: http://www.myownpercept.com/2009/06/replacing-special-chars-string-csharp/

jalchr 2010-07-07 22:37:43

Answer 4

+4 A:

A warning: This approach might work in some specific cases, but in general you cannot just remove diacritical marks. In some cases and some languages this might change the meaning of the text.

You don't say why you want to do this; if it is for the sake of comparing strings or searching, you are most probably better off using a Unicode-aware library for this.
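As one example of what a Unicode-aware comparison can look like, here is a minimal Java sketch using java.text.Collator at primary strength, which treats accent (and case) differences as insignificant without modifying either string. The class and method names are invented for illustration, and Locale.ENGLISH is an assumption; other locales may order or equate letters differently.

```java
import java.text.Collator;
import java.util.Locale;

final class AccentInsensitiveCompare {
    // True when the two strings differ only in accents and/or case
    // under the English collation rules (hypothetical helper).
    static boolean sameBaseLetters(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        // PRIMARY strength compares base letters only.
        collator.setStrength(Collator.PRIMARY);
        // Treat precomposed and decomposed forms of a character alike.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(sameBaseLetters("resume", "r\u00e9sum\u00e9"));
        System.out.println(sameBaseLetters("resume", "resuma"));
    }
}
```

Because the original strings are never altered, they can still be displayed with their diacritics intact.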

JacquesB 2008-10-30 10:48:35

Answer 5

A:

Thanks to all. The result is not going to be displayed to the user for reading, so the diacritics don't really matter.

James Hall 2008-10-30 13:41:54

Answer 6

+1 A:

In case anyone's interested, here is the Java equivalent:

import java.text.Normalizer;

public class MyClass
{
    public static String removeDiacritics(String input)
    {
        String nrml = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder stripped = new StringBuilder();
        for (int i=0;i<nrml.length();++i)
        {
            if (Character.getType(nrml.charAt(i)) != Character.NON_SPACING_MARK)
            {
                stripped.append(nrml.charAt(i));
            }
        }
        return stripped.toString();
    }
}

KenE 2009-02-13 16:54:53

Instead of stripped += nrml.charAt(i), use a StringBuilder. You have O(n²) runtime hidden here.

Andreas Petersson 2009-09-09 08:50:34

updated per above - nice catch!

KenE 2009-09-09 18:17:53

Answer 7

+3 A:

In case someone is interested, I was looking for something similar and ended up writing the following:

    public static string NormalizeStringForUrl(string name)
    {
        String normalizedString = name.Normalize(NormalizationForm.FormD);
        StringBuilder stringBuilder = new StringBuilder();

        foreach (char c in normalizedString)
        {
            switch (CharUnicodeInfo.GetUnicodeCategory(c))
            {
                case UnicodeCategory.LowercaseLetter:
                case UnicodeCategory.UppercaseLetter:
                case UnicodeCategory.DecimalDigitNumber:
                    stringBuilder.Append(c);
                    break;
                case UnicodeCategory.SpaceSeparator:
                case UnicodeCategory.ConnectorPunctuation:
                case UnicodeCategory.DashPunctuation:
                    stringBuilder.Append('_');
                    break;
            }
        }
        string result = stringBuilder.ToString();
        return String.Join("_", result.Split(new char[] { '_' }
            , StringSplitOptions.RemoveEmptyEntries)); // remove duplicate underscores
    }

Luk 2009-04-23 08:32:00

You should preallocate the StringBuilder buffer to name.Length to minimize memory allocation overhead. That last Split/Join call to remove sequential duplicate _ is interesting. Perhaps we should just avoid adding them in the loop: set a flag when the previous character was an _ and don't emit another one if it is true.
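A rough sketch of those two suggestions, written in Java rather than C# and not taken from the answer's author: the buffer is sized up front, and a flag defers each underscore until the next kept character, so runs of separators collapse and leading or trailing ones are never written. The class name and method name are hypothetical.

```java
import java.text.Normalizer;

final class UrlNameSketch {
    static String normalizeForUrl(String name) {
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        // Preallocate: the result can never be longer than the decomposed input.
        StringBuilder sb = new StringBuilder(decomposed.length());
        // Set when a separator was seen and no underscore has been written for it yet.
        boolean pendingUnderscore = false;

        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            switch (Character.getType(c)) {
                case Character.LOWERCASE_LETTER:
                case Character.UPPERCASE_LETTER:
                case Character.DECIMAL_DIGIT_NUMBER:
                    // Write at most one underscore, and never at the very start.
                    if (pendingUnderscore && sb.length() > 0) {
                        sb.append('_');
                    }
                    pendingUnderscore = false;
                    sb.append(c);
                    break;
                case Character.SPACE_SEPARATOR:
                case Character.CONNECTOR_PUNCTUATION:
                case Character.DASH_PUNCTUATION:
                    pendingUnderscore = true;
                    break;
                default:
                    // Combining marks and all other characters are dropped.
                    break;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalizeForUrl("  Caf\u00e9 -- Cr\u00e8me_br\u00fbl\u00e9e! "));
    }
}
```

Deferring the underscore also removes the need for the final Split/Join pass, since a trailing separator is simply never emitted.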

IDisposable 2009-09-01 21:18:15

2 really good points, I'll rewrite it if I ever get the time to go back to this portion of code :)

Luk 2009-09-08 09:14:07

Answer 8

+1 A:

This works fine in Java.

It basically converts all accented characters into their unaccented counterparts followed by their combining diacritics. You can then use a regex to strip off the diacritics.

import java.text.Normalizer;
import java.util.regex.Pattern;

public String deAccent(String str) {
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD); 
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}
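One possible variation, offered as a sketch rather than a correction: \p{InCombiningDiacriticalMarks} matches only the Unicode block U+0300 to U+036F, whereas \p{Mn} matches every character in the "Mark, nonspacing" general category, which is the same test the C# and Java category-based answers use. The class name is hypothetical, and the pattern is compiled once instead of on every call.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class MarkStripper {
    // Compiled once; \p{Mn} is the general category "Mark, nonspacing".
    private static final Pattern NONSPACING_MARKS = Pattern.compile("\\p{Mn}+");

    static String deAccent(String str) {
        // Decompose first so accents become separate combining characters.
        String nfd = Normalizer.normalize(str, Normalizer.Form.NFD);
        return NONSPACING_MARKS.matcher(nfd).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(deAccent("caf\u00e9"));
        // U+20D7 (combining right arrow above) lies outside U+0300-U+036F
        // but is still a non-spacing mark, so it is removed here.
        System.out.println(deAccent("x\u20d7"));
    }
}
```

Whether the wider match is desirable depends on the input; it also strips non-spacing marks that are not Latin-style accents, such as vowel points in other scripts.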

hashable 2009-07-31 21:16:18

Answer 9

+1 A:

this did the trick for me...

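string accentedStr = "crème brûlée"; // sample input; the original snippet left this unassigned
byte[] tempBytes;
// Appears to rely on the ISO-8859-8 encoder substituting "best fit" ASCII letters for accented ones.
tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
// Bytes in the ASCII range decode to the same characters under UTF-8.
string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes);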

quick&short!

azrafe7 2010-01-18 14:16:44

Answer 10

A:

This is the VB version (works with Greek):

Imports System.Text
Imports System.Globalization

Public Function RemoveDiacritics(ByVal s As String) As String
    Dim normalizedString As String
    Dim stringBuilder As New StringBuilder
    normalizedString = s.Normalize(NormalizationForm.FormD)
    Dim i As Integer
    Dim c As Char
    For i = 0 To normalizedString.Length - 1
        c = normalizedString(i)
        If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
            stringBuilder.Append(c)
        End If
    Next
    Return stringBuilder.ToString()
End Function

Stefanos Michanetzis 2010-07-28 13:21:20
