I'm trying to convert some strings that are in Canadian French, and I'd like to strip the accent marks from the letters while keeping the letters themselves (e.g. convert é to e).

What is the best method for achieving this?

+6  A: 

I don't really know what your situation is, but I would strongly encourage you not to do this. A good reference is Joel's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Greg Hewgill
That's a great article! But I think James has a unique situation that warrants this with respect to URLs in the address bar, no?
Codewerks
URLs are a good excuse. I've been thinking that it could even be a good idea to romanize Arabic for the same purpose, replacing every Arabic letter with an ASCII letter that sounds similar.
Osama ALASSIRY
One general case when this is valid is when searching and matching data across different languages and sources. Then this makes all sources consistent and suitable for matching (but not for presenting).
grigory
+1  A: 

This question seems similar to How to remove accents and tilde in a C++ std::string.

David Dibben
+44  A: 

I've not used this method, but Michael Kaplan describes a method for doing so in his blog post (with a confusing title) that talks about stripping diacritics:

Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)

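using System.Globalization;
using System.Text;

static string RemoveDiacritics(string stIn) {
  // FormD splits each accented character into a base character plus combining marks.
  string stFormD = stIn.Normalize(NormalizationForm.FormD);
  StringBuilder sb = new StringBuilder();

  for(int ich = 0; ich < stFormD.Length; ich++) {
    UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
    // Keep everything except the non-spacing (combining) marks.
    if(uc != UnicodeCategory.NonSpacingMark) {
      sb.Append(stFormD[ich]);
    }
  }

  // Recompose whatever is left into the usual precomposed form.
  return(sb.ToString().Normalize(NormalizationForm.FormC));
}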

Note that this is a followup to his earlier post

Stripping diacritics....

The approach uses String.Normalize with FormD to decompose each character into its base character followed by any combining marks, then scans the result and keeps everything except the non-spacing marks (the diacritics). It's a little complicated, but then you're looking at a complicated problem.
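To make the decomposition step concrete, here is a small Java sketch (Java is used elsewhere in this thread; the class and method names are made up for illustration). It only prints the code points that canonical decomposition produces, so you can see the base letter and the combining mark as separate characters.

```java
import java.text.Normalizer;

public class DecompositionDemo {
    // Hypothetical helper: lists the code points of the NFD form as "U+XXXX" values.
    public static String describeNfd(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        decomposed.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The escape below is the precomposed letter e-acute.
        System.out.println(describeNfd("\u00e9")); // prints "U+0065 U+0301"
    }
}
```

U+0065 is the plain letter e and U+0301 is the combining acute accent, which falls in the non-spacing mark category that the loop above filters out.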

Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.
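For comparison, a table-based version might look roughly like the sketch below. This is not the code from the linked C++ question; the class name, the method name, and the exact list of letters are my own assumptions about what a French-only table would contain, and it assumes the input uses precomposed characters.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class FrenchAccentTable {
    // Hypothetical lookup table: accented French letter -> plain replacement.
    private static final Map<Character, String> TABLE = new HashMap<>();

    static {
        // a-grave, a-circumflex, c-cedilla, e-acute, e-grave, e-circumflex, e-diaeresis,
        // i-circumflex, i-diaeresis, o-circumflex, u-grave, u-circumflex, u-diaeresis, y-diaeresis
        String accented = "\u00e0\u00e2\u00e7\u00e9\u00e8\u00ea\u00eb\u00ee\u00ef\u00f4\u00f9\u00fb\u00fc\u00ff";
        String plain = "aaceeeeiiouuuy";
        for (int i = 0; i < accented.length(); i++) {
            char lower = accented.charAt(i);
            String base = String.valueOf(plain.charAt(i));
            TABLE.put(lower, base);
            TABLE.put(Character.toUpperCase(lower), base.toUpperCase(Locale.ROOT));
        }
        // Ligatures have no canonical decomposition, so a table is one way to handle them.
        TABLE.put('\u0153', "oe");
        TABLE.put('\u0152', "OE");
        TABLE.put('\u00e6', "ae");
        TABLE.put('\u00c6', "AE");
    }

    public static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = TABLE.get(c);
            // Characters that are not in the table pass through unchanged.
            out.append(replacement != null ? replacement : String.valueOf(c));
        }
        return out.toString();
    }
}
```

The trade-off is that the table only knows the letters you put in it, whereas the normalization approach handles any character that decomposes into a base letter plus combining marks.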

Blair Conrad
Thanks for this, marked as the answer. Basically, my application needed to take the title of a website section and turn it into a "viewname" that our Flash application could relate to our navigation bar. This does precisely what I need.
James Hall
Perfect solution, far better than regular-expression-based cleanup like this: http://www.myownpercept.com/2009/06/replacing-special-chars-string-csharp/
jalchr
+4  A: 

A warning: This approach might work in some specific cases, but in general you cannot just remove diacritical marks. In some cases and some languages this might change the meaning of the text.

You don't say why you want to do this; if it is for the sake of comparing strings or searching, you are most probably better off using a Unicode-aware library for this.
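As one possible illustration of the comparison case, Java's java.text.Collator can be told to ignore accent differences without modifying the strings at all. This is only a sketch: the class and method names are invented, and the choice of the Canadian French locale is an assumption based on the question.

```java
import java.text.Collator;
import java.util.Locale;

public class AccentInsensitiveCompare {
    // Hypothetical helper: true when the two strings differ at most by accents or case.
    public static boolean sameBaseLetters(String a, String b) {
        Collator collator = Collator.getInstance(Locale.CANADA_FRENCH);
        // PRIMARY strength compares base letters only; accents and case are ignored.
        collator.setStrength(Collator.PRIMARY);
        // Treat precomposed and decomposed forms of the same text alike.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }
}
```

The original strings stay intact for display, and only the comparison is accent-insensitive.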

JacquesB
A: 

Thanks to all. The result is not going to be displayed to users for reading, so the diacritics don't really matter.

James Hall
+1  A: 

In case anyone's interested, here is the Java equivalent:

import java.text.Normalizer;

public class MyClass
{
    public static String removeDiacritics(String input)
    {
        String nrml = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder stripped = new StringBuilder();
        for (int i=0;i<nrml.length();++i)
        {
            if (Character.getType(nrml.charAt(i)) != Character.NON_SPACING_MARK)
            {
                stripped.append(nrml.charAt(i));
            }
        }
        return stripped.toString();
    }
}
KenE
Instead of stripped += nrml.charAt(i), use a StringBuilder. You have O(n²) runtime hidden here.
Andreas Petersson
updated per above - nice catch!
KenE
+3  A: 

In case someone is interested, I was looking for something similar and ended up writing the following:

    public static string NormalizeStringForUrl(string name)
    {
        String normalizedString = name.Normalize(NormalizationForm.FormD);
        StringBuilder stringBuilder = new StringBuilder();

        foreach (char c in normalizedString)
        {
            switch (CharUnicodeInfo.GetUnicodeCategory(c))
            {
                case UnicodeCategory.LowercaseLetter:
                case UnicodeCategory.UppercaseLetter:
                case UnicodeCategory.DecimalDigitNumber:
                    stringBuilder.Append(c);
                    break;
                case UnicodeCategory.SpaceSeparator:
                case UnicodeCategory.ConnectorPunctuation:
                case UnicodeCategory.DashPunctuation:
                    stringBuilder.Append('_');
                    break;
            }
        }
        string result = stringBuilder.ToString();
        return String.Join("_", result.Split(new char[] { '_' }
            , StringSplitOptions.RemoveEmptyEntries)); // remove duplicate underscores
    }
Luk
You should preallocate the StringBuilder buffer to name.Length to minimize memory allocation overhead. That last Split/Join call to remove sequential duplicate _ characters is interesting. Perhaps we should just avoid adding them in the loop: set a flag when the previous character was an _ and don't emit another one while it is true.
IDisposable
2 really good points, I'll rewrite it if I ever get the time to go back to this portion of code :)
Luk
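For what it's worth, the two suggestions in the comments above (preallocating the buffer and tracking the previous underscore with a flag) could be sketched like this in Java. This is not Luk's code; the class and method names are hypothetical, and it simply mirrors the same category switch with the Split/Join step replaced by a flag.

```java
import java.text.Normalizer;

public class UrlSlug {
    public static String normalizeForUrl(String name) {
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        // Preallocate: the result is never longer than the decomposed input.
        StringBuilder out = new StringBuilder(decomposed.length());
        // Starts as true so that no underscore is emitted at the very beginning.
        boolean lastWasUnderscore = true;
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            switch (Character.getType(c)) {
                case Character.LOWERCASE_LETTER:
                case Character.UPPERCASE_LETTER:
                case Character.DECIMAL_DIGIT_NUMBER:
                    out.append(c);
                    lastWasUnderscore = false;
                    break;
                case Character.SPACE_SEPARATOR:
                case Character.CONNECTOR_PUNCTUATION:
                case Character.DASH_PUNCTUATION:
                    // Emit at most one underscore per run of separators.
                    if (!lastWasUnderscore) {
                        out.append('_');
                        lastWasUnderscore = true;
                    }
                    break;
                default:
                    // Everything else, including combining marks, is dropped.
                    break;
            }
        }
        // Drop a trailing underscore, as RemoveEmptyEntries did in the original.
        int length = out.length();
        if (length > 0 && out.charAt(length - 1) == '_') {
            out.setLength(length - 1);
        }
        return out.toString();
    }
}
```

With this sketch, a title such as "Nos  Services - Québec!" should come out as "Nos_Services_Quebec".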
+1  A: 

This works fine in Java.

It basically converts all accented characters into their deAccented counterparts followed by their combining diacritics. Now you can use a regex to strip off the diacritics.

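import java.text.Normalizer;
import java.util.regex.Pattern;

// Compiled once and reused. Note that this block only covers U+0300 to U+036F;
// \p{Mn} would match every non-spacing mark.
private static final Pattern DIACRITICS =
    Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

public String deAccent(String str) {
    // NFD turns each accented letter into its base letter followed by combining marks.
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
    return DIACRITICS.matcher(nfdNormalizedString).replaceAll("");
}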
hashable
+1  A: 

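This did the trick for me:

// Appears to rely on the encoder's best-fit fallback: ISO-8859-8 (Hebrew) has no
// accented Latin letters, so each one is mapped to its closest ASCII character.
string accentedStr = "é"; // example input
byte[] tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
// For Latin input the resulting bytes are plain ASCII, so decoding them as UTF-8 works.
string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes);

Quick and short!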

azrafe7
A: 

This is the VB version (it also works with Greek):

Imports System.Text

Imports System.Globalization

Public Function RemoveDiacritics(ByVal s As String) As String
    Dim normalizedString As String
    Dim stringBuilder As New StringBuilder
    normalizedString = s.Normalize(NormalizationForm.FormD)
    Dim i As Integer
    Dim c As Char
    For i = 0 To normalizedString.Length - 1
        c = normalizedString(i)
        If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
            stringBuilder.Append(c)
        End If
    Next
    Return stringBuilder.ToString()
End Function
Stefanos Michanetzis