I'm trying to convert some strings that are in French Canadian and basically, I'd like to be able to take out the French accent marks in the letters while keeping the letter (e.g. convert é to e).
What is the best method for achieving this?
I don't really know what your situation is, but I would strongly encourage you not to do this. A good reference is Joel's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
This question seems similar to How to remove accents and tilde in a C++ std::string.
I've not used this method, but Michael Kaplan describes a method for doing so in his blog post (with a confusing title) that talks about stripping diacritics:
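using System.Globalization;
using System.Text;

static string RemoveDiacritics(string stIn) {
    // Form D splits each accented letter into a base letter plus combining marks.
    string stFormD = stIn.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder();
    for (int ich = 0; ich < stFormD.Length; ich++) {
        UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
        // Keep everything except the combining (non-spacing) marks.
        if (uc != UnicodeCategory.NonSpacingMark) {
            sb.Append(stFormD[ich]);
        }
    }
    // Recompose whatever is left.
    return sb.ToString().Normalize(NormalizationForm.FormC);
}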
Note that this is a followup to his earlier post
The approach uses String.Normalize to decompose the input string (basically separating the "base" characters from the diacritics), then scans the result and retains only the base characters. It's a little complicated, but then you're looking at a complicated problem.
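To see what the decomposition step actually produces, here is a minimal Java sketch (Java rather than C#; the class and method names are made up for illustration). It normalizes a string to form D and lists the resulting UTF-16 code units:

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only.
final class DecompositionDemo {
    // Returns the code units of the NFD form as "U+XXXX" values separated by spaces.
    static String describeNfd(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < decomposed.length(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", (int) decomposed.charAt(i)));
        }
        return out.toString();
    }
}
```

For é (U+00E9) this gives U+0065 U+0301: a plain e followed by the combining acute accent, which is exactly the non-spacing mark that the loop above throws away.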
Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.
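For what it's worth, a table-based version for French could look roughly like this in Java. This is only a sketch: the class name is hypothetical and the table is my own guess at the relevant letters (à â ä ç é è ê ë î ï ô ö ù û ü ÿ and their capitals). Ligatures such as œ are left out because they would map to two letters.

```java
// Hypothetical sketch of a lookup-table approach; the table is an assumption, not an official list.
final class FrenchAccentTable {
    // Accented letters, written as Unicode escapes so the source stays ASCII-only.
    private static final String ACCENTED =
        "\u00e0\u00e2\u00e4\u00e7\u00e9\u00e8\u00ea\u00eb\u00ee\u00ef\u00f4\u00f6\u00f9\u00fb\u00fc\u00ff"
      + "\u00c0\u00c2\u00c4\u00c7\u00c9\u00c8\u00ca\u00cb\u00ce\u00cf\u00d4\u00d6\u00d9\u00db\u00dc\u0178";
    // Replacement for the letter at the same index in ACCENTED.
    private static final String PLAIN =
        "aaaceeeeiioouuuy"
      + "AAACEEEEIIOOUUUY";

    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            int index = ACCENTED.indexOf(c);
            // Letters found in the table are replaced; everything else passes through unchanged.
            out.append(index >= 0 ? PLAIN.charAt(index) : c);
        }
        return out.toString();
    }
}
```

Note that this only recognizes precomposed letters. If the input might contain a base letter followed by a combining mark, it would need to be normalized to form C first.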
A warning: This approach might work in some specific cases, but in general you cannot just remove diacritical marks. In some cases and some languages this changes the meaning of the text. In French, for example, côte (coast), côté (side) and coté (rated) all collapse to cote.
You don't say why you want to do this; if it is for the sake of comparing or searching strings, you are most probably better off using a Unicode-aware library for this.
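As an illustration of the comparison case, Java's standard library has java.text.Collator, which can compare strings while ignoring accents without modifying them. A minimal sketch, with a hypothetical class and method name, and assuming the French Canadian locale is the right one for your data:

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class AccentInsensitiveCompare {
    static boolean sameIgnoringAccents(String a, String b) {
        Collator collator = Collator.getInstance(Locale.CANADA_FRENCH);
        // PRIMARY strength compares base letters only, so accent AND case differences are ignored.
        collator.setStrength(Collator.PRIMARY);
        // Treat precomposed and decomposed forms of the same letter alike.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }
}
```

With this, sameIgnoringAccents("été", "ete") is true, and the original strings stay intact for display. Be aware that PRIMARY strength also ignores case.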
Thanks to all. The result is not going to be displayed to the user for reading purposes, so the diacritics don't really matter.
In case anyone's interested, here is the Java equivalent:
import java.text.Normalizer;

public class MyClass
{
    public static String removeDiacritics(String input)
    {
        String nrml = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder stripped = new StringBuilder();
        for (int i = 0; i < nrml.length(); ++i)
        {
            if (Character.getType(nrml.charAt(i)) != Character.NON_SPACING_MARK)
            {
                stripped.append(nrml.charAt(i));
            }
        }
        return stripped.toString();
    }
}
In case someone is interested, I was looking for something similar and ended up writing the following:
public static string NormalizeStringForUrl(string name)
{
    string normalizedString = name.Normalize(NormalizationForm.FormD);
    StringBuilder stringBuilder = new StringBuilder();
    foreach (char c in normalizedString)
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.DecimalDigitNumber:
                stringBuilder.Append(c);
                break;
            case UnicodeCategory.SpaceSeparator:
            case UnicodeCategory.ConnectorPunctuation:
            case UnicodeCategory.DashPunctuation:
                stringBuilder.Append('_');
                break;
            // Everything else, including the combining marks, is dropped.
        }
    }
    string result = stringBuilder.ToString();
    // Collapse runs of underscores and trim them from both ends.
    return String.Join("_", result.Split(new char[] { '_' }, StringSplitOptions.RemoveEmptyEntries));
}
This works fine in Java.
Normalizing to NFD converts each accented character into its unaccented counterpart followed by its combining diacritics. You can then use a regex to strip off the diacritics.
import java.text.Normalizer;
import java.util.regex.Pattern;

public String deAccent(String str) {
    // Decompose first so that the accents become separate combining characters.
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
    // Matches runs of characters in the Combining Diacritical Marks block.
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}
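This did the trick for me:
string accentedStr = "crème brûlée"; // example input
// ISO-8859-8 is a Hebrew code page with no accented Latin letters, so this seems
// to rely on the encoder falling back to the closest unaccented letter.
byte[] tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
// For Latin text the resulting bytes are plain ASCII, so decoding them as UTF-8 is safe.
string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes);
Quick and short!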
This is the VB version (it also works with Greek):
Imports System.Text
Imports System.Globalization

Public Function RemoveDiacritics(ByVal s As String) As String
    Dim normalizedString As String
    Dim stringBuilder As New StringBuilder
    normalizedString = s.Normalize(NormalizationForm.FormD)
    Dim i As Integer
    Dim c As Char
    For i = 0 To normalizedString.Length - 1
        c = normalizedString(i)
        If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
            stringBuilder.Append(c)
        End If
    Next
    Return stringBuilder.ToString()
End Function