ansaurus

Question

Using C# to detect whether a filename character is considered international

Answer 1

+1 A:

(a) Simple. Check for any code points that are greater than 127.

(b) Try NFKD normalization and/or uni2ascii.
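
A minimal sketch of option (a), not part of the original answer: the class and method names are hypothetical. It scans the UTF-16 code units of the name and reports whether any lies above 127. Surrogate halves are also above 127, so characters outside the Basic Multilingual Plane are caught as well.

```csharp
using System;

public static class FileNameChecks
{
    // Returns true if any UTF-16 code unit in the name is above 127,
    // i.e. the name contains at least one non-ASCII character.
    public static bool ContainsNonAscii(string name)
    {
        foreach (char ch in name)
        {
            if (ch > 127) return true;
        }
        return false;
    }
}
```

For example, FileNameChecks.ContainsNonAscii("\u00C5rhus.txt") is true, while a plain name such as "report.txt" gives false.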

dan04 2010-03-20 06:24:14

Which byte is the code point? I could investigate this, but if you know I'd appreciate a hint. The uni2ascii utility does not seem to be available for Windows, although C source is provided so I could look at that. I would prefer not to reinvent the wheel by implementing the normalizations myself. Is there a C# library or Windows API for this?

Morten Mertner 2010-03-20 06:35:41

A Unicode code point is a 21-bit number. It can be encoded as 1-4 bytes in UTF-8, 1-2 UTF-16 code units, or 1 UTF-32 code unit. All 3 of these use single code units in the 0-127 range for ASCII characters. The Windows API has a function called NormalizeString.
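
To make the code unit counts concrete, here is a small sketch using the standard System.Text.Encoding classes; the helper class and its method names are made up for illustration. A C# string is already UTF-16, so its Length is the number of UTF-16 code units.

```csharp
using System;
using System.Text;

public static class CodeUnitDemo
{
    // Number of bytes the text occupies in UTF-8.
    public static int Utf8Bytes(string s) => Encoding.UTF8.GetByteCount(s);

    // Number of UTF-16 code units (a .NET string is UTF-16 internally).
    public static int Utf16Units(string s) => s.Length;

    // Number of UTF-32 code units (4 bytes each), i.e. code points.
    public static int Utf32Units(string s) => Encoding.UTF32.GetByteCount(s) / 4;
}
```

With these helpers, "A" takes 1 unit in every encoding, "\u00C5" (Å) takes 2 UTF-8 bytes but 1 UTF-16 and 1 UTF-32 unit, and "\U0001F600" takes 4 UTF-8 bytes, 2 UTF-16 units and 1 UTF-32 unit.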

dan04 2010-03-20 07:56:45

Thanks, I'll dig into this.

Morten Mertner 2010-03-20 17:28:47

Answer 2

A:

If you don't mind brute force, you can try something like this:

string name = "Århus.txt";
string kd = name.Normalize(NormalizationForm.FormKD);
byte[] kd_bytes = Encoding.Unicode.GetBytes(kd);
byte[] ascii_bytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, kd_bytes);
string flattened = Encoding.ASCII.GetString(ascii_bytes);

This will convert Århus.txt to A?rhus.txt: the KD form splits Å into a plain A followed by a combining ring (U+030A), and the conversion to 7-bit ASCII replaces that combining mark with ?. What to do with the little ?'s left over is up to you.
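
One possible way to deal with the leftovers, sketched here as an assumption rather than as part of the original answer, is to skip the Encoding round trip and simply drop every character above 127 after normalizing. The AsciiFlattener class and Flatten method are hypothetical names.

```csharp
using System;
using System.Text;

public static class AsciiFlattener
{
    // Decomposes the name (FormKD), then keeps only characters in the
    // 0-127 range, so detached combining marks are dropped instead of
    // being turned into '?'. Characters with no decomposition are
    // removed entirely, which may or may not be acceptable.
    public static string Flatten(string name)
    {
        string kd = name.Normalize(NormalizationForm.FormKD);
        var sb = new StringBuilder(kd.Length);
        foreach (char ch in kd)
        {
            if (ch <= 127) sb.Append(ch);
        }
        return sb.ToString();
    }
}
```

With this sketch, AsciiFlattener.Flatten("\u00C5rhus.txt") returns "Arhus.txt".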

Your mileage may vary on the other characters, but I would guess the KD normalization should do the trick. I have not worked on code page conversions for years now, but I found the question intriguing.

EDIT:

I just tried æÆØ and they all converted to ? (these characters have no decomposition, so normalization leaves them untouched), so this may be too lossy for you. Still, it may give you some clues that lead to an answer.

Jim Flood 2010-03-20 07:43:20

Thanks, I'll try to experiment with this approach.

Morten Mertner 2010-03-20 17:29:13

Answer 3

+1 A:

Sad problem to have in this day and age. Clearly the NFD form that the Mac uses is causing you this headache. One thing you could consider is removing the diacritics from the glyphs that cause NFD to be different from NFC.

I'm not 100% sure this is completely accurate (especially for Asian scripts), but it ought to be close:

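// Requires: using System.Text;
public static string RemoveDiacriticals(string txt) {
  // Decompose so that each diacritic becomes a separate combining character
  string nfd = txt.Normalize(NormalizationForm.FormD);
  StringBuilder retval = new StringBuilder(nfd.Length);
  foreach (char ch in nfd) {
    // Skip characters in the combining-mark blocks
    if (ch >= '\u0300' && ch <= '\u036f') continue;  // Combining Diacritical Marks
    if (ch >= '\u1dc0' && ch <= '\u1de6') continue;  // Combining Diacritical Marks Supplement
    if (ch >= '\ufe20' && ch <= '\ufe26') continue;  // Combining Half Marks
    if (ch >= '\u20d0' && ch <= '\u20f0') continue;  // Combining Diacritical Marks for Symbols
    retval.Append(ch);
  }
  return retval.ToString();
}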
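
An alternative sketch, not from the answer above: instead of hard-coded ranges, it asks CharUnicodeInfo for each character's Unicode category and drops those classified as NonSpacingMark. This is a different technique from the range check, and the class and method names are hypothetical. The final recomposition to NFC is an added assumption about the desired output form.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticStripper
{
    // Decomposes to NFD, drops every character whose Unicode category is
    // NonSpacingMark (Mn), then recomposes what is left to NFC.
    public static string RemoveNonSpacingMarks(string txt)
    {
        string nfd = txt.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(nfd.Length);
        foreach (char ch in nfd)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
                sb.Append(ch);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Characters that have no decomposition, such as Ø (U+00D8), pass through unchanged with either approach.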

Hans Passant 2010-03-20 11:48:23

This looks like what I was looking for. I think I might go for an approach where I normalize the string with different normalization forms and compare the results. This, combined with dan04's answer, should solve the first part of the puzzle. I still need to figure out what the best ASCII replacement characters are, preferably with a code solution that doesn't require tables or dictionaries. Will post a new question for that once I've got some updated code to show.
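
The compare-the-normalizations idea could look roughly like the sketch below. It is only an illustration under the assumption that "different" means NFC and NFD disagree; the class and method names are hypothetical. The comparison is ordinal on purpose, because culture-sensitive string comparison can treat composed and decomposed forms as equal.

```csharp
using System;
using System.Text;

public static class NormalizationChecks
{
    // True if the name changes when decomposed, i.e. it contains at least
    // one character that NFC and NFD represent differently.
    public static bool DiffersBetweenNfcAndNfd(string name)
    {
        string nfc = name.Normalize(NormalizationForm.FormC);
        string nfd = name.Normalize(NormalizationForm.FormD);
        return !string.Equals(nfc, nfd, StringComparison.Ordinal);
    }
}
```

This flags names such as "\u00C5rhus.txt" regardless of which form they arrive in, but not names made only of ASCII or of characters without a decomposition.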

Morten Mertner 2010-03-20 17:21:29
