views:

238

answers:

3

I've written a small console application (source below) to locate and optionally rename files containing international characters, as they are a source of constant pain with most source control systems (some background on this below). The code I'm using has a simple dictionary with characters to look for and replace (and nukes every other character that uses more than one byte of storage), but it feels very hackish. What's the right way to (a) find out whether a character is international? and (b) what the best ASCII substitution character would be?

Let me provide some background information on why this is needed. It so happens that the danish Å character has two different encodings in UTF-8, both representing the same symbol. These are known as NFC and NFD encodings. Windows and Linux will create NFC encoding by default but respect whatever encoding it is given. Mac will convert all names (when saving to a HFS+ partition) to NFD and therefore returns a different byte stream for the name of a file created on Windows. This effectively breaks Subversion, Git and lots of other utilities that don't care to properly handle this scenario.

I'm currently evaluating Mercurial, which turns out to be even worse at handling international characters.. being fairly tired of these problems, either source control or the international character would have to go, and so here we are.

My current implementation:

public class Checker
{
    private Dictionary<char, string> internationals = new Dictionary<char, string>();
    private List<char> keep = new List<char>();
    private List<char> seen = new List<char>();

    public Checker()
    {
        internationals.Add( 'æ', "ae" );
        internationals.Add( 'ø', "oe" );
        internationals.Add( 'å', "aa" );
        internationals.Add( 'Æ', "Ae" );
        internationals.Add( 'Ø', "Oe" );
        internationals.Add( 'Å', "Aa" );

        internationals.Add( 'ö', "o" );
        internationals.Add( 'ü', "u" );
        internationals.Add( 'ä', "a" );
        internationals.Add( 'é', "e" );
        internationals.Add( 'è', "e" );
        internationals.Add( 'ê', "e" );

        internationals.Add( '¦', "" );
        internationals.Add( 'Ã', "" );
        internationals.Add( '©', "" );
        internationals.Add( ' ', "" );
        internationals.Add( '§', "" );
        internationals.Add( '¡', "" );
        internationals.Add( '³', "" );
        internationals.Add( '­', "" );
        internationals.Add( 'º', "" );

        internationals.Add( '«', "-" );
        internationals.Add( '»', "-" );
        internationals.Add( '´', "'" );
        internationals.Add( '`', "'" );
        internationals.Add( '"', "'" );
        internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 147 } )[ 0 ], "-" );
        internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 148 } )[ 0 ], "-" );
        internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 153 } )[ 0 ], "'" );
        internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 166 } )[ 0 ], "." );

        keep.Add( '-' );
        keep.Add( '=' );
        keep.Add( '\'' );
        keep.Add( '.' );
    }

    public bool IsInternationalCharacter( char c )
    {
        var s = c.ToString();
        byte[] bytes = Encoding.UTF8.GetBytes( s );
        if( bytes.Length > 1 && ! internationals.ContainsKey( c ) && ! seen.Contains( c ) )
        {
            Console.WriteLine( "X '{0}' ({1})", c, string.Join( ",", bytes ) );
            seen.Add( c );
            if( ! keep.Contains( c ) )
            {
                internationals[ c ] = "";
            }
        }
        return internationals.ContainsKey( c );
    }

    public bool HasInternationalCharactersInName( string name, out string safeName )
    {
        StringBuilder sb = new StringBuilder();
        Array.ForEach( name.ToCharArray(), c => sb.Append( IsInternationalCharacter( c ) ? internationals[ c ] : c.ToString() ) );
        int length = sb.Length;
        sb.Replace( "  ", " " );
        while( sb.Length != length )
        {
            sb.Replace( "  ", " " );
        }
        safeName = sb.ToString().Trim();
        string namePart = Path.GetFileNameWithoutExtension( safeName );
        if( namePart.EndsWith( "." ) )
            safeName = namePart.Substring( 0, namePart.Length - 1 ) + Path.GetExtension( safeName );
        return name != safeName;
    }
}

And this would be invoked like this:

FileInfo file = new File( "Århus.txt" );
string safeName;    
if( checker.HasInternationalCharactersInName( file.Name, out safeName ) )
{
    // rename file 
}
+1  A: 

(a) Simple. Check for any code points that are greater than 127.

(b) Try NKFD normalization and/or uni2ascii.

dan04
Which byte is the code point? I could investigate this but if you know I'd appreciate a hint.The uni2ascii utility does not seem to be available for Windows, although C source is provided so I could look at that. Would prefer not having to invent the wheel by implementing the normalizations myself - is there not a C# library or Windows API for this?
Morten Mertner
A Unicode code-point is a 21-bit number. This can be encoded as 1-4 bytes in UTF-8, 1-2 UTF-16 code units, or 1 UTF-32 code unit. All 3 of these use single code units in the 0-127 range for ASCII characters.The Windows API has a function called NormalizeString.
dan04
Thanks, I'll dig into this.
Morten Mertner
A: 

If you don't mind brute force, you can try something like this:

string name = "Århus.txt";
string kd = name.Normalize(NormalizationForm.FormKD);
byte[] kd_bytes = Encoding.Unicode.GetBytes(kd);
byte[] ascii_bytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, kd_bytes);
string flattened = Encoding.ASCII.GetString(ascii_bytes);

This will convert Århus.txt to A?rhus.txt, because the KD form breaks the Å apart, and the conversion to 7-bit ASCII loses the diacritical mark. What to do with the little ?'s left over is up to you.

Your mileage may vary on the other characters, but I would guess the KD normalization should do the trick. I have not worked on code page conversions for years now, but I found the question intriguing.

EDIT:

I just tried æÆØ and they all converted to ?, so this may be too lossy for you. Still, it may give you some clues that lead to an answer.

Jim Flood
Thanks, I'll try to experiment with this approach.
Morten Mertner
+1  A: 

Sad problem to have in this day and age. Clearly the NFD form that the MAC uses is causing you this headache. One thing you could consider is removing the diacritics from the glyphs that causes NFD to be different from NFC.

I'm not 100% sure this is completely accurate (especially for Asian scripts), but it ought to be close:

public static string RemoveDiacriticals(string txt) {
  string nfd = txt.Normalize(NormalizationForm.FormD);
  StringBuilder retval = new StringBuilder(nfd.Length);
  foreach (char ch in nfd) {
    if (ch >= '\u0300' && ch <= '\u036f') continue;
    if (ch >= '\u1dc0' && ch <= '\u1de6') continue;
    if (ch >= '\ufe20' && ch <= '\ufe26') continue;
    if (ch >= '\u20d0' && ch <= '\u20f0') continue;
    retval.Append(ch);
  }
  return retval.ToString();
}
Hans Passant
This looks like what I was looking for. I think I might go for an approach where I normalize the string with different norms and compare the results. This, combined with dan04's answer, should solve the first part of the puzzle.I still need to figure out what the best ASCII replacement characters are, and preferably with a code solution that doesn't require tables or dictionaries. Will post a new question for that once I've got some updated code to show.
Morten Mertner