views:

80

answers:

4

With a given array of file names, the most simpliest way to sort it by file extension is like this:

Array.Sort(fileNames,
    (x, y) => Path.GetExtension(x).CompareTo(Path.GetExtension(y)));

The problem is that on very long list (~800k) it takes very long to sort, while sorting by the whole file name is faster for a couple of seconds!

Theoretical, there is a way to optimize it: instead of using Path.GetExtension() and compare the newly created extension-only-strings, we can provide a Comparison than compares the existing filename strings starting from the LastIndexOf('.') without creating new strings.

Now, suppose i found the LastIndexOf('.'), i want to reuse native .NET's StringComparer and apply it only to the part on string after the LastIndexOf('.'), to preserve all culture consideration. Didn't found a way to do that.

Any ideas?

Edit:

With tanascius's idea to use char.CompareTo() method, i came with my Uber-Fast-File-Extension-Comparer, now it sorting by extension 3x times faster! it even faster than all methods that uses Path.GetExtension() in some manner. what do you think?

Edit 2:

I found that this implementation do not considering culture since char.CompareTo() method do not considering culture, so this is not a perfect solution.

Any ideas?

    public static int CompareExtensions(string filePath1, string filePath2)
    {
        if (filePath1 == null && filePath2 == null)
        {
            return 0;
        }
        else if (filePath1 == null)
        {
            return -1;
        }
        else if (filePath2 == null)
        {
            return 1;
        }

        int i = filePath1.LastIndexOf('.');
        int j = filePath2.LastIndexOf('.');

        if (i == -1)
        {
            i = filePath1.Length;
        }
        else
        {
            i++;
        }

        if (j == -1)
        {
            j = filePath2.Length;
        }
        else
        {
            j++;
        }

        for (; i < filePath1.Length && j < filePath2.Length; i++, j++)
        {
            int compareResults = filePath1[i].CompareTo(filePath2[j]);

            if (compareResults != 0)
            {
                return compareResults;
            }
        }

        if (i >= filePath1.Length && j >= filePath2.Length)
        {
            return 0;
        }
        else if (i >= filePath1.Length)
        {
            return -1;
        }
        else
        {
            return 1;
        }
    }
+1  A: 

Create a new array that contains each of the filenames in ext.restofpath format (or some sort of pair/tuple format that can default sort on the extension without further transformation). Sort that, then convert it back.

This is faster because instead of having to retrieve the extension many times for each element (since you're doing something like N log N compares), you only do it once (and then move it back once).

Amber
+1  A: 

You can write a comparer that compares each character of the extension. char has a CompareTo(), too (see here).

Basically you loop until you have no more chars left in at least one string or one CompareTo() returns a value != 0.

EDIT: In response to the edits of the OP

The performance of your comparer method can be significantly improved. See the following code. Additionally I added the line

string.Compare( filePath1[i].ToString(), filePath2[j].ToString(), 
                m_CultureInfo, m_CompareOptions );

to enable the use of CultureInfo and CompareOptions. However this slows down everything compared to a version using a plain char.CompareTo() (about factor 2). But, according to my own SO question this seems to be the way to go.

public sealed class ExtensionComparer : IComparer<string>
{
    private readonly CultureInfo m_CultureInfo;
    private readonly CompareOptions m_CompareOptions;

    public ExtensionComparer() : this( CultureInfo.CurrentUICulture, CompareOptions.None ) {}

    public ExtensionComparer( CultureInfo cultureInfo, CompareOptions compareOptions )
    {
        m_CultureInfo = cultureInfo;
        m_CompareOptions = compareOptions;
    }

    public int Compare( string filePath1, string filePath2 )
    {
        if( filePath1 == null || filePath2 == null )
        {
            if( filePath1 != null )
            {
                return 1;
            }
            if( filePath2 != null )
            {
                return -1;
            }
            return 0;
        }

        var i = filePath1.LastIndexOf( '.' ) + 1;
        var j = filePath2.LastIndexOf( '.' ) + 1;

        if( i == 0 || j == 0 )
        {
            if( i != 0 )
            {
                return 1;
            }
            return j != 0 ? -1 : 0;
        }

        while( true )
        {
            if( i == filePath1.Length || j == filePath2.Length )
            {
                if( i != filePath1.Length )
                {
                    return 1;
                }
                return j != filePath2.Length ? -1 : 0;
            }
            var compareResults = string.Compare( filePath1[i].ToString(), filePath2[j].ToString(), m_CultureInfo, m_CompareOptions );
            //var compareResults = filePath1[i].CompareTo( filePath2[j] );
            if( compareResults != 0 )
            {
                return compareResults;
            }
            i++;
            j++;
        }
    }
}

Usage:

fileNames1.Sort( new ExtensionComparer( CultureInfo.GetCultureInfo( "sv-SE" ),
                    CompareOptions.StringSort ) );
tanascius
A: 

the main problem here is that you are calling Path.GetExtension multiple times for each path. if this is doing a quicksort then you could expect Path.GetExtension to be called anywhere from log(n) to n times where n is the number of items in your list for each item in the list. So you are going to want to cache the calls to Path.GetExtension.

if you were using linq i would suggest something like this:

filenames.Select(n => new {name=n, ext=Path.GetExtension(n)})
         .OrderBy(t => t.ext).ToArray();

this ensures that Path.GetExtension is only called once for each filename.

luke
+1  A: 

Not the most memory efficient but the fastest according to my tests:

SortedDictionary<string, List<string>> dic = new SortedDictionary<string, List<string>>();
foreach (string fileName in fileNames)
{
   string extension = Path.GetExtension(fileName);
   List<string> list;
   if (!dic.TryGetValue(extension, out list))
   {
      list = new List<string>();
      dic.Add(extension, list);
   }
   list.Add(fileName);
}
string[] arr = dic.Values.SelectMany(v => v).ToArray();

Did a mini benchmark on 800k randomly generated 8.3 filenames:

Sorting items with Linq to Objects... 00:00:04.4592595

Sorting items with SortedDictionary... 00:00:02.4405325

Sorting items with Array.Sort... 00:00:06.6464205

Julien Lebosquain
Are you sure about `list` declaration and assigment? Will if fail if `TryGetValue` returns `true`?
abatishchev
If the extension is already added, `TryGetValue` will return true and list will be set so we're good to add an item to it, otherwise we simply create a new list and add it to the dictionary.
Julien Lebosquain