views:

519

answers:

3

I have a List containing a lot of paths. I have a specific path I want to check against this list to see if there are any paths there that uses this path, ie:

f.StartsWith(r.FILENAME) && f != r.FILENAME

What would be the fastest way of doing this?

edit: Complete function from answer below:

static bool ContainsFragment(string[] paths, string fragment)
{
    // paths **must** be pre-sorted via Array.Sort(paths);
    if (paths.Length == 0) return false;
    int index = Array.BinarySearch(paths, fragment);
    if (index >= 0 && index+1 < paths.Length)
    { //we found it 
        if (paths[index + 1].StartsWith(fragment) &&
            paths[index + 1].EndsWith(".manifest"))
        {
            return true;
        }
    }
    return false;
}
+5  A: 

The fastest way is probably with binary search:

static bool ContainsFragment(string[] paths, string fragment)
{
    // paths **must** be pre-sorted via Array.Sort(paths);
    if (paths.Length == 0) return false;
    int index = Array.BinarySearch(paths, fragment);
    // we want the index of the *next highest* path
    if (index < 0) { // no match
        index = ~index; 
    } else { // exact match
        index++; // for strict substring (non-equal)
    }
    return index < paths.Length && paths[index].StartsWith(fragment);
}

But the cost of sorting the array will outweigh any benefit if you are only doing it a few times; in which case, just scan the array - either with LINQ etc, or just:

bool found = false;
for(int i = 0 ; i < paths.Length ; i++) {
    if(paths[i].StartsWith(fragment) &&
          paths[i].Length != fragment.Length)
    {
        found = true;
        break;
    }
}
Marc Gravell
The check is going to be done from 50000 to a few hundred thousand times, so the sort pays off. What I want is if there any hits that is not the exact match (there will ALWAYS be an exact match) but a match that starts the same but is not exact the same.
devzero
I still feel this is not the fastest way though, as i belive we are still looping through the array til we find the match we want.
devzero
No, binary search doesn't loop - it halves. So it will search 50000 in 15 steps. I suspect (since there will always be an exact match) you also need to check the *next* item after the match...
Marc Gravell
Answer updated to show how (I think)
Marc Gravell
If you see in my question, I've updated with a function that I believe will work. I don't quite understand what you do with the index = ~index; and the last return statement you have
devzero
It is actually very similar to your code; if it *isn't* found, ~index returns the index of the next item - the same as the "index + 1" in your code. In my "found a match" version I just used index++, but the result is the same.
Marc Gravell
The last return statement is largely identical to your code - simply that you did the bounds checking in the previous statement, where-as I did it on this line. And the original question didn't mention .manifest, which is why I didn't mention it either ;-p
Marc Gravell
+3  A: 
var matches = list.Where(f => f.StartsWith(r.FILENAME) && f != r.FILENAME);

Or if you only care about existence:

bool any = list.Any(f => f.StartsWith(r.FILENAME) && f != r.FILENAME);

This is assuming you're using .NET 3.5, admittedly - otherwise there are similar methods in List<T> and you can use an anonymous method.

Jon Skeet
This is not any faster than doing a simple for loop, maybe even slower.
devzero
+2  A: 

What I find interesting about this relatively simple question is the number of "valid" answers, depending on how you define "fastest."

  • If fastest means "least costly," then Marc's binary search method sounds like your answer.
  • If fastest means "quickest to implement," then Jon's list.Any method call is appropriate.
  • If fastest means "brute force," then you may want to look at parallelizing the search. It will be more costly in terms of processing required, but may execute more quickly depending on your server resources. PLINQ gives you a good starting point for this.
Michael Meadows
You are quite correct. My question is probably a combination of "least costly" and "brute force". In one example here I have 250 000 records that I need to compare (ie foreach record in the set, se if there is another with the same start but longer in the recordset. ) (continued in next comment)
devzero
I need output the records sequentially, but the order of checking is not relevant. So perhaps a combination of Marc's binary search and a multithreaded approach would be best. PLINQ might do this automagicaly for me, but I have my doubts :) Thanks for the tip though.
devzero
PLINQ would just parallelize the iteration, which could speed up certain search algorithms. If the list is already sorted, the binary search is the best option, and can't be parallelized.
Michael Meadows
The sorting, however, can. You might want to pose another question asking about the best sorting algorithm for a 250k record list. The availability of multiple processors/cores will influence this much more heavily than the search.
Michael Meadows