views:

509

answers:

4

I have a class (colorClass) which holds 2 NSStrings (idNumber and favoriteColor). There is an NSMutableArray (arrayColor) that holds over 50,000 colorClass objects. What is the fastest way to find all duplicate idNumbers from all colorClass objects and return them in an array? Right now I'm using 1 for loop that goes copies arrayColor, then filters the copied array using an NSPredicate. This takes over 5 minutes to sort the array. How can this be done more efficiently?

+1  A: 

Have you thought about using an NSMutableSet instead? Sets don't allow duplicates in the first place, so your problem wouldn't exist. However, a set won't work if the order of the colors matter (since sets have no notion of ordering). I'm not sure of your specific case.

Marc W
Or perhaps an `NSMutableDictionary`, since we're talking about key-value pairs....
Sixten Otto
He's storing an object with a key-value pair, not a raw pair. A dictionary wouldn't make sense.
Marc W
I disagree. A big array of objects, each of which is a key+value, in which he is concerned about finding duplicates, seems an ideal case for a dictionary. Unless there's some compelling need, elsewhere, for having them all in a sequence.
Sixten Otto
Sixten Otto: I suggest posting that as a separate answer.
Peter Hosey
+3  A: 

"Fastest" would require profiling, but my inclination would be to make an NSCountedSet from the array, loop over that and return an array of items from the counted set that have a countForObject: greater than 1.

Chuck
+3  A: 

First question is: does order really matter? If not, then use an NSMutableSet or NSMutableDictionary (depending on what makes sense for your app)

The simplest way to eliminate duplicates is to prevent them from occurring in the first place. Before you add anything to your NSMutableArray, you could check to see if the value already exists. For example:

- (void)addColor:(NSString *)color withID:(NSString *)id {
    NSArray *duplicates = [myArray filteredArrayUsingPredicate:[NSPredicate predicateWithFormat:@"id == %@", id]];
    if ([duplicates count] > 0) {
        // Optionally report an error/throw an exception
        return;
    }
}

Otherwise, you're probably best off getting a list of IDs using valueForKeyPath:, then sorting that array, and then running through it once to look for duplicates. It would go soemthing like this:

- (NSSet *)checkForDuplicateIDs {
    NSArray *allIDs = [myArray valueForKeyPath:@"id"];
    NSArray *sortedIDs = [allIDs sortedArrayUsingSelector:@selector(compare:)];

    NSString *previousID = nil;
    NSMutableSet *duplicateIDs = [NSMutableSet set];
    for (NSString *anID in sortedIDs) {
        if ([previousID isEqualToString:anID]) {
            [duplicateIDs addObject:anID];
        }
        previousID = anID;
    }

    return [[duplicateIDs copy] autorelease];
}

Keep in mind, though, that sorting the list is still, at best, probably an O(n log(n)) operation. If you can at least keep your objects in order in your list, you can avoid the expense of sorting them. Preventing duplicates is best, keeping the list sorted is next best, and the algorithm I gave above is probably the worst.

Alex
A: 

So, to elaborate a little on my earlier comments: it's not clear to me from the question the context in which this data is actually being used. In particular, whether there is a need to keep all of these objects in a big long array. If there is not, then a dictionary might be a better choice of data structure instead of an array.

Since dictionaries are inherently key-value data structures, the ColorClass could probably be eliminated altogether, but I'll assume here that there's some other reason to keep it around, beyond what we know from the question.

If duplicates shouldn't be allowed to occur at all, then the dictionary could store single items, and the code might look something like this:

// colors is an NSMutableDictionary
- (ColorClass*)addColorIfPossible:(ColorClass*)color {
  ColorClass *existingColor = [[colors objectForKey:[color idNumber]] retain];
  if( existingColor == nil ) {
    [colors setObject:color forKey:[color idNumber]];
  }
  return [existingColor autorelease];
}

And if duplicates are allowed, but the need exists to fetch all of the objects with a common ID quickly, then a dictionary of either arrays or sets could work:

// colors is an NSMutableDictionary
- (void)addColor:(ColorClass*)color {
  NSMutableSet *colorSet = [colors objectForKey:[color idNumber]];
  if( !colorSet ) {
    // kInitialSetCapacity is a constant with some reasonable value you choose
    colorSet = [NSMutableSet setWithCapacity:kInitialSetCapacity];
    [colors setObject:colorSet forKey:[color idNumber]];
  }
  [colorSet addObject:color];
}

- (NSSet*)findDuplicatesForID:(NSString*)idNumber {
  // returns nil if no colors with that id, but could
  // return an empty set instead with little effort
  return [[[colors objectForKey:idNumber] copy] autorelease];
}

If there is a need in the app to both have the giant list of colors in overall order, and to quickly find duplicates, then the classic space vs. time tradeoff comes in: use just an array, or maintain both that array and a dictionary like this.

Sixten Otto