I'm trying to compare names without any punctuation, spaces, accents etc. At the moment I am doing the following:

-(NSString*) prepareString:(NSString*)a {
    //remove any accents and punctuation;
    a=[[[NSString alloc] initWithData:[a dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES] encoding:NSASCIIStringEncoding] autorelease];

    a=[a stringByReplacingOccurrencesOfString:@" " withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"'" withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"`" withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"-" withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"_" withString:@""];
    a=[a lowercaseString];
    return a;
}

However, I need to do this for hundreds of strings and I need to make this more efficient. Any ideas?

Thanks, Deelo

+3  A: 

Consider using the RegexKit framework. You could do something like:

NSString *searchString      = @"This is neat.";
NSString *regexString       = @"[\\W]"; // the backslash must be doubled inside an Objective-C string literal
NSString *replaceWithString = @"";
NSString *replacedString    = [searchString stringByReplacingOccurrencesOfRegex:regexString withString:replaceWithString];

NSLog (@"%@", replacedString);
//... Thisisneat
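
If adding a third-party framework is not an option, a similar single-pass replacement can be sketched with Foundation's own NSRegularExpression class (available on newer OS versions only). This is a swapped-in technique, not the RegexKit call above; the helper name StripNonAlphanumerics and the pattern are assumptions chosen for illustration. Unlike \W, this pattern also removes underscores.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: removes every character that is not a Unicode
// letter or digit, using a single regex pass over the input.
static NSString *StripNonAlphanumerics(NSString *input) {
    NSError *error = nil;
    // \p{L} matches any letter, \p{N} any number; ^ negates the class.
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"[^\\p{L}\\p{N}]+"
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        return input; // pattern failed to compile; hand back the input unchanged
    }
    return [regex stringByReplacingMatchesInString:input
                                           options:0
                                             range:NSMakeRange(0, [input length])
                                      withTemplate:@""];
}
```

If many strings are processed, creating the NSRegularExpression once and reusing it would probably be cheaper than compiling it on every call.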
Alex Reynolds
How do I use regex to remove all punctuation without having several statements? I'm trying to avoid going over the string several times.
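You only need to go over the original string once. The regex ("regular expression") removes all punctuation in a single pass, replacing every non-word character with an empty string (""). Note that \W treats letters, digits, and the underscore as word characters, so underscores are kept.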
Alex Reynolds
+1  A: 

You could use an NSCharacterSet and the characterIsMember: method to check every character of the target string and build a new string from the result.

// Start with string to filter and an empty mutable string to build into
NSString *stringToFilter = @"filter-me";
NSMutableString *targetString = [NSMutableString string];

// Define the character set that's OK to use
NSCharacterSet *okCharacterSet = 
  [NSCharacterSet characterSetWithCharactersInString:@"abcdefghijklmnopqrstuvwxyz"];

// Iterate over characters in the string, checking each
for(NSUInteger i = 0; i < [stringToFilter length]; i++) {
    unichar currentChar = [stringToFilter characterAtIndex:i];
    if([okCharacterSet characterIsMember:currentChar]) {
        [targetString appendFormat:@"%C", currentChar];
    }
}

// targetString now contains the filtered string

Disclaimers: I have neither tested this code nor used it before, so I can't speak for its accuracy or efficiency. But it's an option.

Tim
Great, man. My problem was similar to this, and your answer helped me a lot.
Ranjeet Sajwan
+8  A: 
NSString* finish = [[start componentsSeparatedByCharactersInSet:[[NSCharacterSet letterCharacterSet] invertedSet]] componentsJoinedByString:@""];
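
For the comparison described in the question, the one-liner above could be wrapped roughly as follows. This is only a sketch: the helper name NormalizedNameKey is made up, and folding accents and case with stringByFoldingWithOptions:locale: is an extra step assumed here, not part of the line above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: builds a comparison key from a name by folding
// accents and case, then dropping everything that is not a letter.
static NSString *NormalizedNameKey(NSString *name) {
    // Fold diacritics and case in one call; nil selects the system locale.
    NSString *folded =
        [name stringByFoldingWithOptions:(NSDiacriticInsensitiveSearch | NSCaseInsensitiveSearch)
                                  locale:nil];

    // Split on every non-letter character and glue the pieces back together.
    NSCharacterSet *nonLetters = [[NSCharacterSet letterCharacterSet] invertedSet];
    return [[folded componentsSeparatedByCharactersInSet:nonLetters]
               componentsJoinedByString:@""];
}
```

Two names can then be compared with isEqualToString: on their keys.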
Peter N Lewis
+2  A: 

Consider using NSScanner, and specifically the methods -setCharactersToBeSkipped: (which accepts an NSCharacterSet) and -scanString:intoString: (which accepts a string and returns the scanned string by reference).
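
A minimal sketch of the scanner idea, assuming you want to keep letters only. Note that it uses -scanCharactersFromSet:intoString: rather than -scanString:intoString:, because it collects runs of allowed characters; the helper name LettersOnlyUsingScanner is invented for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: keeps only letters by letting the scanner skip
// everything else between runs of letters.
static NSString *LettersOnlyUsingScanner(NSString *input) {
    NSCharacterSet *letters = [NSCharacterSet letterCharacterSet];
    NSScanner *scanner = [NSScanner scannerWithString:input];

    // Anything that is not a letter is skipped before each scan.
    [scanner setCharactersToBeSkipped:[letters invertedSet]];

    NSMutableString *result = [NSMutableString string];
    NSString *run = nil;
    // Each successful scan returns one contiguous run of letters.
    while ([scanner scanCharactersFromSet:letters intoString:&run]) {
        [result appendString:run];
    }
    return result;
}
```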

You may also want to couple this with -[NSString localizedCompare:], or perhaps -[NSString compare:options:] with the NSDiacriticInsensitiveSearch option. That could simplify having to remove/replace accents, so you can focus on removing punctuation, whitespace, etc.
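
As a rough illustration of the compare:options: route, the comparison itself can ignore case and accents, so only punctuation and whitespace need stripping beforehand. The function name below is hypothetical.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES when the two strings differ only by case
// and/or diacritics. Punctuation and spaces are NOT ignored here.
static BOOL NamesMatchIgnoringCaseAndAccents(NSString *a, NSString *b) {
    NSStringCompareOptions options =
        NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch;
    return [a compare:b options:options] == NSOrderedSame;
}
```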

If you must use an approach like you presented in your question, at least use an NSMutableString and replaceOccurrencesOfString:withString:options:range: — that will be much more efficient than creating tons of nearly-identical autoreleased strings. It could be that just reducing the number of allocations will boost performance "enough" for the time being.
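
A sketch of that in-place variant, reusing the same list of characters as the question. The helper name StripSeparators is an assumption, and accent handling is left out here.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: removes the separator characters from the question
// by editing a single mutable string instead of creating a new string per step.
static NSString *StripSeparators(NSString *input) {
    NSMutableString *result = [NSMutableString stringWithString:input];
    NSArray *unwanted = [NSArray arrayWithObjects:@" ", @"'", @"`", @"-", @"_", nil];

    for (NSString *token in unwanted) {
        // The range is recomputed each time because the string shrinks.
        [result replaceOccurrencesOfString:token
                                withString:@""
                                   options:NSLiteralSearch
                                     range:NSMakeRange(0, [result length])];
    }
    return [result lowercaseString];
}
```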

Quinn Taylor
+4  A: 

Before using any of these solutions, don't forget to use decomposedStringWithCanonicalMapping to decompose any accented letters. This will turn, for example, é (U+00E9) into e ‌́ (U+0065 U+0301). Then, when you strip out the non-alphanumeric characters, the unaccented letters will remain.

The reason this is important is that you probably don't want, say, “fréd” and “früd”* to be treated as the same. If you strip out all accented letters, as some of these solutions may do, both become “frd”, so those strings will compare as equal.

So, you should decompose them first, so that you can strip the accents and leave the letters.
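
Something like the following sketch shows the idea; the helper name StripAccents is made up. One caveat worth checking for your case: letterCharacterSet and alphanumericCharacterSet are documented as including combining marks (Unicode category M*), so this sketch removes the marks explicitly via nonBaseCharacterSet instead of relying on a letters-only filter to drop them.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: decomposes accented letters into base letter plus
// combining mark, then removes the combining marks.
static NSString *StripAccents(NSString *input) {
    // U+00E9 becomes U+0065 U+0301, and so on.
    NSString *decomposed = [input decomposedStringWithCanonicalMapping];

    // nonBaseCharacterSet covers the combining marks left by decomposition.
    NSCharacterSet *marks = [NSCharacterSet nonBaseCharacterSet];
    return [[decomposed componentsSeparatedByCharactersInSet:marks]
               componentsJoinedByString:@""];
}
```

With this, the two made-up words stay distinct: one ends up with an e and the other with a u, instead of both collapsing to the same letters.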

*Made-up words, as I only know English. If somebody can offer a real example, I'd be happy to edit it in.

Peter Hosey
Français -> Franais
Mk12
I think Peter is trying to demonstrate 2 words with the same letters and different accents. :-)
Quinn Taylor
Quinn Taylor: Yup.
Peter Hosey
A: 

I wish people would stop answering this with "Use the RegEx framework." If you check, you cannot use any other frameworks or APIs anymore. So please post a viable response. I have been searching for days now for a simple way, and nobody gives a good coded response.

Tom