Let's assume I can have the following strings:

"hey @john..."
"@john, hello"
"@john(hello)"

I am tokenizing the string to get every word separated by a space:

[myString componentsSeparatedByString:@" "];

My array of tokens now contains:

@john...
@john,
@john(hello)

I am checking for punctuation marks as follows:

NSRange textRange = [words rangeOfString:@","];
if(textRange.location != NSNotFound){ } //do something

For these cases, how can I make sure only @john is tokenized, while retaining the trailing characters:

...
,
(hello)

Note: I would like to be able to handle all cases of characters at the end of a string. The above are just 3 examples.

+1  A: 

See NSString's -rangeOfString:options:range:... Give it a range of { [myString length] - [searchString length], [searchString length] } and check the resulting range's location: if it is not NSNotFound, myString ends with searchString. See the NSStringCompareOptions options in the docs for case sensitivity, etc.
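
As a rough sketch of that suggestion, the range-limited search might be wrapped as follows. The helper name StringEndsWith is hypothetical (it is not part of Foundation or of this answer), and the length guard is an added assumption so that the range start cannot wrap around when the search string is longer than the string being searched.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original answer): returns YES when
// `string` ends with `suffix`, using a search restricted to the tail range.
static BOOL StringEndsWith(NSString *string, NSString *suffix)
{
    // Guard first: NSUInteger is unsigned, so subtracting a longer suffix
    // length would wrap around and produce an out-of-bounds range.
    if ([suffix length] == 0 || [suffix length] > [string length]) {
        return NO;
    }

    // Only the last [suffix length] characters are searched.
    NSRange tail = NSMakeRange([string length] - [suffix length], [suffix length]);

    // Pass NSCaseInsensitiveSearch etc. in options if needed.
    NSRange found = [string rangeOfString:suffix options:0 range:tail];
    return found.location != NSNotFound;
}
```

For example, StringEndsWith(@"@john,", @",") would be expected to return YES. Note that this only helps when the trailing string is known in advance; NSString's -hasSuffix: covers the simple case-sensitive check as well.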

Joshua Nozzi
A: 

You could use NSScanner and NSCharacterSet to do this. NSScanner can scan a string up to the first occurrence of a character in a set. If you get the +alphanumericCharacterSet and then call -invertedSet on it, you'll get a set of all non-alphanumeric characters.

This is probably not super-efficient but it will work:

NSArray* strings = [NSArray arrayWithObjects:
                    @"hey @john...",
                    @"@john, hello",
                    @"@john(hello)",
                    nil];

//get the characters we want to skip, which is everything except letters and numbers
NSCharacterSet* illegalChars = [[NSCharacterSet alphanumericCharacterSet] invertedSet];


for(NSString* currentString in strings)
{
    //this stores the tokens for the current string
    NSMutableArray* tokens = [NSMutableArray array];

    //split the string into unparsed tokens
    NSArray* split = [currentString componentsSeparatedByString:@" "];

    for(NSString* currentToken in split)
    {
        //we only want tokens that start with an @ symbol
        if([currentToken hasPrefix:@"@"])
        {
            NSString* token = nil;

            //start a scanner from the first character after the @ symbol
            NSScanner* scanner = [NSScanner scannerWithString:[currentToken substringFromIndex:1]];
            //keep scanning until we hit an illegal character
            [scanner scanUpToCharactersFromSet:illegalChars intoString:&token];

            //get the rest of the string
            NSString* suffix = [currentToken substringFromIndex:[scanner scanLocation] + 1];

            if(token)
            {
                //store the token in a dictionary
                NSDictionary* tokenDict = [NSDictionary dictionaryWithObjectsAndKeys:
                                           [@"@" stringByAppendingString:token], @"token", //prepend the @ symbol that we skipped
                                           suffix, @"suffix",
                                           nil];
                [tokens addObject:tokenDict];
            }
        }
    }
    //output
    for(NSDictionary* dict in tokens)
    {
        NSLog(@"Found token: %@ additional characters: %@",[dict objectForKey:@"token"],[dict objectForKey:@"suffix"]);
    }
}
Rob Keniger
Nice solution. While this works, and can detect nonalphanumerics in my string, I still need to be able to retain the alphanumeric characters for use later.
Sheehan Alam
I've modified the example to also store the additional characters.
Rob Keniger
A: 

Are you sure CFStringTokenizer or its new Snow-Leopard-only Cocoa equivalent wouldn't be a better fit?

Splitting on just spaces is a very naïve way to tokenize, as you've found. CFStringTokenizer and enumerateSubstrings… are much smarter about real human-language lexical rules.
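
A minimal sketch of the word-enumeration approach, assuming the Cocoa method meant here is NSString's -enumerateSubstringsInRange:options:usingBlock: with NSStringEnumerationByWords (available from Mac OS X 10.6). The helper name WordsInString is hypothetical. Be aware that word enumeration reports words without the surrounding punctuation, so the @ prefix and trailing characters would have to be recovered from the ranges passed to the block.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: collects the words that the system's
// word-boundary rules find in `string`.
static NSArray *WordsInString(NSString *string)
{
    NSMutableArray *words = [NSMutableArray array];

    [string enumerateSubstringsInRange:NSMakeRange(0, [string length])
                               options:NSStringEnumerationByWords
                            usingBlock:^(NSString *substring,
                                         NSRange substringRange,
                                         NSRange enclosingRange,
                                         BOOL *stop) {
        // substringRange locates the word in the original string, so the
        // caller could inspect the characters just before and after it.
        [words addObject:substring];
    }];

    return words;
}
```

Calling WordsInString(@"@john, hello") should yield the individual words rather than space-delimited chunks such as "@john,".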

Peter Hosey