I have an NSString like so:

@"200hello"

or

@"0 something"

What I would like to do is take the first occurring number in the NSString and convert it into an int.

So that @"200hello" would become int = 200.

and @"0 something" would become int = 0.

+2  A: 

I would probably use a regular expression (implemented with the stellar RegexKitLite). Then it'd be something like:

#import "RegexKitLite.h"
NSString * original = @"foo 220hello";
NSString * number = [original stringByMatching:@"[^\\d]*(\\d+)" capture:1];
return [number integerValue];

The regex @"[^\d]*(\d+)" means "any number of non-numeric characters followed by at least one numeric character".
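
To make the pattern's behaviour concrete, here is a rough plain-C sketch of the same scan (plain C also compiles inside an Objective-C .m file). It only illustrates the idea and is not how RegexKitLite or ICU work internally. The function name first_number is made up for this example, and it handles ASCII digits only, whereas \d in a regex engine may also match other Unicode decimal digits.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: mimics "[^\d]*(\d+)" for ASCII input.
 * Skips non-digits, then converts the first run of digits.
 * Returns false when the string contains no digit at all. */
static bool first_number(const char *s, long *out)
{
    /* [^\d]* : skip anything that is not a digit */
    while (*s != '\0' && !isdigit((unsigned char)*s)) {
        s++;
    }
    if (*s == '\0') {
        return false;            /* (\d+) needs at least one digit */
    }
    /* (\d+) : strtol stops at the first non-digit character */
    *out = strtol(s, NULL, 10);
    return true;
}
```

With this, first_number("foo 220hello", &n) stores 220 in n, while first_number("hello", &n) returns false and leaves n untouched.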

Dave DeLong
downvote for overkill.
Nikolai Ruhe
Well, I'm not going to downvote because you may have RegexKitLite in your project, but for my purposes, yes, it's a bit too much. Thanks anyway. +1
Brock Woolf
No downvote here, but I agree that NSScanner is a much cleaner way to do this. Particularly because you would have to write a new regex for different number types, whereas with NSScanner you could just switch to scanFloat: or scanDouble: etc.
Quinn Taylor
+10  A: 
int value;
// scanInt: takes an int *; scanInteger: would need an NSInteger * instead.
BOOL success = [[NSScanner scannerWithString:@"1000safkaj"] scanInt:&value];

If the number is not always at the beginning:

NSCharacterSet* nonDigits = [[NSCharacterSet decimalDigitCharacterSet] invertedSet];
int value = [[@"adfsdg1000safkaj" stringByTrimmingCharactersInSet:nonDigits] intValue];
Nikolai Ruhe
This only works if the number is the first part of the string. Granted, that's what the example seems to imply, but my answer will work even if there are letters before the numbers. Yours won't. =)
Dave DeLong
Looks like a good answer.
Brock Woolf
Edited for complete answer.
Nikolai Ruhe
Nice workaround for characters before and after. I wasn't aware of the -invertedSet method on NSCharacterSet. +1
Dave DeLong
Now I'm feeling bad for down voting your answer.
Nikolai Ruhe
You can always "un-down vote" his answer by clicking on the down arrow again.
Brock Woolf
OK, un-down-vote done, you talked me into this. But don't come complaining when people start building nuclear power plants when their iPhone batteries run low.
Nikolai Ruhe
+5  A: 

If the int value is always at the beginning of the string, you can simply use intValue.

NSString *string = @"123hello";
int myInt = [string intValue];
zpasternack
Again, just like @Nikolai Ruhe's answer, this will only work if there aren't any letters before the number. This might be the case, but the question says "the first occurring number", which means there might be letters before it. =)
Dave DeLong
Yes, but the asker's examples show the number always at the beginning. And technically, the asker is confusing "int" and "Integer", so he might also want to use -[NSString integerValue] instead.
Quinn Taylor
simple and works great.
Shivan Raptor
+5  A: 

Steve Ciarcia once said a single measured result is worth more than a hundred engineers' opinions. And so begins the first, and last, "How to get an int value from an NSString" cook-off!

The following are the contenders: (microseconds taken and number of bytes used per match using the incredibly high precision for(x=0; x<100000; x++) {} micro-benchmark that has been handed down through the generations. Time measured via getrusage(), bytes used via malloc_size(). The string to be matched was normalized to 'foo 2020hello' for all cases, except those that required the number to be at the start. All conversions were normalized to 'int'. The two numbers after the time are normalized results relative to the best and worst performers.)

EDIT: These were the original numbers posted; see below for updated numbers. Also, times are from a 2.66 GHz Core 2 MacBook Pro.

characterSet   time: 1.36803us 12.5 / 1.00 memory: 64 bytes (via Nikolai Ruhe)
original RKL   time: 1.20686us 11.0 / 0.88 memory: 16 bytes (via Dave DeLong)
modified RKL   time: 1.07631us  9.9 / 0.78 memory: 16 bytes (me, changed regex to \d+)
scannerScanInt time: 0.49951us  4.6 / 0.36 memory: 32 bytes (via Nikolai Ruhe)
intValue       time: 0.16739us  1.5 / 0.12 memory:  0 bytes (via zpasternack)
rklIntValue    time: 0.10925us  1.0 / 0.08 memory:  0 bytes (me, modified RKL example)

As I noted somewhere else in this message, I originally threw this in to a unit test harness I use for RegexKitLite. Well, being the unit test harness meant that I was testing with my private copy of RegexKitLite... which just so happened to have a bunch of debug stuff tacked on while tracking down a bug report from a user. The above timing results are approximately equivalent to calling [valueString flushCachedRegexData]; inside the for() {} timing loop (which was essentially what the inadvertent debugging stuff was doing). The following results are from compiling against the latest, unmodified, RegexKitLite available (3.1):

characterSet   time: 1.36803us 12.5 / 1.00 memory: 64 bytes (via Nikolai Ruhe)
original RKL   time: 0.58446us  5.3 / 0.43 memory: 16 bytes (via Dave DeLong)
modified RKL   time: 0.54628us  5.0 / 0.40 memory: 16 bytes (me, changed regex to \d+)
scannerScanInt time: 0.49951us  4.6 / 0.36 memory: 32 bytes (via Nikolai Ruhe)
intValue       time: 0.16739us  1.5 / 0.12 memory:  0 bytes (via zpasternack)
rklIntValue    time: 0.10925us  1.0 / 0.08 memory:  0 bytes (me, modified RKL example)

This is slightly better than a 50% improvement. If you're willing to live slightly dangerously, you can coax a bit more speed out with the -DRKL_FAST_MUTABLE_CHECK compile time option:

original RKL   time: 0.51188us  4.7 / 0.37 memory: 16 bytes using intValue
modified RKL   time: 0.47665us  4.4 / 0.35 memory: 16 bytes using intValue
original RKL   time: 0.44337us  4.1 / 0.32 memory: 16 bytes using rklIntValue
modified RKL   time: 0.42128us  3.9 / 0.31 memory: 16 bytes using rklIntValue

This is usually good for about another 10% boost, and it's fairly safe to use (for more info, see the RKL docs). And while I was at it... why not use the faster rklIntValue too? Is there some kind of prize for beating the native, built in Foundation methods using an external, third party, non-integrated general purpose regex pattern matching engine? Don't believe the hype that "regexes are slow".

END EDIT

The RegexKitLite example can be found at RegexKitLite Fast Hex Conversion. Basically swapped strtoimax for strtol, and added a line of code to skip over leading characters that weren't [+-0-9]. (full disclosure: I'm the author of RegexKitLite)

Both 'scannerScanInt' and 'intValue' suffer from the problem that the number to be extracted must be at the start of the string. I think both will skip any leading white-space.
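
The "number must come first" behaviour is the same one the C library's strtol() has, which makes it easy to demonstrate in plain C (valid in an Objective-C file too). This is only an analogy: whether intValue or NSScanner call strtol() internally is an assumption, and leading_int is a name invented here.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse a number only if it leads the string.
 * strtol() skips leading white space, then reads an optional sign and
 * digits; if nothing could be converted it returns 0. */
static int leading_int(const char *s)
{
    long v = strtol(s, NULL, 10);

    /* Clamp to int range so the narrowing conversion is well defined. */
    if (v > INT_MAX) return INT_MAX;
    if (v < INT_MIN) return INT_MIN;
    return (int)v;
}
```

leading_int("2020hello") and leading_int("  2020hello") both give 2020, but leading_int("foo 2020hello") gives 0 because the parse stops at the 'f'.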

I modified Dave DeLong's regex from '[^\d]*(\d+)' to just '\d+' because that's all that's really needed, and it manages to get rid of a capture group usage to boot.

So, based on the above data, I offer the following recommendations:

There are basically two different capability classes here: those that can tolerate extra 'stuff' and still get you the number (characterSet, RegexKitLite matchers, and rklIntValue), and those that basically need the number to be the very first thing in the string, tolerating at most some white space padding at the start (scannerScanInt and intValue).

Do not use NSCharacterSet to do these kinds of things. For the given example, 16 bytes are used to instantiate the first NSCharacterSet, then 32 bytes for the inverted version, and finally 16 bytes for the string result. The fact that a general purpose regex engine outperforms it by a double digit percentage margin while using less memory pretty much seals the deal.

(keep in mind I wrote RegexKitLite, so take the following with whatever sized grain of salt you feel is appropriate).

RegexKitLite turns in good times and uses the smallest amount of memory possible considering the fact that it's returning an NSString object. Since it uses an LRU cache internally for all the ICU regex engine stuff, those costs get amortized over time and repeated uses. It also takes seconds to change the regex if the need comes up (hex values? hex floats? Currencies? Dates? No problem.)

For the simple matchers, it should be obvious that you definitely should NOT use NSScanner to do these kinds of things. Using NSScanner to do a 'scanInt:' is no different than just calling [aString intValue]. They produce the same results with the same caveats. The difference is that NSScanner takes roughly THREE times longer to do the same thing (0.50us vs. 0.17us in the tables above), while wasting 32 bytes of memory in the process. [aString intValue], on the other hand, (probably) doesn't require one byte of memory to perform its magic: it probably just calls strtoimax() (or an equivalent), since it has direct access to the pointer holding the string's contents.

The final one is 'rklIntValue', which again is just a slightly tweaked version of what you can find at (the 'RegexKitLite Fast Hex Conversion' link above, Stack Overflow won't let me post it twice). It uses CoreFoundation to try to get direct access to the string's buffer, and failing that, allocates some space off the stack and copies a chunk of the string to that buffer. This takes all of, oh, three instructions on the CPU, and is fundamentally impossible to 'leak' like a malloc() allocation. So it uses zero memory and goes very, very fast. As an extra bonus, you pass to strtoXXX() the number base of the string to convert: 10 for decimal, 16 for hex (automatically swallowing a leading 0x if present), or 0 for automagic detection. It's a trivial, single line of code to skip the pointer over any 'uninteresting' characters until you get to what you want (I chose -, +, and 0-9). It's also trivial to swap in something like strtod() if you need to parse double values. strtod() converts just about any valid floating point text: NAN, INF, hex floats, you name it.
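
The skip-then-convert idea and the effect of the base argument can be sketched in plain C, without the CoreFoundation buffer handling. The helper names (skip_to_number, first_integer, first_double) are invented for this sketch and are not part of RegexKitLite.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: advance to the first '+', '-' or digit
 * (or to the terminating NUL if there is none). */
static const char *skip_to_number(const char *s)
{
    while (*s != '\0' && !((*s >= '0' && *s <= '9') || *s == '-' || *s == '+')) {
        s++;
    }
    return s;
}

/* base: 10 for decimal, 16 for hex, 0 to auto-detect from the prefix. */
static intmax_t first_integer(const char *s, int base)
{
    return strtoimax(skip_to_number(s), NULL, base);
}

/* Same idea for floating point, using strtod(). */
static double first_double(const char *s)
{
    return strtod(skip_to_number(s), NULL);
}
```

first_integer("foo 2020hello", 10) yields 2020 and first_integer("id 0x1F", 16) yields 31. Two caveats are worth knowing. With base 0 a leading zero selects octal, so first_integer("code 010", 0) is 8 rather than 10. And the skip stops at any '-' or '+', so a stray sign that is not followed by a digit (as in "foo-bar 12") makes the conversion return 0.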

EDIT:

Per request of the OP, here's a trimmed and minified version of the code that I used to perform the tests. One thing of note: while putting this together, I noticed that Dave DeLong's original regex didn't quite work. The problem is in the negated character set: meta-character sequences inside sets (i.e., [^\d]*) mean the literal character, not the special meaning they have outside the character set. I replaced it with [^\p{DecimalNumber}]*, which has the intended effect.

I originally bolted this stuff to a RegexKitLite unit test harness, so I left some bits and pieces for GC in. I forgot all about this, but the short version of what happens when GC is turned on is that times of everything BUT RegexKitLite double (that is, takes twice as long). RKL only takes about 75% longer (and that took an enormous, non-trivial amount of effort to get when I was developing it). The rklIntValue time stays exactly the same.

Compile with

shell% gcc -DNS_BLOCK_ASSERTIONS -mdynamic-no-pic -std=gnu99 -O -o stackOverflow stackOverflow.m RegexKitLite.m -framework Foundation -licucore -lauto

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <stdint.h>
#include <inttypes.h>   /* declares strtoimax() */
#include <sys/time.h>
#include <sys/resource.h>
#include <objc/objc-auto.h>
#include <malloc/malloc.h>

#import <Foundation/Foundation.h>
#import "RegexKitLite.h"

static double cpuTimeUsed(void);
static double cpuTimeUsed(void) {
  struct rusage currentRusage;

  getrusage(RUSAGE_SELF, &currentRusage);
  double userCPUTime   = ((((double)currentRusage.ru_utime.tv_sec) * 1000000.0) + ((double)currentRusage.ru_utime.tv_usec)) / 1000000.0;
  double systemCPUTime = ((((double)currentRusage.ru_stime.tv_sec) * 1000000.0) + ((double)currentRusage.ru_stime.tv_usec)) / 1000000.0;
  double CPUTime = userCPUTime + systemCPUTime;
  return(CPUTime);
}

@interface NSString (IntConversion)
-(int)rklIntValue;
@end

@implementation NSString (IntConversion)

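-(int)rklIntValue
{
  CFStringRef cfSelf = (CFStringRef)self;
  UInt8 buffer[64];
  const char *cptr, *optr;
  char c;

  // Fast path: try to get a direct pointer to the string's internal C buffer.
  if((cptr = optr = CFStringGetCStringPtr(cfSelf, kCFStringEncodingMacRoman)) == NULL) {
    // Slow path: copy at most 60 bytes of UTF-8 into the stack buffer and NUL terminate it.
    CFRange range     = CFRangeMake(0L, CFStringGetLength(cfSelf));
    CFIndex usedBytes = 0L;
    CFStringGetBytes(cfSelf, range, kCFStringEncodingUTF8, '?', false, buffer, 60L, &usedBytes);
    buffer[usedBytes] = 0U;
    cptr = optr       = (const char *)buffer;
  }

  // Skip ahead to the first '+', '-' or digit, looking at no more than the first 60 bytes.
  // The NUL check keeps the scan from running past the end of a string that has no number in it.
  while(((cptr - optr) < 60) && ((c = *cptr) != '\0') && (!(((c >= '0') && (c <= '9')) || (c == '-') || (c == '+')))) { cptr++; }
  // Base 0 auto-detects the radix: a 0x prefix means hex, a leading 0 means octal.
  return((int)strtoimax(cptr, NULL, 0));
}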

@end

int main(int argc __attribute__((unused)), char *argv[] __attribute__((unused))) {
  NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

#ifdef __OBJC_GC__
  objc_start_collector_thread();
  objc_clear_stack(OBJC_CLEAR_RESIDENT_STACK);
  objc_collect(OBJC_EXHAUSTIVE_COLLECTION | OBJC_WAIT_UNTIL_DONE);
#endif

  BOOL gcEnabled = ([objc_getClass("NSGarbageCollector") defaultCollector] != NULL) ? YES : NO;
  NSLog(@"Garbage Collection is: %@", gcEnabled ? @"ON" : @"OFF");
  NSLog(@"Architecture: %@", (sizeof(void *) == 4UL) ? @"32-bit" : @"64-bit");

  double      startTime = 0.0, csTime = 0.0, reTime = 0.0, re2Time = 0.0, ivTime = 0.0, scTime = 0.0, rklTime = 0.0;
  NSString   *valueString = @"foo 2020hello", *value2String = @"2020hello";
  NSString   *reRegex = @"[^\\p{DecimalNumber}]*(\\d+)", *re2Regex = @"\\d+";
  int         value = 0;
  NSUInteger  x = 0UL;

  {
    NSCharacterSet *digits      = [NSCharacterSet decimalDigitCharacterSet];
    NSCharacterSet *nonDigits   = [digits invertedSet];
    NSScanner      *scanner     = [NSScanner scannerWithString:value2String];
    NSString       *csIntString = [valueString stringByTrimmingCharactersInSet:nonDigits];
    NSString       *reString    = [valueString stringByMatching:reRegex capture:1L];
    NSString       *re2String   = [valueString stringByMatching:re2Regex];

    [scanner scanInt:&value];

    NSLog(@"digits      : %p, size: %lu", digits, malloc_size(digits));
    NSLog(@"nonDigits   : %p, size: %lu", nonDigits, malloc_size(nonDigits));
    NSLog(@"scanner     : %p, size: %lu, int: %d", scanner, malloc_size(scanner), value);
    NSLog(@"csIntString : %p, size: %lu, '%@' int: %d", csIntString, malloc_size(csIntString), csIntString, [csIntString intValue]);
    NSLog(@"reString    : %p, size: %lu, '%@' int: %d", reString, malloc_size(reString), reString, [reString intValue]);
    NSLog(@"re2String   : %p, size: %lu, '%@' int: %d", re2String, malloc_size(re2String), re2String, [re2String intValue]);
    NSLog(@"intValue    : %d", [value2String intValue]);
    NSLog(@"rklIntValue : %d", [valueString rklIntValue]);
  }

  for(x = 0UL, startTime = cpuTimeUsed(); x < 100000UL; x++) { value = [[valueString stringByTrimmingCharactersInSet:[[NSCharacterSet decimalDigitCharacterSet] invertedSet]] intValue]; } csTime = (cpuTimeUsed() - startTime) / (double)x;
  for(x = 0UL, startTime = cpuTimeUsed(); x < 100000UL; x++) { value =  [[valueString stringByMatching:reRegex capture:1L] intValue]; } reTime = (cpuTimeUsed() - startTime) / (double)x;
  for(x = 0UL, startTime = cpuTimeUsed(); x < 100000UL; x++) { value =  [[valueString stringByMatching:re2Regex] intValue]; } re2Time = (cpuTimeUsed() - startTime) / (double)x;
  for(x = 0UL, startTime = cpuTimeUsed(); x < 100000UL; x++) { value =  [valueString rklIntValue]; } rklTime = (cpuTimeUsed() - startTime) / (double)x;
  for(x = 0UL, startTime = cpuTimeUsed(); x < 100000UL; x++) { value = [value2String intValue]; } ivTime = (cpuTimeUsed() - startTime) / (double)x;
  for(x = 0UL, startTime = cpuTimeUsed(); x < 100000UL; x++) { [[NSScanner scannerWithString:value2String] scanInt:&value]; } scTime = (cpuTimeUsed() - startTime) / (double)x;

  NSLog(@"csTime : %.5lfus", csTime * 1000000.0);
  NSLog(@"reTime : %.5lfus", reTime * 1000000.0);
  NSLog(@"re2Time: %.5lfus", re2Time * 1000000.0);
  NSLog(@"scTime : %.5lfus", scTime * 1000000.0);
  NSLog(@"ivTime : %.5lfus", ivTime * 1000000.0);
  NSLog(@"rklTime: %.5lfus", rklTime * 1000000.0);

  [NSString clearStringCache];
  [pool release]; pool = NULL;

  return(0);
}
johne
Wow, excellent thanks for going to the effort to measure the performance! +1
Brock Woolf
Perhaps you could update with the results from my own answer, which I just added. Also, can you add the code you wrote to perform the benchmark, please?
Brock Woolf
A: 

I came up with my own answer, potentially faster and easier than the others provided.

My answer does assume you know the position the number begins and ends though...

NSString *myString = @"21sss";
int numberAtStart = [[myString substringToIndex:2] intValue];

You can get it to work the other way too:

NSString *myString = @"sss22";
int numberAtEnd = [[myString substringFromIndex:3] intValue];
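
If the offset is not known up front, it can be found by scanning. A plain-C sketch for the number-at-the-end case follows. The helper names are invented for this sketch, it handles ASCII digits only, and it works on C strings rather than NSString.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: index where the trailing run of digits begins.
 * Returns strlen(s) when the string does not end in a digit. */
static size_t trailing_digits_start(const char *s)
{
    size_t i = strlen(s);
    while (i > 0 && isdigit((unsigned char)s[i - 1])) {
        i--;
    }
    return i;
}

/* C-string analogue of [[myString substringFromIndex:i] intValue]. */
static long number_at_end(const char *s)
{
    return strtol(s + trailing_digits_start(s), NULL, 10);
}
```

For "sss22" the digit run starts at index 3, matching the hard-coded substringFromIndex:3 above, and number_at_end returns 22. For a string with no trailing digits, such as "21sss", it returns 0.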
Brock Woolf