Steve Ciarcia once said a single measured result is worth more than a hundred engineers opinions. And so begins the first, and last, "How to get an int value from a NSString" cook-off!
The following are the contenders: (microseconds taken and number of bytes used per match using the incredibly high precision for(x=0; x<100000; x++) {} micro-benchmark that has been handed down through the generations. Time measured via getrusage(), bytes used via malloc_size(). The string to be matched was normalized to 'foo 2020hello' for all cases, except those that required the number to be at the start. All conversions were normalized to 'int'. The two numbers after the time are normalized results relative to the best and worst performers.)
EDIT: These were the original numbers posted, see below for updated numbers. Also, times are from a 2.66 Core2 macbook pro.
characterSet time: 1.36803us 12.5 / 1.00 memory: 64 bytes (via Nikolai Ruhe)
original RKL time: 1.20686us 11.0 / 0.88 memory: 16 bytes (via Dave DeLong)
modified RKL time: 1.07631us 9.9 / 0.78 memory: 16 bytes (me, changed regex to \d+)
scannerScanInt time: 0.49951us 4.6 / 0.36 memory: 32 bytes (via Nikolai Ruhe)
intValue time: 0.16739us 1.5 / 0.12 memory: 0 bytes (via zpasternack)
rklIntValue time: 0.10925us 1.0 / 0.08 memory: 0 bytes (me, modified RKL example)
As I noted somewhere else in this message, I originally threw this in to a unit test harness I use for RegexKitLite. Well, being the unit test harness meant that I was testing with my private copy of RegexKitLite... which just so happened to have a bunch of debug stuff tacked on while tracking down a bug report from a user. The above timing results are approximately equivalent to calling [valueString flushCachedRegexData];
inside the for() {} timing loop (which was essentially what the inadvertent debugging stuff was doing). The following results are from compiling against the latest, unmodified, RegexKitLite available (3.1):
characterSet time: 1.36803us 12.5 / 1.00 memory: 64 bytes (via Nikolai Ruhe)
original RKL time: 0.58446us 5.3 / 0.43 memory: 16 bytes (via Dave DeLong)
modified RKL time: 0.54628us 5.0 / 0.40 memory: 16 bytes (me, changed regex to \d+)
scannerScanInt time: 0.49951us 4.6 / 0.36 memory: 32 bytes (via Nikolai Ruhe)
intValue time: 0.16739us 1.5 / 0.12 memory: 0 bytes (via zpasternack)
rklIntValue time: 0.10925us 1.0 / 0.08 memory: 0 bytes (me, modified RKL example)
This is slightly better than a 50% improvement. If you're willing to live slightly dangerously, you can coax a bit more speed out with the -DRKL_FAST_MUTABLE_CHECK
compile time option:
original RKL time: 0.51188us 4.7 / 0.37 memory: 16 bytes using intValue
modified RKL time: 0.47665us 4.4 / 0.35 memory: 16 bytes using intValue
original RKL time: 0.44337us 4.1 / 0.32 memory: 16 bytes using rklIntValue
modified RKL time: 0.42128us 3.9 / 0.31 memory: 16 bytes using rklIntValue
This is usually good for about another 10% boost, and it's fairly safe to use (for more info, see the RKL docs). And while I was at it... why not use the faster rklIntValue too? Is there some kind of prize for beating the native, built in Foundation methods using an external, third party, non-integrated general purpose regex pattern matching engine? Don't believe the hype that "regexes are slow".
END EDIT
The RegexKitLite example can be found at RegexKitLite Fast Hex Conversion. Basically swapped strtoimax for strtol, and added a line of code to skip over leading characters that weren't [+-0-9]. (full disclosure: I'm the author of RegexKitLite)
Both 'scannerScanInt' and 'intValue' suffer from the problem that the number to be extracted must be at the start of the string. I think both will skip any leading white-space.
I modified Dave DeLongs regex from '[^\d]*(\d+)' to just '\d+' because that's all that's really needed, and it manages to get rid of a capture group usage to boot.
So, based on the above data, I offer the following recommendations:
There's basically two different capability classes here: Those that can tolerate extra 'stuff' and still get you the number (characterSet, RegexKitLite matchers, and rklIntValue), and those that basically need the number to be the very first thing in the string, tolerating at most some white space padding at the start (scannerScanInt and intValue).
Do not use NSCharacterClass to do these kinds of things. For the given example, 16 bytes is used to instantiate the first NSCharacterClass, then 32 bytes for the inverted version, and finally 16 bytes for the string result. The fact that a general purpose regex engine outperforms it by a double digit percentage margin while using less memory pretty much seals the deal.
(keep in mind I wrote RegexKitLite, so take the following with whatever sized grain of salt you feel is appropriate).
RegexKitLite turns in good times and uses the smallest amount of memory possible considering the fact that it's returning a NSString object. Since it uses a LRU cache internally for all the ICU regex engine stuff, those costs get amortized over time and repeated uses. It also takes seconds to change the regex if the need comes up (hex values? hex floats? Currencies? Dates? No problem.)
For the simple matchers, it should be obvious that you definitely should NOT use NSScanner to do these kinds of things. Using NSScanner to do a 'scanInt:' is no different than just calling [aString intValue]. The produce the same results with the same caveats. The difference is NSScanner takes FIVE times longer to the same thing, while wasting 32 bytes of memory in the process.... while [aString intValue] (probably) doesn't require one byte of memory to perform its magic- it probably just calls strtoimax() (or an equivalent) and since it has direct access to the pointer holding the strings contents....
The final one is 'rklIntValue', which again is just a slightly tweaked version of what you can find at (the 'RegexKitLite Fast Hex Conversion' link above, stackoverflow won't let me post it twice). It uses CoreFoundation to try to get direct access to the strings buffer, and failing that, allocates some space off the stack and copies a chunk of the string to that buffer. This takes all of, oh, three instructions on the CPU, and is fundamentally impossible to 'leak' like a malloc() allocation. So it uses zero memory and goes very, very fast. As an extra bonus, you pass to strtoXXX() the number base of the string to convert. 10 for decimal, 16 for hex (automatically swallowing a leading 0x if present), or 0 for automagic detection. It's a trivial, single line of code to skip the pointer over any 'uninteresting' characters until you get to what you want (I choose -,+, and 0-9). Also trivial to swap in something like strtod() if you need to parse double values. strtod() converts just about any valid floating point text: NAN, INF, hex floats, you name it.
EDIT:
Per request of the OP, here's a trimmed and minified version of the code that I used to perform the tests. One thing of note: While putting this together, I noticed that Dave DeLongs original regex didn't quite work. The problem is in the negated character set- meta-character sequences inside sets (ie, [^\d]+) mean the literal character, not the special meaning they have outside the character set. Replaced with [^\p{DecimalNumber}]*, which has the intended effect.
I originally bolted this stuff to a RegexKitLite unit test harness, so I left some bits and pieces for GC in. I forgot all about this, but the short version of what happens when GC is turned on is that times of everything BUT RegexKitLite double (that is, takes twice as long). RKL only takes about 75% longer (and that took an enormous, non-trivial amount of effort to get when I was developing it). The rklIntValue time stays exactly the same.
Compile with
shell% gcc -DNS_BLOCK_ASSERTIONS -mdynamic-no-pic -std=gnu99 -O -o stackOverflow stackOverflow.m RegexKitLite.m -framework Foundation -licucore -lauto
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <stdint.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <objc/objc-auto.h>
#include <malloc/malloc.h>
#import <Foundation/Foundation.h>
#import "RegexKitLite.h"
static double cpuTimeUsed(void);
static double cpuTimeUsed(void) {
struct rusage currentRusage;
getrusage(RUSAGE_SELF, ¤tRusage);
double userCPUTime = ((((double)currentRusage.ru_utime.tv_sec) * 1000000.0) + ((double)currentRusage.ru_utime.tv_usec)) / 1000000.0;
double systemCPUTime = ((((double)currentRusage.ru_stime.tv_sec) * 1000000.0) + ((double)currentRusage.ru_stime.tv_usec)) / 1000000.0;
double CPUTime = userCPUTime + systemCPUTime;
return(CPUTime);
}
@interface NSString (IntConversion)
-(int)rklIntValue;
@end
@implementation NSString (IntConversion)
-(int)rklIntValue
{
CFStringRef cfSelf = (CFStringRef)self;
UInt8 buffer[64];
const char *cptr, *optr;
char c;
if((cptr = optr = CFStringGetCStringPtr(cfSelf, kCFStringEncodingMacRoman)) == NULL) {
CFRange range = CFRangeMake(0L, CFStringGetLength(cfSelf));
CFIndex usedBytes = 0L;
CFStringGetBytes(cfSelf, range, kCFStringEncodingUTF8, '?', false, buffer, 60L, &usedBytes);
buffer[usedBytes] = 0U;
cptr = optr = (const char *)buffer;
}
while(((cptr - optr) < 60) && (!((((c = *cptr) >= '0') && (c <= '9')) || (c == '-') || (c == '+'))) ) { cptr++; }
return((int)strtoimax(cptr, NULL, 0));
}
@end
int main(int argc __attribute__((unused)), char *argv[] __attribute__((unused))) {
NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
#ifdef __OBJC_GC__
objc_start_collector_thread();
objc_clear_stack(OBJC_CLEAR_RESIDENT_STACK);
objc_collect(OBJC_EXHAUSTIVE_COLLECTION | OBJC_WAIT_UNTIL_DONE);
#endif
BOOL gcEnabled = ([objc_getClass("NSGarbageCollector") defaultCollector] != NULL) ? YES : NO;
NSLog(@"Garbage Collection is: %@", gcEnabled ? @"ON" : @"OFF");
NSLog(@"Architecture: %@", (sizeof(void *) == 4UL) ? @"32-bit" : @"64-bit");
double startTime = 0.0, csTime = 0.0, reTime = 0.0, re2Time = 0.0, ivTime = 0.0, scTime = 0.0, rklTime = 0.0;
NSString *valueString = @"foo 2020hello", *value2String = @"2020hello";
NSString *reRegex = @"[^\\p{DecimalNumber}]*(\\d+)", *re2Regex = @"\\d+";
int value = 0;
NSUInteger x = 0UL;
{
NSCharacterSet *digits = [NSCharacterSet decimalDigitCharacterSet];
NSCharacterSet *nonDigits = [digits invertedSet];
NSScanner *scanner = [NSScanner scannerWithString:value2String];
NSString *csIntString = [valueString stringByTrimmingCharactersInSet:nonDigits];
NSString *reString = [valueString stringByMatching:reRegex capture:1L];
NSString *re2String = [valueString stringByMatching:re2Regex];
[scanner scanInt:&value];
NSLog(@"digits : %p, size: %lu", digits, malloc_size(digits));
NSLog(@"nonDigits : %p, size: %lu", nonDigits, malloc_size(nonDigits));
NSLog(@"scanner : %p, size: %lu, int: %d", scanner, malloc_size(scanner), value);
NSLog(@"csIntString : %p, size: %lu, '%@' int: %d", csIntString, malloc_size(csIntString), csIntString, [csIntString intValue]);
NSLog(@"reString : %p, size: %lu, '%@' int: %d", reString, malloc_size(reString), reString, [reString intValue]);
NSLog(@"re2String : %p, size: %lu, '%@' int: %d", re2String, malloc_size(re2String), re2String, [re2String intValue]);
NSLog(@"intValue : %d", [value2String intValue]);
NSLog(@"rklIntValue : %d", [valueString rklIntValue]);
}
for(x = 0UL, startTime = cpuTimeUsed(); x < 100000UL; x++) { value = [[valueString stringByTrimmingCharactersInSet:[[NSCharacterSet decimalDigitCharacterSet] invertedSet]] intValue]; } csTime = (cpuTimeUsed() - startTime) / (double)x;
for(x = 0UL, startTime = cpuTimeUsed(); x < 100000UL; x++) { value = [[valueString stringByMatching:reRegex capture:1L] intValue]; } reTime = (cpuTimeUsed() - startTime) / (double)x;
for(x = 0UL, startTime = cpuTimeUsed(); x < 100000UL; x++) { value = [[valueString stringByMatching:re2Regex] intValue]; } re2Time = (cpuTimeUsed() - startTime) / (double)x;
for(x = 0UL, startTime = cpuTimeUsed(); x < 100000UL; x++) { value = [valueString rklIntValue]; } rklTime = (cpuTimeUsed() - startTime) / (double)x;
for(x = 0UL, startTime = cpuTimeUsed(); x < 100000UL; x++) { value = [value2String intValue]; } ivTime = (cpuTimeUsed() - startTime) / (double)x;
for(x = 0UL, startTime = cpuTimeUsed(); x < 100000UL; x++) { [[NSScanner scannerWithString:value2String] scanInt:&value]; } scTime = (cpuTimeUsed() - startTime) / (double)x;
NSLog(@"csTime : %.5lfus", csTime * 1000000.0);
NSLog(@"reTime : %.5lfus", reTime * 1000000.0);
NSLog(@"re2Time: %.5lfus", re2Time * 1000000.0);
NSLog(@"scTime : %.5lfus", scTime * 1000000.0);
NSLog(@"ivTime : %.5lfus", ivTime * 1000000.0);
NSLog(@"rklTime: %.5lfus", rklTime * 1000000.0);
[NSString clearStringCache];
[pool release]; pool = NULL;
return(0);
}