ansaurus

Question

What is the fastest way to decide if a digit appears in a string?

Answer 1

+7 A:

Any algorithm will be O(N).

I assume isdigit is pretty efficient already.

Brian R. Bondy 2009-04-15 18:23:14

At *least* O(N) :) It's easy to implement pessimizations and "smart" algorithms :)

Mihai Limbășan 2009-04-15 18:25:43

Good point, you could have some less efficient algorithms, but they'd probably be much harder to implement.

Brian R. Bondy 2009-04-15 18:28:02

Kibbee 2009-04-15 18:37:38

isdigit() is usually a table lookup, i.e. faster than two comparisons. And locale-safe to boot. ;-)

DevSolar 2009-04-15 18:38:34

isdigit is typically a bitmask operation - and quicker than the two-part test.

Jonathan Leffler 2009-04-15 18:41:33

updated to remove mention of the 2 part comparison.

Brian R. Bondy 2009-04-15 19:07:34

+1 in that form - it is usually hard to beat the standard library in efficiency.

DevSolar 2009-04-16 09:06:12

Answer 2

+2 A:

Conceptually there is no faster way. This is assuming that you have a string which where the placement of digits is seemingly random. This forces you to search every item in the string for a digit hence front to back is as likely to find the digit first as any other search mechanism.

JaredPar 2009-04-15 18:24:29

ya, there is no pattern to the position of digits which would lend itself to an educated, or cache based guess.

EvilTeach 2009-04-15 18:40:31

Answer 3

+1 A:

if you really want to reduce the overhead time and you don't mind making it specific for char, then you can check the ascii values between 0 and 9 inclusive.

48 to 57 decimal

this removes a stack call.

I should have said lookup table as well...

Tim 2009-04-15 18:26:49

That's exactly what isdigit does, and you can just make your own inline isdigit implementation. but yeah.

2009-04-15 18:30:19

Scratch that, the comments to Bondy's answer refute me.

2009-04-15 18:42:37

Answer 4

+1 A:

As others have said, you can't get below O(N).

I can think of a contrived scenario with logarithmic speed... Say you're writing a text editor and it needs a "does this file contain any digits" feature. You can keep a sorted array of all unique characters present in the file, update it on every keystroke, and query it with binary search. This probably lies outside the scope of your question though (and can be done in several better ways).

2009-04-15 18:28:47

Answer 5

+1 A:

You could go multithreaded, although that's probably adding too much complexity to the algorithm for something that's already pretty fast.

Kibbee 2009-04-15 18:35:21

Good point, but for all but the longest strings, this is going to be a pessimization due to the overhead of threads.

2009-04-15 18:41:37

You could also hurt yourself with threads due to horrible caching effects. For stuff like this one would hope the real limiter would be memory latency/bandwidth, and using multiple cores could potentially make it worse.

Christopher Smith 2009-04-15 18:44:40

If the strings are big enough, sounds like a winner.

Milhous 2009-04-16 03:02:46

Answer 6

+9 A:

There isn't a faster algorithm, but you can look at a full register's worth of bytes per instruction or use SIMD operations to speed things up. You can either use a mask and zero test to see if it is even possible if there are any digits in a range, or if your SIMD operations are fast enough on large enough vectors, you could do iterate through tests for specific numerical values in a vector of bytes faster than doing character compares.

So, for example, you could do something like:

byte buffer[8] = { 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30 };
uint64 *mask = (uint64 *) buffer; //this is just for clarity

if (*((uint64 *) s) & *mask) == 0)
    //You now don't need to do the < '0' test for the next 8 bytes in s

Some optimizers might be smart enough to do this for you just from your code sample above.

You had better be comparing a TON of bytes to be thinking about optimizations at this level though.

Christopher Smith 2009-04-15 18:43:05

This tells you if there are no chars 0..., P..., p... What do you do then?

Mike Dunlavey 2009-04-15 19:49:58

Surely it tells you that there are no numbers in the current 8 numbers you're looking at, so you skip to the next 8 numbers.

justinhj 2009-04-15 20:24:37

Oh, OK. Zero means there are no numbers in that set (and no letters >='P' or 'p'), but non-zero does not mean there are numbers. So this saves significant time if letters >= 'P' or 'p' are rarely used.

Mike Dunlavey 2009-04-15 23:03:11

All digits have 30 in the left nibble. if you and them and compare it to the mask and they are equal, then it is a digit.

EvilTeach 2009-04-16 19:29:39

@Evil: Understand, but that's not what this code does. It only tells you if the string is like "AbCdEfGh", i.e. only letters and only letters that come before P or p. Or am I missing something?

Mike Dunlavey 2009-04-16 22:01:57

EvilTeach 2009-04-18 01:20:29

The code above, does this 8 bytes at a time instead of 1.

EvilTeach 2009-04-18 01:21:38

Right, and if you could do 16 bytes at a time, then it'd be faster to have a specific mask for detecting each number, and do it 10 times, which would be faster than comparing each byte individually. You could also just have a second mask that sets an upper bound on the range, so at that point you'd already have a very high probability of finding a number when you looked.

Christopher Smith 2009-04-23 03:36:01

Answer 7

+3 A:

The real question is how important is it for this function to be optimal? I say leave the simple solution and profile it. You should only optimize it if it is causing a bottleneck in your code.

Patrick McDonald 2009-04-15 18:45:41

Answer 8

+1 A:

Just for fun, maybe something allong the lines of:

 // accumulator
unsigned char thereIsADigit = 0;
// lookup table
unsigned char IsDigit[256] = {0,0,0 ..., 1,1,1,1,1,1,1,1,1,0,0,0 ...};

// an unrolled loop, something like:
thereIsADigit |= IsDigit[s[0]];
thereIsADigit |= IsDigit[s[1]];
thereIsADigit |= IsDigit[s[2]];
thereIsADigit |= IsDigit[s[3]];
thereIsADigit |= IsDigit[s[4]];
thereIsADigit |= IsDigit[s[5]];
thereIsADigit |= IsDigit[s[6]];
thereIsADigit |= IsDigit[s[7]];
if (thereIsADigit) break;
s += 8;

On the IBM 360 there was a "translate" instruction that could do this in one step.

OK, OK, Christopher Smith's answer got me thinking. Suppose you are only using 7-bit ASCII. Here's a way to do SIMD with wide-integer arithmetic.

Suppose C is a 32-bit word containing 4 characters.

 // compare by subtracting in 8-bit 2s complement arithmetic
( (C + ((0x3a3a3a3a ^ 0x7f7f7f7f) + 0x01010101)) // set high bit for any char <= '9'
  &
  (0x2f2f2f2f + ((C ^ 0x7f7f7f7f) + 0x01010101)) // set high bit for any char >= '0'
) // high bit is set for any char <= '9' and >= '0'
& 0x80808080 // look only at the high-order bits
// if any of these 4 bits is 1, there is a digit in C
// if the result is zero, there are no digits in C

It depends on the high-order bit of each character being initially zero, so that a carry into that bit will not propogate. (I'm sure this could be simplified.)

Mike Dunlavey 2009-04-15 19:16:08

Ya, the input set is 7 bit ascii.

EvilTeach 2010-03-17 20:56:31

Answer 9

+11 A:

I would start by using the appropriate library function, strcspn, on the assumption that the library has been optimized with extreme prejudice:

#include <string.h>
#include <stdio.h>

int digit_exists_in(const char *s)
{
    return s[strcspn(s, "0123456789")] != '\0';
}

int main(void)
{
    printf("%d\n", digit_exists_in("foobar"));
    printf("%d\n", digit_exists_in("foobar1"));
    return 0;
}

If the library hasn't been optimized sufficiently, it'd be a good idea to put the optimization into the library so everyone can benefit. (You have the source, right?)

Lars Wirzenius 2009-04-15 19:33:59

Interesting. I rejected strspn as the doc seemed to suggest that if the first character was not in the search set then the function would return 0.

EvilTeach 2009-04-16 16:11:20

strspn return an integer value specifying the length of the substring in string that consists entirely of characters in strCharSet. If string begins with a character not in strCharSet, the function returns 0. No return value is reserved to indicate an error.

EvilTeach 2009-04-16 16:14:07

I will revisit strspn and strcspn. +1

EvilTeach 2009-04-16 16:15:10

Answer 10

A:

Have a human look at it. Humans can do this in O(1) time, we have MUCH larger word sizes than even modern processors.

That said, actual time would still be better with your method...what with the difference in cycle time between a modern core and a human brain.

Jeff 2009-04-15 19:48:45

Not to mention the caloric load (twinkies/invocation :)

Mike Dunlavey 2009-04-15 23:10:40

A downvote? Seriously?

Jeff 2009-04-16 13:57:57

++ for a sense of humor

Mike Dunlavey 2009-04-16 22:10:56

Answer 11

A:

I may be wrong, but there may be a faster way.

Quicksort the string.

The serial search has a best time of O(1), a mean of O(1/2n) and worse case of O(n).

Quicksort has best O(log n), mean O(nlog n), worse case O(n^2).

The thing is, you can bail out of the quicksort as soon as you see a digit. If the quicksort actually completes, the digit will be at the beginning of the sorted string, so you'll find it then in O(1).

The achievement here it to shift the best, mean and worse case behaviours. A quicksort will have worse worst case behaviour, but better mean behaviour.

Blank Xavier 2009-04-15 19:59:49

P.s. this post is a joke :-)

Blank Xavier 2009-04-15 20:00:17

Good thinking. I prefer merge sort, which is always NlogN. Of course, that's only logN times slower than N :)

Mike Dunlavey 2009-04-15 23:07:42

Would it be quicker to use XML ? ;-)

Martin Beckett 2009-04-17 20:56:01

If you're going to BS, do it properly. Anyone can see 1/2 n is better than n log n.

Matthew Flaschen 2009-04-18 05:27:16

@Matthew - O(nlog n) is the mean for a *completed* quicksort. However, in this method, the quicksort usually does not need to complete.

Blank Xavier 2009-04-18 07:02:51

Answer 12

A:

Of course, you can sacrifice accuracy for speed:

int digit_exists_in(const char* s)
{
    return 0;
}

This algorithm has a complexity of O(1) and an approximacity of O((246/256)^N).

aib 2009-04-16 00:55:21

I like your Level 1 Shield of Community Wiki.

ryeguy 2009-04-17 20:43:34

Agreed on the shield! The algorithm can be improved to approximacity of O((246/256)^(N-1)) with no increase in complexity (in big-O) by replacing the "return 0" with "return isdigit(*s)"

DocMax 2009-04-17 21:34:05

Thanks, it was so I couldn't gain any reputation from a joke. (Though now I think it works both ways.)

aib 2009-04-19 00:17:35

Answer 13

+17 A:

liw.fi is right, by the way. I was a bit surprised by this, since strcspn has to do a much more general problem than the isdigit() approach, but it seems to be the case:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <assert.h>

#define NTESTS 10000
#define TESTSIZE 10000

char stest1[TESTSIZE];
char stest2[TESTSIZE];

int test_isdigit(char *s) {
    while (*s) {
        if (isdigit(*s)) return 1;
        s++;
    }
    return 0;
}

int test_range(char *s) {
    while (*s) {
        if ((*s >= '0') && (*s <= '9')) return 1;
        s++;
    }
    return 0;
}

int test_strcspn(char *s) {
    return s[strcspn(s, "0123456789")] != '\0';
}

int main(int argc, char **argv) {
    long int i;
    for (i=0; i<TESTSIZE; i++) {
        stest1[i] = stest2[i] = 'A' + i % 26;
    }
    stest2[TESTSIZE-1] = '5';

    int alg = atoi(argv[1]);

    switch (alg) {
        case 0:        
            printf("Testing strcspn\n");
            for (i=0; i<NTESTS; i++) {
                assert(test_strcspn(stest1) == 0);
                assert(test_strcspn(stest2) != 0);
            }
            break;
        case 1:
            printf("Testing isdigit() loop\n");
            for (i=0; i<NTESTS; i++) {
                assert(test_isdigit(stest1) == 0);
                assert(test_isdigit(stest2) != 0);
            }
            break;
        case 2:
            printf("Testing <= => loop\n");
            for (i=0; i<NTESTS; i++) {
                assert(test_range(stest1) == 0);
                assert(test_range(stest2) != 0);
            }
            break;
        default:
            printf("eh?\n");
            exit(1);
    }    

    return 0;
}

It's awfully hard to beat the standard libraries at their own game ... (the usual caveats apply -- YMMV)

$ gcc -O6 -Wall -o strcspn strcspn.c 

$ time ./strcspn 0
Testing strcspn

real    0m0.085s
user    0m0.090s
sys 0m0.000s

$ time ./strcspn 1
Testing isdigit() loop

real    0m0.753s
user    0m0.750s
sys 0m0.000s

$ time ./strcspn 2
Testing <= => loop

real    0m0.247s
user    0m0.250s
sys 0m0.000s

UPDATE: Just for fun, I added a bitmap lookup version based on Mike Dunlavey's answer:

char bitmap[256] = {
        /* 0x00 */ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* 0x10 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* 0x20 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* 0x30 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
};

int test_bitmap(char *s) {
    while (!bitmap[*(unsigned char *)s]) s++;
    return (*s);
}

Which slightly outperforms the others (~.170s) but still can't touch strcspn!

Sharkey 2009-04-16 01:19:58

+1 for benchmarking and reporting.

RBerteig 2009-04-16 01:38:37

I get similar results, FWIW. test_range() and test_digit() execute almost identically, with strcspn being the clear 'winner'.

Nick Presta 2009-04-16 02:35:56

Here on OS X (Leopard), test_digit() performs pretty awfully (2.281 seconds real time), whereas test_range() performs alright (0.686 seconds) and test_strcspn() is still the clear winner (0.275 seconds). The system time for all of them is 0.003 to 0.006, but I notice the lag time for test_isdigit().

Chris Lutz 2009-04-16 06:26:23

Ya. Using numbers and facts to back up your argument is cheating +1

EvilTeach 2009-04-16 16:19:31

Very well done, actual benchmarking is always useful. I doubt there's a huge variation between compilers/architectures in this, but just in case: anyone wanting to use this code should run the benchmark themselves.

Lars Wirzenius 2009-04-17 07:15:27

Some insight into why strcpsn is faster: Apparently it uses a lookup table (i.e. a boolean value for all 256 possible characters) instead of >= and <=. (Maybe they have some low-level assembly hacks too :P)

v3 2009-04-23 22:02:34

Answer 14

+1 A:

Here's a version that may or may not be faster, but it handles a NULL pointer...

int digit_exists_in(const char *s)
{
    if (!s)
        return (0);
    while (*s)
        if (isdigit(*s++))
            return (1);
    return (0);
}

dwc 2009-04-16 01:43:42

Answer 15

A:

Memory Prefetch

If your strings are very long, either get your compiler to do it or manually unroll your loop and drop in a memory prefetch instruction or two every cache-line.

That way while the CPU is scanning, the memory controller can be pulling in the next lines of data.

If you save the length of the string when you create it you can skip all the checks for the NUL byte, which means you can unroll your loop to operate in bigger chunks and reduce the number of compare and branch operations, although with current branch predictors, it honestly doesn't make much difference.

Even with great CPU branch predictors, the loop will slow down if it has to check the loop counter every time through the loop to decide when to do a memory prefetch so unrolling is still helpful in that case.

Profiling Feedback

For best performance the CPU does need to get the branches hinted correctly, and that's where profiling feedback comes in very handy. Otherwise the compiler is just making a somewhat educated guess.

Zan Lynx 2009-04-16 02:28:12

Answer 16

A:

Taking the test program and throwing my profiler at it yields the following.

      Count      %   % with           Time   Statement
                      child
-----------------------------------------------------------------------------------------------
                                             int test_isdigit(char *s)   
     20,000    0.0    0.0          2,160.4   {   
199,990,000   13.2    9.5     14,278,460.7       while (*s)  
                                                 {   
199,980,000   69.8   49.9     75,243,524.7            if (isdigit(*s)) return 1;  
199,970,000   17.0   12.1     18,312,971.5            s++;    
                                                 }   

     10,000    0.0    0.0          1,151.4       return 0;   
                                             }   

                                             int test_range(char *s)     
     20,000    0.0    0.0          1,734.2   {   
199,990,000   33.6    9.4     14,114,309.7       while (*s)  
                                                 {
199,980,000   32.2    9.0     13,534,938.6           if ((*s >= '0') && 
                                                        (*s <= '9')) return 1;   
199,970,000   34.2    9.5     14,367,161.9           s++;    
                                                 }   

     10,000    0.0    0.0          1,122.2       return 0;   
                                             }   

                                             int test_strcspn(char *s)   
     20,000    0.2    0.0          1,816.6   {   
     20,000   99.8    0.6        863,113.2       return s[strcspn(s, "0123456789")] 
                                                          == '0'; 
                                             }

strcspn does the job well enough. Looking at the asm code for it, I see it builds a bitmap of size 256 sets the bits based on the search characters, then processes the string.

The bit map gets built on the stack once for every call.

One other approach would be to build and keep the bit map, and reused it each time.

Another approach would be to do the operations in parallel using the techniques that Chris Smith talked about.

For the moment strcspn will suffice.

EvilTeach 2009-04-18 04:03:52

ansaurus

tags:

views:

answers:

What is the fastest way to decide if a digit appears in a string?

Memory Prefetch

Profiling Feedback

related questions