ansaurus

Question

Fastest way to do a case-insensitive substring search in C/C++?

Answer 1

+2 A:

Why do you use _strlwr(string); in init_stristr()? It's not a standard function. Presumably it's for locale support, but as it's not standard, I'd just use:

char_table[i] = tolower(i);

Chris Young 2008-10-17 09:39:45

It's a special function to handle locale settings correctly. It's Windows-specific, but so is the application this is used in, so portability was not an issue at the time (ref. http://msdn.microsoft.com/en-us/library/hkxwh33z(VS.71).aspx).

Anders Sandvig 2008-10-17 09:51:03

Answer 2

+8 A:

 * I deliberately chose not to comment it.  You should have at least
 * as much fun trying to understand it, as I had to write it :-).

Super ego power, I choose you!

This is the slowest implementation I've ever seen. Meaning that it takes longer to understand than any other implementation I've had the ahem 'joy' ahem of maintaining.

Adam Davis 2008-10-17 09:40:29

Answer 3

+1 A:

*Shudder... goto statments.

Konrad 2008-10-17 10:07:16

Not *all* goto statements are considered harmful. :-)

Head Geek 2008-10-17 17:31:23

Just the ones used in C++ code? :D

Konrad 2008-10-18 21:07:17

Answer 4

+1 A:

I'd advice you to take some of the common strcasestr implementation that already exists. For example of glib, glibc, OpenBSD, FreeBSD, etc. You can search for more with google.com/codesearch. You can then make some performance measurements and compare the different implementation.

quinmars 2008-10-17 10:37:56

Answer 5

+5 A:

The code you posted is about half as fast as strcasestr.

$ gcc -Wall -o my_stristr my_stristr.c
steve@solaris:~/code/tmp
$ gcc -Wall -o strcasestr strcasestr.c 
steve@solaris:~/code/tmp
$ ./bench ./my_stristr > my_stristr.result ; ./bench ./strcasestr > strcasestr.result;
steve@solaris:~/code/tmp
$ cat my_stristr.result 
run 1... time = 6.32
run 2... time = 6.31
run 3... time = 6.31
run 4... time = 6.31
run 5... time = 6.32
run 6... time = 6.31
run 7... time = 6.31
run 8... time = 6.31
run 9... time = 6.31
run 10... time = 6.31
average user time over 10 runs = 6.3120
steve@solaris:~/code/tmp
$ cat strcasestr.result 
run 1... time = 3.82
run 2... time = 3.82
run 3... time = 3.82
run 4... time = 3.82
run 5... time = 3.82
run 6... time = 3.82
run 7... time = 3.82
run 8... time = 3.82
run 9... time = 3.82
run 10... time = 3.82
average user time over 10 runs = 3.8200
steve@solaris:~/code/tmp

The main function was:

int main(void)
{
        char * needle="hello";
        char haystack[1024];
        int i;

        for(i=0;i<sizeof(haystack)-strlen(needle)-1;++i)
        {
                haystack[i]='A'+i%57;
        }
        memcpy(haystack+i,needle, strlen(needle)+1);
        /*printf("%s\n%d\n", haystack, haystack[strlen(haystack)]);*/
        init_stristr();

        for (i=0;i<1000000;++i)
        {
                /*my_stristr(haystack, needle);*/
                strcasestr(haystack,needle);
        }


        return 0;
}

It was suitably modified to test both implementations. I notice as I am typing this up I left in the init_stristr call, but it shouldn't change things too much. bench is just a simple shell script:

#!/bin/bash
function bc_calc()
{
        echo $(echo "scale=4;$1" | bc)
}
time="/usr/bin/time -p"
prog="$1"
accum=0
runs=10
for a in $(jot $runs 1 $runs)
do
        echo -n "run $a... "
        t=$($time $prog 2>&1| grep user | awk '{print $2}')
        echo "time = $t"
        accum=$(bc_calc "$accum+$t")
done

echo -n "average user time over $runs runs = "
echo $(bc_calc "$accum/$runs")

freespace 2008-10-17 11:56:02

Thank you for the comparison. That's very interesting. My code is from 2003, when strcasstr() did not exist—or at least I didn't know about it (it was added in 2005 according to the glibc CVS history). It seems strcasestr() is not provided by MSVC++, but maybe I can port it from glibc.

Anders Sandvig 2008-10-17 12:07:56

Answer 6

+1 A:

Assuming both input strings are already lowercase.

int StringInStringFindFirst(const char* p_cText, const char* p_cSearchText)
{
    int iTextSize = strlen(p_cText);
    int iSearchTextSize = strlen(p_cSearchText);

    char* p_cFound = NULL;

    if(iTextSize >= iSearchTextSize)
    {
     int iCounter = 0;
     while((iCounter + iSearchTextSize) <= iTextSize)
     {
      if(memcmp( (p_cText + iCounter), p_cSearchText, iSearchTextSize) == 0)
       return  iCounter;
      iCounter ++;
     }
    }

    return -1;
}

You could also, try using masks... if for example most of the strings you are going to compare only contains chars from a to z, maybe it's worth to do something like this.

long GetStringMask(const char* p_cText)
{
    long lMask=0;

    while(*p_cText != '\0')
    {  
     if (*p_cText>='a' && *p_cText<='z')
      lMask = lMask | (1 << (*p_cText - 'a') );
     else if(*p_cText != ' ')
     {
      lMask = 0;
      break;  
     }

     p_cText ++;
    }
    return lMask;
}

Then...

int main(int argc, char* argv[])
{

    char* p_cText = "this is a test"; 
    char* p_cSearchText = "test";

    long lTextMask = GetStringMask(p_cText);
    long lSearchMask = GetStringMask(p_cSearchText);

    int iFoundAt = -1;
    // If Both masks are Valid
    if(lTextMask != 0 && lSearchMask != 0)
    {
     if((lTextMask & lSearchMask) == lSearchMask)
     {  
       iFoundAt = StringInStringFindFirst(p_cText, p_cSearchText);
     }
    }
    else
    {
     iFoundAt = StringInStringFindFirst(p_cText, p_cSearchText);
    }


    return 0;
}

João Augusto 2008-10-17 12:12:48

I already tried various implementations where I would convert the string to lower case before comparison, but it turned out to be slower in cases when you are searching for a short string within a long string.

Anders Sandvig 2008-10-17 12:17:32

Also, if both strings are the same case, you can just use strstr()... ;)

Anders Sandvig 2008-10-17 12:18:43

Answer 7

+2 A:

The kinds of programmer you do not want to have the misfortune to work with, write comments like these:

My personal strstr() implementation that beats most other algorithms. * Until someone tells me otherwise, I assume that this is the * fastest implementation of strstr() in C. * I deliberately chose not to comment it. You should have at least * as much fun trying to understand it, as I had to write it :-).

Mitch Wheat 2008-10-17 12:23:41

Answer 8

A:

As someone always told me, "Green lines are your friend" !

( Unless you messed with your IDE and comments aren't green :) )

João Augusto 2008-10-17 12:31:08

Some might say that code should be self-commenting! ;)

Mitch Wheat 2008-10-17 12:33:21

True, but we all know that not all code can be "self-commenting" :)

João Augusto 2008-10-17 12:41:08

Answer 9

A:

plinth 2008-10-17 13:36:56

If you're dealing with ASCII, you need only 128 entries. ASCII stops at 127, unlike bytes. Which is why there are 500 extensions to ASCII. Not that it really matters, this is 2008 and the world uses Unicode now

MSalters 2008-10-17 13:59:03

Answer 10

+1 A:

use boost string algo. It is available, cross platform, and only a header file (no library to link in). Not to mention that you should be using boost anyway.

#include <boost/algorithm/string/find.hpp>

const char* istrstr( const char* haystack, const char* needle )
{
   using namespace boost;
   iterator_range<char*> result = ifind_first( haystack, needle );
   if( result ) return result.begin();

   return NULL;
}

caspin 2008-10-17 19:28:29

Answer 11

A:

If you can control the needle string so that it is always in lower case, then you can write a modified version of stristr() to avoid the lookups for that, and thus speed up the code. It isn't as general, but it can be faster - slightly faster. Similar comments apply to the haystack, but you are more likely to be reading the haystack from sources outside your control for you cannot be certain that the data meets the requirement.

Whether the gain in performance is worth it is another question altogether. For 99% of applications, the answer is "No, it is not worth it". Your application might be one of the tiny minority where it matters. More likely, it is not.

Jonathan Leffler 2008-10-19 06:56:39

ansaurus

tags:

views:

answers:

Fastest way to do a case-insensitive substring search in C/C++?

related questions