views:

164

answers:

6

If I have:

const char *mystr = "cheesecakes";
FILE *myfile = fopen("path/to/file.exe","r");

I need to write a function to determine whether myfile contains any occurrences of mystr. Could anyone help me? Thanks!

UPDATE: So it turns out the platform I need to deploy to doesn't have memstr. Does anyone know of a free implementation I can use in my code?

+2  A: 

If you can't fit the whole file into memory, and you have access to the GNU memmem() extension, then:

  • Read as much as you can into a buffer;
  • Search the buffer with memmem(buffer, len, mystr, strlen(mystr) + 1);
  • Discard all but the last strlen(mystr) characters of the buffer, and move those to the start;
  • Repeat until end of file reached.

If you don't have memmem, then you can implement it in plain C using memchr and memcmp, like so:

/*
 * The memmem() function finds the start of the first occurrence of the
 * substring 'needle' of length 'nlen' in the memory area 'haystack' of
 * length 'hlen'.
 *
 * The return value is a pointer to the beginning of the sub-string, or
 * NULL if the substring is not found.
 */
void *memmem(const void *haystack, size_t hlen, const void *needle, size_t nlen)
{
    int needle_first;
    const void *p = haystack;
    size_t plen = hlen;

    if (!nlen)
        return NULL;

    needle_first = *(unsigned char *)needle;

    while (plen >= nlen && (p = memchr(p, needle_first, plen - nlen + 1)))
    {
        if (!memcmp(p, needle, nlen))
            return (void *)p;

        p++;
        plen = hlen - (p - haystack);
    }

    return NULL;
}
caf
Thanks; I ended up loading the whole thing into memory, then using your memmem to find the needle.
igul222
+3  A: 

Because there is no memmem or memstr to find a string in a binary array (others suggested to read it into memory and use strstr - no this doesn't work) you have to read it byte by byte with "fgetch" and write a small state machine to match it while reading.

Lothar
If you are on Linux and don't care about portability, glibc does have memmem.
R Samuel Klatchko
Loading it byte by byte will hurt performance. Load it into memory, then you search for it in memory.
EvilTeach
A: 
 chat match = "findthis";
 int depth = 0;
 while(not eof)
 {
     char ch = getonebyte();
     if(ch == match[depth])
     {  
         if (depth == strlen(match))
            break;
         else
            depth++;
      }
      else 
           depth = 0;
 }

roughly (I am sure there are off by ones in there)

pm100
Good try. Unfortunately this has an edge case where it will fail when you have a search string like "112" and in the file you have "1112"
R Samuel Klatchko
i dont think so, i start at the beginning of the match string when i hit a mismatch
pm100
pm100, yes you reset your needle index, but you've already lost the part of the haystack you need.
Matthew Flaschen
@pm100 - you think wrong. Yes, you reset depth and start at the beginning of the match string, but you don't seek back in the data stream. Try testing it (of course, you'll actually have to make it work for the non-edge case first).
R Samuel Klatchko
what a moron :-( me I mean
pm100
A: 

Here is a slapped together version of it. It has no error checking and probably has overflow bugs. But I think it finds the desired string and accounts for the backtracking necessary for partial substring matches. I doubt there are more than 15 bugs left.

Edit: There was at least one in the first answer. I woke up in the middle of the night and realized the backtracking check was wrong. It didn't find '12123' in '1212123'. It might still be wrong, but at least it finds that one now.

int main( int argc, char* argv[] )
{
    FILE *fp;
    char *find, *hist;
    int  len, pos=0, hl=0, i;
    char c;

    fp = fopen( argv[1], "r" );
    find = argv[2];
    len = (int)strlen( find );
    hist = malloc( len );
    memset( hist, 0, len );
    while ( !feof( fp )) {
        c = fgetc( fp );
        if ( find[pos++] == c ) {
            if ( pos == len ) {
                printf( "Found it\n" );
                return 1;
            }
        }
        else {
            // check history buffer (kludge for backtracking)
            if ( pos > 0 ) {
                pos = 0;
                for ( i = 0; i < len - 1; i++ )
                    if ( 0 == memcmp( hist+len-i-1, find, i + 1 )) {
                    // we had a mismatch, but the history matches up to len i
                    pos = i;
                }
            }
        }
        // update history buffer - this is innefficient - better as circular buffer
        memmove( hist, hist + 1, len - 1 );
        hist[len-1] = c;
    }
    printf( "Not found\n" );
}
Mark Wilkins
+1  A: 

See here:

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

for a Boyer-Moore implementation in C99. This is a very common string searching algorithm and runs in O(n).

Demi
A: 

Here's a function that will search for a string in a buffer.

Limitations: it doesn't handle wide characters (in the case of internationalization). You'll have to write your own code to read the file into memory. It won't find the pattern if the pattern is split between 2 read buffers.

/*****************************************************
const char *buffer    pointer to your read buffer (the larger, the better).
size_t bufsize        the size of your buffer
const char *pattern   pattern you are looking for.

Returns an index into the buffer if pattern is found.
-1 if pattern is not found.

 Sample:
    pos = findPattern (buffer, BUF_SIZE, "cheesecakes");
*****************************************************/

int findPattern (const char *buffer, size_t bufSize, const char *pattern)
{
    int i,j;
    int patternLen;

    // minor optimization. Determine patternLen so we don't 
    // bother searching buffer if fewer than patternLen bytes remain.
    patternLen = strlen (pattern);

    for (i=0; i<bufSize-patternLen; ++i)
    {
        for (j=0; j<patternLen; ++j)
        {
            if (buffer[i+j] != pattern[j])
            {
                break;
            }
        }
        if (j == patternLen)
        {
            return i;
        }
    }
    return -1;
}
Mike Yam