There's an extensive discussion of a large number of string searching algorithms at http://www-igm.univ-mlv.fr/~lecroq/string/, with illustrative C code and references.
There's a discussion in one set of comments about the costs of the algorithms. One of the points to bear in mind is that if you can amortize the cost of setup over many invocations of the search function, then the high-performance algorithms can give you enormous benefit. If you are going to be searching for different strings all the time, it is harder to win out.
I've got a version of the KMP (Knuth-Morris-Pratt) algorithm packaged for multiple reuse of the same search string. The header is:
/*
@(#)File: $RCSfile: kmp.h,v $
@(#)Version: $Revision: 1.4 $
@(#)Last changed: $Date: 2008/02/02 05:49:34 $
@(#)Purpose: Knuth-Morris-Pratt Search Algorithm
@(#)Author: J Leffler
@(#)Copyright: (C) JLSS 2005,2008
@(#)Product: :PRODUCT:
*/
#ifndef KMP_H
#define KMP_H
#include <stddef.h> /* size_t */
typedef struct kmp_control kmp_control;
/*
** To set up a search (to repeatedly look for the same search string in
** multiple scan strings), use kmp_setsearch(). To start a search on a
** new scan string, use kmp_settarget(). To find the next match of a
** given search string in a given target string, use kmp_search(). Note
** that kmp_setsearch() and kmp_settarget() do not copy the data in the
** source and target strings; the pointers must remain valid You can
** copy kmp_control structures for reuse if desired.
*/
typedef void *(*kmp_malloc)(size_t nbytes);
typedef void (*kmp_free)(void *data);
extern kmp_control *kmp_setsearch(const char *search, size_t schlen);
extern void kmp_settarget(kmp_control *ctrl, const char *target, size_t tgtlen);
extern const char *kmp_search(kmp_control *ctrl);
extern void kmp_release(kmp_control *ctrl);
extern void kmp_setalloc(kmp_malloc mem_alloc, kmp_free mem_free);
#endif /* KMP_H */
Being able to specify memory allocation functions is a tad unusual - but my code often works in an environment where memory allocation is not done via the standard malloc()
and so on, and you must be able to switch the memory allocator on demand. You can ignore the two typedefs and the corresponding function; the default settings are, of course, to use malloc()
and free()
.
The basic KMP algorithm code came from the site above - but was modified to allow me to set the search string once and then search multiple target strings, etc. Contact me (see my profile) for the source code. I have got a similar structure for Boyer-Moore code too (same original source), and also a case-insensitive Boyer-Moore code.
There's a good war story about strstr()
and performance in Kernighan and Pike's excellent book "The Practice of Programming".
I did some experimentation - using a copy of the King James Bible (4.8 MB) as the plain text, and memory mapping that. For many searches, the (MacOS X 10.6.2 / BSD) strstr()
was faster than either KMP or BM. When the strings grew long enough (12+ characters, approximately), then the BM algorithm finally outpaced strstr()
. The KMP algorithm always seemed to be much slower.
Morals?
- It is hard to out-pace a good library.
- KMP is much slower than BM on plausible English language strings.
And the infrastructure I put in place around the algorithms may be too heavy - but the alternative in the original code is a callback mechanism, which presents some problems for determining the context of matches.