I'm trying to evaluate different substring search (ala strstr) algorithms and implementations and looking for some well-crafted needle and haystack strings that will catch worst-case performance and possible corner-case bugs. I suppose I could work them out myself but I figure someone has to have a good collection of test cases sitting around somewhere...
views:
117answers:
4Doesn't answer your question directly, but you may find the algorithms in the book - Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology - interesting (has many novel algorithms on sub-string search). Additionally, it is also a good source of special and complex cases.
Some thoughts and a partial answer to myself:
Worst case for brute force algorithm:
a^(n+1) b
in (a^n b)^m
e.g. aaab
in aabaabaabaabaabaabaab
Worst case for SMOA:
Something like yxyxyxxyxyxyxx
in (yxyxyxxyxyxyxy)^n
. Needs further refinement. I'm trying to ensure that each advancement is only half the length of the partial match, and that maximal suffix computation requires the maximal amount of backtracking. I'm pretty sure I'm on the right track because this type of case is the only way I've found so far to make my implementation of SMOA (which is asymptotically 6n+5
) run slower than glibc's Two-Way (which is asymptotically 2n-m
but has moderately painful preprocessing overhead).
Worst case for anything rolling-hash based:
Whatever sequence of bytes causes hash collisions with the hash of the needle. For any reasonably-fast hash and a given needle, it should be easy to construct a haystack whose hash collides with the needle's hash at every point. However, it seems difficult to simultaneously create long partial matches, which are the only way to get the worst-case behavior. Naturally for worst-case behavior the needle must have some periodicity, and a way of emulating the hash by adjusting just the final characters.
Worst case for Two-Way:
Seems to be very short needle with nontrivial MS decomposition - something like bac
- where the haystack contains repeated false positives in the right-half component of the needle - something like dacdacdacdacdacdacdac
. The only way this algorithm can be slow (other than by glibc authors implementing it poorly...) is by making the outer loop iterate many times and repeatedly incur that overhead (and making the setup overhead significant).
Other algorithms:
I'm really only interested in algorithms that are O(1)
in space and have low preprocessing overhead, so I haven't looked at their worst cases so much. At least Boyer-Moore (without the modifications to make it O(n)
) has a nontrivial worst-case where it becomes O(nm)
.
A procedure that might give interesting statistics, though I have no time to test right now:
Randomize over string length, then randomize over string contents of that length, then randomize over offset/length of a substring (possibly something not in the string), then randomily clobber over the substring (possibly not at all), repeat.
You can generate container strings (resp., contained test values) recursively by:
Starting with the empty string, generate all strings given by the augmentation of a string currently in the set by adding a character from an alphabet to the left or the right (both).
The alphabet for generating container strings is chosen by you.
You test 2 alphabets for contained strings. One is the one that makes up container strings, the other is its complement.