views:

353

answers:

5

Hi,

Let's say I have the following content:

Lorem Ipsum is simply dummy text of the printing and typesetting industry.

How do I search for dummy or dummy text in that string using C? Is there any easy way to do it or only with strong string manipulation? All I need is to search for it and return a Boolean with the result.

EDIT:
You guys created a big discussion around this topic and suggested a few algorithms and I don't mind that cause this might be useful for someone else, or even me in the future. But what I really wanted was the most easy way to do it, no matter the time/space complexity. That doesn't really matter for what I'm doing. So, strstr easily and quickly fixed my problem. I really have to get me some standard C functions chet sheet.

+4  A: 

The standard library function for this is strstr:

char *strstr(const char *haystack, const char *needle);

It returns a pointer into the string where the match was found, or NULL if it wasn't - so if all you need is a boolean, just test the return value (if (strstr(...)).

Jefromi
And, strstr() is POSIX -- yeah! http://www.opengroup.org/onlinepubs/9699919799/
Kevin Little
@Kevin: doesn't being in the C standard library pretty much mean it's also in POSIX? (POSIX states that one of its goals is "alignment with the ISO/IEC 9899:1999 standard, including ISO/IEC 9899:1999/Cor.2:2004(E)")
Michael Burr
@Michael: I think you are correct, at least as far as the contents of "string.h" are concerned. I was just attempting to reinforce the "*standard* library function" concept that Jefromi was gently pushing, and cheer-leading for POSIX is a 20-year-plus habit that's hard to break! :)
Kevin Little
+2  A: 

You can use the strstr function if you want something simple and your strings aren't too long. If your strings are very long however, consider the KMP algorithm as it is a lot more efficient.

Edit:

I don't really like the wikipedia article, as the implementation there looks a bit weird to me (although it's probably correct) and it's also misleading about KMP's performance. I prefer the implementation and description given here and on other sites returned by a google search for "KMP algorithm".

IVlad
It's a lot more efficient... in some cases. Quote from the wikipedia article you linked: "Note that in practice, the KMP algorithm is not good at searching in natural language texts because it can only skip characters when the first part of the pattern actually matches a part of the text. In practice this only happens occasionally in natural language texts."
Jefromi
As far as I know, the `strstr` function's time complexity is `O(NM)`, while KMP's complexity is `O(N+M)`, so even if there are cases in which it doesn't behave as best as possible, it will still never reach quadratic time, so it should always be faster than `strstr`.
IVlad
@IVlad: You're right about the complexities, of course. I haven't done any real analysis, but here's the hand-waving argument. There are in reality constants in front of those big O's, and KMP's is larger, because of all the extra work it does. If KMP doesn't get to skip much (which it probably doesn't in natural language text), it's possible for it to perform worse on a set of natural language searches, despite it being better over all strings. Those are *average* complexities. Don't worry, you have my upvote, just trying to point out the gains aren't necessarily as big as they sound.
Jefromi
All right, I took a 180000 character text file (git's user-manual.txt), pulled a hundred random strings for each size from 5 to 15, and ran both strstr and KMP to search for each of them (many times). The difference in performance was less than the standard deviation of each. (I used this implementation: http://barnyard.syr.edu/quickies/kmp.c) I don't deny KMP is better in general; it clearly is, and for something like DNA strings, it'd be way way better. Just not so much for natural text.
Jefromi
Just as another data point, here's what Raymond Chen had to say about naive string searches vs. advanced string search algorithms: http://blogs.msdn.com/oldnewthing/archive/2006/01/19/514834.aspx
Michael Burr
There is no reason why the implementation of `strstr()` has to be a naive `O(NM)` algorithm. Indeed glibc uses the "Two-way" algorithm that guarantees linear performance: http://sourceware.org/git/?p=glibc.git;a=blob_plain;f=string/str-two-way.h;hb=HEAD
caf
@caf: Thanks for weighing in there - I figured there was a good change glibc had made a fairly informed decision, just wasn't sure where to go look so I trusted IVlad over my instincts. Good to know!
Jefromi
A: 

I would use strstr. (Also here).

I am not about the use of word "partial" in the question. The argument ("dummy" or "dummy text") has to be fully matched, right?

ArunSaha
A: 

I've always liked Boyer-Moore, myself. It is O(n), but must be setup (i.e., two tables must be precomputed.) Thus it is good if a lot of text is to be searched, or the search strings are known in advance, thus making up for the cost of building the tables. It is also best for 8-bit ASCII.

[http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm]

(BTW, is there a Unicode flavor of strstr()?)

Kevin Little
you don't nead a unicode version of strstr if needle and haystack are in the same encoding (and that encoding is ASCII compatible, ie UTF-8) It will byte compare each element. Of course, it won't do fancy things like equating è with e or é... Glib has utf8 string utility functions if you need the advanced stuff: http://library.gnome.org/devel/glib/2.24/glib-Unicode-Manipulation.html
Isak Savo
@Isak: not entirely true - `strstr()` would not work well on UTF-16 because of NUL bytes in basic characters. That's apart from the fact that you'd normally use `wchar_t` for that - and presumably `wcsstr()`. For UTF-8, basic `strstr()` works fine, though.
Jonathan Leffler
yeah you're right jonathan... that's what I was trying to say with "ascii compatible".. but it's worth clairifying anyway
Isak Savo
+1  A: 

There's an extensive discussion of a large number of string searching algorithms at http://www-igm.univ-mlv.fr/~lecroq/string/, with illustrative C code and references.

There's a discussion in one set of comments about the costs of the algorithms. One of the points to bear in mind is that if you can amortize the cost of setup over many invocations of the search function, then the high-performance algorithms can give you enormous benefit. If you are going to be searching for different strings all the time, it is harder to win out.

I've got a version of the KMP (Knuth-Morris-Pratt) algorithm packaged for multiple reuse of the same search string. The header is:

/*
@(#)File:           $RCSfile: kmp.h,v $
@(#)Version:        $Revision: 1.4 $
@(#)Last changed:   $Date: 2008/02/02 05:49:34 $
@(#)Purpose:        Knuth-Morris-Pratt Search Algorithm
@(#)Author:         J Leffler
@(#)Copyright:      (C) JLSS 2005,2008
@(#)Product:        :PRODUCT:
*/

#ifndef KMP_H
#define KMP_H

#include <stddef.h> /* size_t */

typedef struct kmp_control kmp_control;

/*
** To set up a search (to repeatedly look for the same search string in
** multiple scan strings), use kmp_setsearch().  To start a search on a
** new scan string, use kmp_settarget().  To find the next match of a
** given search string in a given target string, use kmp_search().  Note
** that kmp_setsearch() and kmp_settarget() do not copy the data in the
** source and target strings; the pointers must remain valid You can
** copy kmp_control structures for reuse if desired.
*/
typedef void *(*kmp_malloc)(size_t nbytes);
typedef void (*kmp_free)(void *data);

extern kmp_control *kmp_setsearch(const char *search, size_t schlen);
extern void kmp_settarget(kmp_control *ctrl, const char *target, size_t tgtlen);
extern const char *kmp_search(kmp_control *ctrl);
extern void kmp_release(kmp_control *ctrl);
extern void kmp_setalloc(kmp_malloc mem_alloc, kmp_free mem_free);

#endif /* KMP_H */

Being able to specify memory allocation functions is a tad unusual - but my code often works in an environment where memory allocation is not done via the standard malloc() and so on, and you must be able to switch the memory allocator on demand. You can ignore the two typedefs and the corresponding function; the default settings are, of course, to use malloc() and free().

The basic KMP algorithm code came from the site above - but was modified to allow me to set the search string once and then search multiple target strings, etc. Contact me (see my profile) for the source code. I have got a similar structure for Boyer-Moore code too (same original source), and also a case-insensitive Boyer-Moore code.

There's a good war story about strstr() and performance in Kernighan and Pike's excellent book "The Practice of Programming".


I did some experimentation - using a copy of the King James Bible (4.8 MB) as the plain text, and memory mapping that. For many searches, the (MacOS X 10.6.2 / BSD) strstr() was faster than either KMP or BM. When the strings grew long enough (12+ characters, approximately), then the BM algorithm finally outpaced strstr(). The KMP algorithm always seemed to be much slower.

Morals?

  • It is hard to out-pace a good library.
  • KMP is much slower than BM on plausible English language strings.

And the infrastructure I put in place around the algorithms may be too heavy - but the alternative in the original code is a callback mechanism, which presents some problems for determining the context of matches.

Jonathan Leffler