views:

224

answers:

9

I need to search incoming not-very-long pieces of text for occurrences of given strings. The strings are constant for the whole session and are not many (~10). Additional simplification is that none of the strings is contained in any other.

I am currently using boost regex matching with str1 | str2 | .... The performance of this task is important, so I wonder if I can improve it. Not that I can program better than the boost guys, but perhaps a dedicated implementation is more efficient than a general one.

As the strings stay constant over long time, I can afford building a data structure, like a state transition table, upfront.

e.g., if the strings are abcx, bcy and cz, and I've read so far abc, I should be in a combined state that means you're either 3 chars into string 1, 2 chars into string 2 or 1 char into string 1. Then reading x next will move me to the string 1 matched state etc., and any char other than xyz will move to the initial state, and I will not need to retract back to b.

Any ideas or references are appreciated.

A: 

Regex engine initialization is expected to have some overhead, so if there are no real regular expressions involved, C - memcmp() should do fine.

If you can tell the file sizes and give some specific use cases, we could build a benchmark (I consider this very interesting).

Interesting: memcmp explorations and timing differences

Regards

rbo

rubber boots
At the very end of the `memcmp` article: http://news.ycombinator.com/item?id=607954 which shows that Boyer Moore beats the crap out of an optimized `memcmp`... and I would probably not advise `memcmp` in C++, we have a `std::string` class with a `find` method after all, no need to deal with the nasties.
Matthieu M.
@Matthieu: good point, I'd like to test this in a scenario that matches the conditions of the O.P. - if he would describe them in some detail.
rubber boots
+3  A: 

Take a look at Suffix Tree.

wilhelmtell
A: 

There is always Boyer Moore

doron
A: 

Following the suggestion you put forward in the question, here's a roll-your-own solution.

Make a class that stores your target string and holds a position of where you are in your comparison whereAmI. The compare method returns true iff you've got a match, updates whereAmI if you've got a partial match, or resets whereAmI if there's no match.

roughly:

class myStrCmp {
 public: 
  myStrCmp(const string &setTarget) { whereAmI=0; target=setTarget; }
  bool compare(char c) {
    if (target[whereAmI]==c) {
      whereAmI++;
    } else {
      whereAmI=0;
    }

    if (whereAmI==target.size()) {
      whereAmI=0;
      return true;
    } else {
      return false;
    }
  }
 private:
  string target;
  size_t whereAmI;
}

Make a container to hold one instance for each of the strings you want to match, iterate over the container for each character in the file. You may wish to add a reset method to the class in case you don't want overlapping words (reset each myStrCmp object whenever one matches).

I can't say whether this would be faster than regex.

flies
That OO thing has clearly gone way over the top, now people are even creating __classes for algorithms__! I'm sorry to say, but this is plain _stupid_. Classes categorize objects, which wrap state and operations to change the state. Algorithms are implemented by _functions_. Oh, and C++ is not Pascal, so `Class` vs. `class` is a difference. And it's not Java (or C# or whatever) either, so you don't copy around strings needlessly, you [pass them per `const` ref](http://stackoverflow.com/questions/2139224/how-to-pass-objects-to-functions-in-c/2139254#2139254). All combined, this warrants `-1`.
sbi
Your second two gripes are tedious and irrelevant to solving the problem. Does "roughly" mean nothing to you? I didn't include a destructor either. I guess I deserve a slap on the wrist for that too!?As for your idea that functions are for algorithms and classes are for objects, it makes sense as a general rule. In this case, the algorithm involves a state (whereAmI) and associated data (target), so it makes perfect sense to use a class. Unless you can show me that using a function results in clearer code I don't see how you're not just blindly applying a general rule.
flies
@flies: Rather than getting defensive and complaining about perfectly legitimate issues with your code, you should be attempting to learn and fix your code. -1, because *all* of what sbi says here are valid complaints.
Billy ONeal
It is hard to learn from a comment that says "you are wrong" and little more. It's especially hard to avoid getting defensive when criticisms are voiced as dismissively and rudely as in sbi's remarks. There is a state (whereAmI) in the algorithm, so according to sbi ("classes... wrap state and operations to change state") it is correct to use a class. Why not?
flies
@flies: I believe sbi quite specifically mentioned 4 issues with the above code. His comment is not a "this sucks, have a nice day". At least fixing the misspelling of `class` so that the code would have compiled, and fixing the pass-by-reference-to-const issue, are indisputably the right things to do. If you disagree with his point point on the class, that's fine, that's more a matter of opinion. But leaving a comment to the effect of "I fixed issue 2, 3 and 4, however I disagree with you on issue 1 because..." would have been fine. As it stands, this has turned into nothing but a flame war.
Billy ONeal
I regret that this discussion got to such tones. Apart from the obvious C++ issue (mainly the "pass by ref", ignoring the `Class` typo), I find this answer quite useful. Instead of a global state transition table I suggested that knows everything on all, which can be quite tricky to generate, the states here are distributed between strings, each taking care of itself. This of course increases the number of comparisons, but can serve as initial go-no-go test: if this performs better than what I have now, the issue is worth looking further at. +1 for me...
davka
@sbi: re "algorithm as a class" issue, there is a well-known `Strategy` pattern doing exactly what you claim to be "stupid". There are good reasons to encapsulate behavior in a class, so I wouldn't automatically reject this.
davka
there seems to be bugs in this solution after all: first, if there's no match, you need to re-compare to the first character of the string, second, you need to retract in the text to where the potential match started, or you may miss a match, e.g. `string = aab, text = aaab`.
davka
@davka: Yes, there are reason's for wrapping algorithms in classes (that's what functors are all about, after all), but I couldn't seen one in this case. That might be my error or just a matter of opinion, but I specifically wrote "all combined, this warrants" a down-vote. I wouldn't have down-voted just for the class/algorithm issue by itself (or any of the other, FTM). In fact, had flies fixed the obvious errors, I had removed my down-vote, even without any arguments in favor of the class.
sbi
@flies: look at http://imgs.xkcd.com/comics/duty_calls.png and live by it!
rubber boots
@flies I usually find it very hard work to shrink such comments below 500 characters, which certainly makes for some terseness which sometimes comes across as impolite. Also, where I work I have a well-earned reputation to simply blurt out "that's BS" when I think something is. Sometimes that makes life hard for me (some bosses don't like this), but that's the way I am. Honestly, I'm very sorry if what I wrote offended you. Rest assured that it was meant as a critique of your code, not your person.
sbi
@rubber: Oh, that's a cute one!
sbi
@flies: BTW, you should properly @address replies in comments, so that they show up in the Response tab of whoever you're replying to. If it wasn't for davka's comment @me, I wouldn't even have seen any of your comments.
sbi
@sbi thanks for your response. must grow thicker skin :) You were correct to point out the obvious errors, and I appreciate your commentary otherwise.
flies
A: 

Beside Rabin-Karp-Algorithm and Knuth-Morris-Pratt-Algorithm, my Algorithm-Book suggests a Finite State Machine for String Matching. For every Search String you need to build such a Finite State Machine.

Christian Ammer
That's what boost::regex is already doing.
Billy ONeal
Do you say that because you've read the source to BOOST, or is there some other reason you know that BOOST doesn't do depth-first non-deterministic matching the way its user documentation would suggest?
Ian
+3  A: 

Check out the Aho–Corasick string matching algorithm!

maxschlepzig
A: 

I've been looking at the answers but none seem quite explicit... and mostly boiled down to a couple of links.

What intrigues me here is the uniqueness of your problem, the solutions exposed so far do not capitalize at all on the fact that we are looking for several needles at once in the haystack.

I would take a look at KMP / Boyer Moore, for sure, but I would not apply them blindly (at least if you have some time on your hand), because they are tailored for a single needle, and I am pretty convinced we could capitalized on the fact that we have several strings and check all of them at once, with a custom state machine (or custom tables for BM).

Of course, it's unlikely to improve the big O (Boyer Moore runs in 3n for each string, so it'll be linear anyway), but you could probably gain on the constant factor.

Matthieu M.
+1  A: 

Look at this: http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/configuration/algorithm.html

The existence of a recursive/non-recursive distinction is a pretty strong suggestion that BOOST is not necessarily a linear-time discrete finite-state machine. Therefore, there's a good chance you can do better for your particular problem.

The best answer depends quite a bit on how many haystacks you have and the minimum size of a needle. If the smallest needle is longer than a few characters, you may be able to do a little bit better than a generalized regex library.

Basically all string searches work by testing for a match at the current position (cursor), and if none is found, then trying again with the cursor slid farther to the right.

Rabin-Karp builds a DFSM out of the string (or strings) for which you are searching so that the test and the cursor motion are combined in a single operation. However, Rabin-Karp was originally designed for a single needle, so you would need to support backtracking if one match could ever be a proper prefix of another. (Remember that for when you want to reuse your code.)

Another tactic is to slide the cursor more than one character to the right if at all possible. Boyer-Moore does this. It's normally built for a single needle. Construct a table of all characters and the rightmost position that they appear in the needle (if at all). Now, position the cursor at len(needle)-1. The table entry will tell you (a) what leftward offset from the cursor that the needle might be found, or (b) that you can move the cursor len(needle) farther to the right.

When you have more than one needle, the construction and use of your table grows more complicated, but it still may possibly save you an order of magnitude on probes. You still might want to make a DFSM but instead of calling a general search method, you call does_this_DFSM_match_at_this_offset().

Another tactic is to test more than 8 bits at a time. There's a spam-killer tool that looks at 32-bit machine words at a time. It then does some simple hash code to fit the result into 12 bits, and then looks in a table to see if there's a hit. You have four entries for each pattern (offsets of 0, 1, 2, and 3 from the start of the pattern) and then this way despite thousands of patterns in the table they only test one or two per 32-bit word in the subject line.

So in general, yes, you can go faster than regexes WHEN THE NEEDLES ARE CONSTANT.

Ian
A: 

You can do it with very popular Lex & Yacc tools, with the support of Flex and Bison tools. You can use Lex for getting tokens of the string. Compare your pre-defined strings with the tokens returned from Lexer. When match is found, perform the desired action. There are many sites which describe about Lex and Yacc. One such site is http://epaperpress.com/lexandyacc/

bjskishore123