tags:

views:

123

answers:

6

I want to analyze a message's type for best performance, the message is begin with constant string and one space followed. The constant strings belongs to one known list of string array, like "CUT", "GET", "LOGIN" ...

So I do not like to memcmp(data, "GET", 3) thing repeatedly which is bad for performance. I wonder is there any better solution. Maybe I can compile this constant string arrays into a DFA for quick string match, but I do not know how to do it, and is there any other better solution?

Possible use lexer to do this?

+1  A: 

If there are a limited number of constant strings, you could just hand write your own "dfa". Just examine the first character. If it's a 'c' and the only string in your array that starts with a 'c' is "CUT" then you can break off early because you are done. If there are two strings that start with C, examine the second character, etc. This is obviously not a good solution if there are a ton of possible strings.

There is a GNU C regex library that is probably what you want. I would probably recommend learning a little more about regular expressions with a simpler language like Perl or Python, and then learning the C library when you are comfortable with your reg. exes in general.

Also, I am confused why you are talking about memcpy. Did you mean memcmp? And why would you use that instead of strcmp?

FranticPedantic
regex library would works for that. But I am not sure it's a lightweight solution. For memcmp, I do not see much difference for this problem with strncmp.
arsane
Because I think strcmp will short-circuit at the first difference.
FranticPedantic
+2  A: 

Take a look at Ragel. And at Mongrel for a real-world use. Though I found the mail parsing example that is enclosed with ragel to be a fun small one to experiment with, too.

Though, depending on your protocol, just a check on the first byte might get you down to a single subsequent memcmp() just to verify that your verb is indeed the correct one. 'C', 'L', 'G' are all different values.

Andrew Y
@Andrew, here I did not list the full long list of strings. :)
arsane
@arsane: that's why I said "depending" - since I do not know much about your use case :-)
Andrew Y
A: 

Erm, yea you should NOT use memcpy() or any mem() function when dealing with strings. Why? Well string functions take into account terminating null characters as the former is more of a raw copy of bytes. When dealing with strings always use string.h functions.

in70x
+1  A: 

One thing you can do is have your program iterate over your list of command strings at startup, and use them to build a lookup tree. Then at runtime you can do efficient lookups by navigating down the tree, at each node choosing the next child node based on the next letter in the string, until you either come to a leaf node (in which case you have your match) or hit a dead end (no child nodes for the next letter) and you know that there is no match.

(constructing the tree is pretty easy -- it's pretty much the same algorithm as the lookup algorithm, except that when you don't find a child node for the next letter, you create one and add it to the current node, and then continue)

Jeremy Friesner
+1  A: 

I like ternary search trees for this application. The lookup time is O(m) where m is the length of the input string.

There is also a somewhat helpful article about this data structure here.


Another approach

If your strings happend to be 4 ASCII characters or shorter, you could store them in a 32-bit integer and then do a constant time comparison using a switch statement. If you are able to use 64-bit ints, then you can compare up to 8 ASCII characters.

A function to represent up to the first 4 characters of a string as an integer could look like:

#include <inttypes.h>

uint32_t str_as_int(const char* s) {
  uint32_t n = 0;
  int i;
  for (i = 0; s[i] != '\0' && i < sizeof(uint32_t); i++)
    n |= s[i] << (24 - i * 8);
  return n;
}
Whisty
Nice. I like the approach for len <= 4 strings.
Alok
A: 

I would first use the very simple algorithm you reject out of hand. First get it to work, then get it fast. If I then found I really needed to focus in on that particular tiny part of my system for optimization, I would choose to do something nearly as simple as the obvious solution, but an order of magnitude or so faster.

The obvious thing that comes to mind is to replace one list of candidate strings, with say 26 lists of candidate strings. You've probably guessed, 26 is the number of letters in the alphabet, and each of the strings in a list of candidates now starts with the same letter. Look at the first letter of your message, and use a fast lookup table to select the appropriate candidate list. So if your first letter is 'C', search the candidate list { "COPY", "CUT", "CLOSE" }.

If that's not fast enough, then I would get serious about some heavyweight solution, but I am skeptical about how often such a thing would really be necessary.

Bill Forster