Assume we capture packets with the C API of libpcap. Is it efficient to search the payload for strings with strstr() at line speed (e.g. Mbps/Gbps)? For example, strstr(payload,"User-Agent");

Would it be more efficient to do it with a regular expression pattern matching library, such as libpcre?

If we want to do that only for HTTP header fields, is there a C API for it? It is not clear to me whether libcurl can do that... Thank you in advance.

A: 

I really can't imagine strstr being any slower than a regular-expression alternative. However, if you need to pull out various HTTP header values, then parsing the packets would be a pretty straightforward, better option. Does libpcap not include any built-in parsers?

Will A
The libpcap C API can pull out information from the TCP/IP headers but not from the payload. Since HTTP headers are part of the payload, they need to be parsed some other way.
A: 

http://www.arstdesign.com/articles/fastsearch.html has some metrics showing that strstr is decently performant. For short string matches, I doubt a regex library can beat good optimized assembly.

Sanketh I
Thank you for your answer. It looks like strstr is the fastest choice.
+1  A: 

If you are only searching for a single short string, then nothing will be much faster than the linear comparison used by strstr(). That said, strstr()'s special treatment of NUL bytes is almost certainly not what you want for examining network traffic, and you would be better off writing your own implementation which treated all bytes the same and accepted length parameters.
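A minimal sketch of such a length-aware search, assuming the payload pointer and captured length come from the capture callback. The name find_bytes is made up for illustration; glibc and the BSDs also ship a non-standard memmem() that does the same job.

```c
#include <stddef.h>
#include <string.h>

/* Find the first occurrence of needle (nlen bytes) in hay (hlen bytes).
 * Unlike strstr(), NUL bytes are ordinary data and nothing is read past
 * hay + hlen. Returns NULL if there is no match.
 * find_bytes is a hypothetical helper name, not a libpcap or libc call. */
static const unsigned char *
find_bytes(const unsigned char *hay, size_t hlen,
           const void *needle, size_t nlen)
{
    if (nlen == 0)
        return hay;
    if (hlen < nlen)
        return NULL;

    const unsigned char first = *(const unsigned char *)needle;
    const unsigned char *p = hay;
    const unsigned char *last = hay + (hlen - nlen); /* last valid start */

    while (p <= last) {
        /* memchr skips ahead to the next byte equal to the first needle byte */
        p = memchr(p, first, (size_t)(last - p) + 1);
        if (p == NULL)
            return NULL;
        if (memcmp(p, needle, nlen) == 0)
            return p;
        p++;
    }
    return NULL;
}
```

A call would look like find_bytes(payload, caplen, "User-Agent", 10). A stray NUL byte earlier in the payload does not end the search, and a packet without a terminating NUL cannot make it run off the end of the buffer.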

If you're searching for multiple strings, you're better off using a fast string-matching algorithm like Aho–Corasick or building a state machine which matches the strings you want in the context you want—i.e., a parser. For parsing a mostly-regular grammar like HTTP's in C, the ragel state machine compiler is my tool of choice.
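To make the multi-string case concrete, here is a rough sketch of Aho-Corasick with the failure links folded into a dense transition table. The names (struct ac, ac_build, ac_scan) and the limits (256 states, 32 patterns) are assumptions chosen to keep the example short; a real implementation would size the tables from the pattern set.

```c
#include <stddef.h>
#include <string.h>

#define AC_MAX_STATES 256   /* assumed limit: a handful of short patterns */

struct ac {
    int next[AC_MAX_STATES][256]; /* transition table, completed into a DFA */
    int fail[AC_MAX_STATES];      /* failure links, used only while building */
    unsigned out[AC_MAX_STATES];  /* bitmask of patterns that end in a state */
    int nstates;
};

/* Build the automaton for up to 32 non-empty NUL-terminated patterns.
 * Returns 0 on success, -1 if a limit is exceeded. */
static int ac_build(struct ac *a, const char *const *pats, int npats)
{
    if (npats > 32)
        return -1;
    memset(a, 0, sizeof *a);
    a->nstates = 1;                       /* state 0 is the root */

    /* Phase 1: insert the patterns into a trie (0 means "no edge yet"). */
    for (int i = 0; i < npats; i++) {
        int s = 0;
        for (const unsigned char *p = (const unsigned char *)pats[i]; *p; p++) {
            if (a->next[s][*p] == 0) {
                if (a->nstates == AC_MAX_STATES)
                    return -1;
                a->next[s][*p] = a->nstates++;
            }
            s = a->next[s][*p];
        }
        a->out[s] |= 1u << i;
    }

    /* Phase 2: breadth-first pass that sets failure links and fills every
     * missing edge with the transition of the failure state. */
    int queue[AC_MAX_STATES], head = 0, tail = 0;
    for (int c = 0; c < 256; c++)
        if (a->next[0][c])
            queue[tail++] = a->next[0][c];
    while (head < tail) {
        int s = queue[head++];
        for (int c = 0; c < 256; c++) {
            int t = a->next[s][c];
            if (t) {
                a->fail[t] = a->next[a->fail[s]][c];
                a->out[t] |= a->out[a->fail[t]];
                queue[tail++] = t;
            } else {
                a->next[s][c] = a->next[a->fail[s]][c];
            }
        }
    }
    return 0;
}

/* One pass over data[0..len); returns a bitmask of the patterns seen. */
static unsigned ac_scan(const struct ac *a, const unsigned char *data, size_t len)
{
    unsigned found = 0;
    int s = 0;
    for (size_t i = 0; i < len; i++) {
        s = a->next[s][data[i]];
        found |= a->out[s];
    }
    return found;
}
```

The automaton is built once, for example from { "GET ", "User-Agent:", "Host:" }, and ac_scan is then called per packet. The per-byte cost is a single table lookup no matter how many patterns are loaded. The table is about 256 KB with these limits, so the struct is better kept in static storage than on the stack.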

llasram
I am searching for multiple strings. 1) I cannot really understand why a state machine would be better in this case (e.g. strstr(payload,"GET")!=NULL would point exactly to the GET, and I can then parse the strings after that), and 2) why is a ragel state machine better than using strncmp? Thank you!
If you have an n-byte packet and m strings you might like to find in it, then a linear search for each string is at least O(m*n). With a state-machine approach -- either Aho-Corasick etc or a parser -- you'll just do a single linear pass over the data. If you're trying to find structured information, such as an HTTP verb followed by a correctly formatted host-relative URI, followed by "HTTP/" then a version, then using a parser-generator will save you significant pain by allowing you to rigorously describe your expected input.
llasram
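As an illustration of the parser approach described in the comment above, here is a hand-written state machine for the request line. It is only an approximation of the kind of code a generator such as ragel would produce from a grammar, and it deliberately simplifies HTTP: it assumes an uppercase method, a URI starting with "/", and a single-digit major.minor version. The name is_request_line is made up for this sketch.

```c
#include <stddef.h>

enum rl_state {
    RL_METHOD, RL_URI_START, RL_URI, RL_PROTO,
    RL_MAJOR, RL_DOT, RL_MINOR, RL_CR, RL_LF
};

/* Returns 1 if buf[0..len) starts with a line of the simplified form
 *   METHOD SP /uri SP "HTTP/" DIGIT "." DIGIT CR LF
 * and 0 otherwise (including when the line is truncated). */
static int is_request_line(const unsigned char *buf, size_t len)
{
    static const char proto[] = "HTTP/";
    enum rl_state st = RL_METHOD;
    size_t nmethod = 0, nproto = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        switch (st) {
        case RL_METHOD:                     /* one or more uppercase letters */
            if (c >= 'A' && c <= 'Z') nmethod++;
            else if (c == ' ' && nmethod > 0) st = RL_URI_START;
            else return 0;
            break;
        case RL_URI_START:                  /* host-relative URI starts with '/' */
            if (c != '/') return 0;
            st = RL_URI;
            break;
        case RL_URI:                        /* printable ASCII up to the next space */
            if (c == ' ') st = RL_PROTO;
            else if (c <= 0x20 || c >= 0x7f) return 0;
            break;
        case RL_PROTO:                      /* literal "HTTP/" */
            if (c != (unsigned char)proto[nproto]) return 0;
            if (++nproto == sizeof proto - 1) st = RL_MAJOR;
            break;
        case RL_MAJOR:
            if (c < '0' || c > '9') return 0;
            st = RL_DOT;
            break;
        case RL_DOT:
            if (c != '.') return 0;
            st = RL_MINOR;
            break;
        case RL_MINOR:
            if (c < '0' || c > '9') return 0;
            st = RL_CR;
            break;
        case RL_CR:
            if (c != '\r') return 0;
            st = RL_LF;
            break;
        case RL_LF:
            return c == '\n';
        }
    }
    return 0;                               /* ran out of bytes before CRLF */
}
```

Every byte is examined once, and a payload that merely contains "GET" somewhere (binary data, a TLS record, an HTTP body) is rejected at the first byte that does not fit the grammar, which a bare strstr(payload,"GET") cannot do.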