views:

77

answers:

1

Hi,

I am reading data from a (serial) port using a non-blocking read() function (in C/C++). This means the data comes-in in chunks of undetermined (but reported) sizes (incl. 0) each time I "poll" the port. I then need to parse this "stream" for certain patterns (not XML).

My naive implementation concatenates the new string to the previous stream-string each time read() returns a non-zero buffer, and re-parses the whole string. When a pattern is matched, the relevant part is discarded leaving only the tail of the string for the next time.

Obviously there are much more efficient way to do this, for example, incremental parsing a-la SAX, deque-like buffers or similar string slices etc. Also, obviously, I am not the first to have to do this type of stream parsing.

Does anyone know of any library that already does this sort of thing? Preventing memory-overflow in case of missing pattern matches would also be a big plus.

Thanks, Adi

A: 

You can do some tricks depending on your pattern.

  • Looking for one character like a newline you only need to scan the new portion of the string.
  • If you are looking for \r\n then you only need to scan the new portion starting with the last character of the previous portion.
  • If you have a pattern with a known ending part then you only have to scan for that ending to know if you need to scan the whole buffer.
  • If you have some kind of synchronizing character, like semicolon in C/C++, then you can parse or declare a parse error whenever you find one.
Zan Lynx
What if the pattern is a regex? It is too early to answer without more information, IMO.
Moron
@Moron: You can still use those tricks with a regex. If your regex has a static end pattern you can look for that in the new buffer data before running the whole regex.
Zan Lynx
@Zan: Suppose the regex was ^someregex$. Then we can solve this in constant space by constructing a DFA and running the stream through it.
Moron
I have to support several different devices, so I wanted a more generalized parser. Currently, I look for a header-string and footer-string and return this "sentence" for later additional parsing. Actually, I was hoping for a more general parser, e.g. regexp.Also, I am looking for something that also takes care of putting the stream together from the chunks of data that are received via the stream. I'm now looking at Boost.Asio to see if it fit this.
Adi
@Moron: If you have access to the guts of the DFA implementation then you can feed data into the DFA in read-buffer sized chunks. In that case you never suffer from reparsing the buffer from start.
Zan Lynx
@Zan: Yes, all I was saying is that once we know exactly what the OP is asking, we can do better than just give general guidelines which might not even be applicable to OP's specific case. That is all.
Moron