views:

470

answers:

2

Hi

I am searching for a fast and safe way to apply regular expressions to streams.

I found some examples on the internet that talk about converting each buffer to a string and then applying the regex to that string.

This approach has two problems:

  • Performance: converting to strings and then garbage-collecting them wastes time and CPU, and could surely be avoided if there were a more native way to apply a regex to a stream.
  • Pure regex support: sometimes a pattern can match only when two buffers are combined (buffer 1 ends with the first part of the match, and buffer 2 starts with the second part). The convert-to-string approach cannot handle this kind of match natively. I have to supply extra information, such as the maximum length the pattern can match, and that can never support the + and * quantifiers, because their match length is unbounded.
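To make the second point concrete, here is a minimal Python sketch of the "maximum match length" workaround described above. It is an illustration only, not a .NET solution: the function name `find_in_chunks` and the parameter `max_match_len` are hypothetical, the pattern is applied to bytes directly, and the sketch assumes the pattern never matches more than `max_match_len` bytes, never matches the empty string, and uses no anchors or lookaround.

```python
import re

def find_in_chunks(pattern, chunks, max_match_len):
    """Yield (absolute_offset, matched_bytes) for a bounded-length pattern
    applied to an iterable of byte chunks."""
    regex = re.compile(pattern)
    buf = b""
    base = 0  # absolute stream offset of buf[0]
    for chunk in chunks:
        buf += chunk
        # A match starting at or before `limit` already has max_match_len
        # bytes available after its start, so more data cannot change it.
        limit = len(buf) - max_match_len
        cut = 0
        for m in regex.finditer(buf):
            if m.start() > limit:
                break  # might still grow; re-examine after the next chunk
            yield base + m.start(), m.group()
            cut = m.end()
        # Drop everything that can no longer be part of a future match,
        # keeping at most max_match_len - 1 trailing bytes.
        cut = max(cut, limit + 1, 0)
        buf = buf[cut:]
        base += cut
    # End of stream: whatever is left cannot grow any further.
    for m in regex.finditer(buf):
        yield base + m.start(), m.group()

chunks = [b"xxERR", b"OR123yy", b"ERROR4", b"56"]
matches = list(find_in_chunks(rb"ERROR\d{3}", chunks, max_match_len=8))
```

Both matches here straddle a chunk boundary and are still found, but only because the pattern has a known upper bound of 8 bytes. A pattern such as `ERROR\d+` has no such bound, which is exactly the limitation described in the second bullet.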

So the convert-to-string approach is not fast, and it doesn't fully support regex.

Is there any way or library that can apply a regex to a stream without converting to strings, and with full regex support?

Thanks.

+1  A: 

Perhaps this article could help? Though I suppose it might be one of the examples on the internet that you already found and that weren't of help.

Building a Regular Expression Stream Search with the .NET Framework

Amber
+1  A: 

It seems that you would know the start and end delimiters of the matches you are trying to get, correct? (e.g. [ and ], or START and END.) If so, would it make sense to search for these delimiters as data comes in from your stream, create a substring of whatever lies between them, and do further processing on that?

I know it's pretty much the same thing as rolling your own, but it would serve a more specific purpose, and you could even process the data as it comes in.
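As a rough sketch of that idea (in Python, for illustration only), the scanner below pulls out the bytes between a start and an end delimiter while the data arrives in chunks. The name `extract_delimited` and its parameters are hypothetical, and the sketch assumes delimiters are non-empty and do not nest.

```python
def extract_delimited(chunks, start, end):
    """Yield the bytes found between `start` and `end` delimiters
    in an iterable of byte chunks."""
    buf = b""
    inside = False  # True once a start delimiter has been consumed
    for chunk in chunks:
        buf += chunk
        while True:
            if not inside:
                i = buf.find(start)
                if i < 0:
                    # Keep only a tail that could be the beginning of a
                    # start delimiter split across two chunks.
                    keep = len(start) - 1
                    buf = buf[-keep:] if keep else b""
                    break
                buf = buf[i + len(start):]
                inside = True
            else:
                j = buf.find(end)
                if j < 0:
                    break  # end delimiter not seen yet; wait for more data
                yield buf[:j]
                buf = buf[j + len(end):]
                inside = False

chunks = [b"noise [al", b"pha] more [be", b"ta] tail [", b"gamma]"]
pieces = list(extract_delimited(chunks, b"[", b"]"))
```

Each yielded piece is complete, so an ordinary regex can then be applied to it without worrying about chunk boundaries. Note that the buffer grows until the end delimiter shows up, so memory use depends on how far apart the delimiters are.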

The problem with regular expressions in this instance is that they work on matches, so you can only match against the amount of input you currently have. With a stream, your options are: read in all the data to get all the matches (a space and time problem), try to match one character at a time as it arrives (pretty useless), match in chunks (where a match that spans two chunks is easily missed), or generate strings of interest which, if they meet your criteria, can be shipped off elsewhere for further processing.
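The weakness of matching in chunks is easy to reproduce. The following small Python illustration (the helper names are made up for this example) compares matching each chunk separately with matching the joined data, using an unbounded `\d+`:

```python
import re

def matches_per_chunk(pattern, chunks):
    """Apply the pattern to each chunk on its own."""
    regex = re.compile(pattern)
    return [m for chunk in chunks for m in regex.findall(chunk)]

def matches_whole(pattern, chunks):
    """Apply the pattern to all the data at once (needs it all in memory)."""
    return re.findall(pattern, b"".join(chunks))

chunks = [b"id=12", b"34; id=56"]
per_chunk = matches_per_chunk(rb"id=\d+", chunks)  # [b"id=12", b"id=56"]
whole = matches_whole(rb"id=\d+", chunks)          # [b"id=1234", b"id=56"]
```

Here the chunked version does not even miss the first match; it silently reports a truncated one, because nothing tells the regex engine that more digits may follow in the next chunk.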

Fry