I'm looking for an example of applying a regular expression to a Java I/O stream without simply converting the stream to a string, since I want to preserve the binary data. Most of the examples on the Internet focus on text data...

A: 

Convert the stream to a byte array.

tpdi
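
For illustration, here is a minimal sketch of that approach. The class name `BinaryRegexDemo`, the helper names, and the sample bytes are made up for this example. It reads the whole stream into a byte array and then decodes it as ISO-8859-1 purely so that `java.util.regex` can scan it. That charset maps each byte value 0x00 to 0xFF to the char with the same value, so match offsets are byte offsets and the original array is left untouched. It assumes the data fits in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BinaryRegexDemo {

    // Drain the stream into a byte array (assumes the data fits in memory).
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Returns the byte offset of the first match, or -1 if there is none.
    // ISO-8859-1 is a one-to-one byte-to-char mapping, so char index == byte index.
    static int indexOf(byte[] data, Pattern pattern) {
        String view = new String(data, StandardCharsets.ISO_8859_1);
        Matcher m = pattern.matcher(view);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample: a zero byte followed by the bytes CA FE BA BE.
        byte[] sample = {0x00, (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0x00, 0x34};
        byte[] data = readAll(new ByteArrayInputStream(sample));

        // \xhh in a Java regex matches the char with that hex value.
        Pattern magic = Pattern.compile("\\xCA\\xFE\\xBA\\xBE");
        System.out.println(indexOf(data, magic)); // prints 1
    }
}
```

Because the pattern is written with `\xhh` escapes, it describes byte values rather than text, and the returned index can be used directly against the original byte array.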
+3  A: 

Regex operations must be performed on strings, which are bytes of binary data interpreted through an encoding. You can't perform regex operations on bytes when you have no idea what they represent.

Yuval A
-1 I disagree. There is no reason why you cannot apply regular expressions to binary data. Binary data does not mean you have no idea what it represents.
Marcelo Morales
Supposedly, you could take a stream of 0's and 1's and perform regex on it. However, none of the existing Java APIs give you access to that raw stream without converting it to something more meaningful.
Yuval A
+1 Agreed. Applying a regexp to binary data does not make sense. Regexps are fundamentally geared towards Strings and are defined using Strings, so you'll always be using a string encoding, either explicitly or implicitly.
Michael Borgwardt
I'm not voting up or down, but suppose you had a "binary" protocol like ASN.1 or Java serialization. It would make sense to look for regular expressions in such a string of bytes.
erickson
There is a danger that some portion of the binary data might match your regexp by coincidence. In which case you may end up making a bogus match, or corrupting the binary data. Depending on your subject data and regexp, you may be able to discard such concerns. But in the general case, binary data can contain strings which do not actually represent strings, implying a risk of false matches. That is why it would be better practice to separate the data first, and why a truly general solution does not exist. Having said that, I upvoted the other answer, because it helps the OP more. ;)
joeytwiddle
+3  A: 

The needed functionality is not present in the Java standard library. You will have to use Jakarta Regexp, and specifically its StreamCharacterIterator class. This class wraps an InputStream for use in regexp operations.

If you want to use the standard regular expression package, I would suggest taking the source of that class (here) and changing its contract so that it implements CharSequence instead of CharacterIterator.

Marcelo Morales
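
As a rough sketch of the CharSequence idea, the class below exposes a byte array to `java.util.regex` without copying it into a String. This is not the Jakarta source: `ByteCharSequence` and the sample data are hypothetical, and it wraps an in-memory array rather than an InputStream. A true streaming version would need its own buffering, because the matcher asks for `length()` and random access through `charAt()`.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical adapter: presents each byte as a char in the range 0x00..0xFF.
public final class ByteCharSequence implements CharSequence {
    private final byte[] data;
    private final int offset;
    private final int length;

    public ByteCharSequence(byte[] data) {
        this(data, 0, data.length);
    }

    private ByteCharSequence(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        // Mask to avoid sign extension of bytes >= 0x80.
        return (char) (data[offset + index] & 0xFF);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        // Shares the backing array; no bytes are copied.
        return new ByteCharSequence(data, offset + start, end - start);
    }

    @Override
    public String toString() {
        // Only needed when a caller asks for text, e.g. Matcher.group().
        return new String(data, offset, length, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Sample bytes: two filler bytes, then 0x89 'P' 'N' 'G' CR LF.
        byte[] data = {0x00, 0x01, (byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A};
        Matcher m = Pattern.compile("\\x89PNG\\r\\n").matcher(new ByteCharSequence(data));
        if (m.find()) {
            System.out.println(m.start()); // prints 2
        }
    }
}
```

The match positions reported by `start()` and `end()` are byte offsets into the original array, so the matched bytes can be sliced out without any charset round trip.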