I know that I can use regex to match substrings in a string, but is it possible to match patterns in binary data using regex? If so, what format should the binary data be in: a byte array, a stream, or something else?

edit:

Well, to explain: I have binary data that shouldn't contain certain strings, but the data itself is binary, so I need to detect these patterns in order to mark the data as invalid.
But I can't convert this binary data to a string, since the result would be invalid. Maybe only to a char[] or something.

edit:

Now I am thinking of converting the binary data to a basic encoding (any hints on which is the most basic encoding available? Certainly not Unicode; I think ASCII?) and then using regex.
But the question is: will I be able to convert any binary data to a string using this encoding, or will I encounter cases that are invalid and cause exceptions during the conversion?

+1  A: 

Yes, it is possible, but why would you want to? You would need to encode the data as a string first, of course, but if you are going to go to that trouble, why not simply deserialize the data into a more sensible data structure?

Regular expressions are for matching strings only - if you have binary data then you can be quite sure that a regex is the wrong solution to your problem.

Andrew Hare
Well, the binary data I have can contain strings, but mostly it's binary. I just need to detect some string patterns that will mark the data as invalid.
Karim
+2  A: 

The technical answer to your question is yes, since you could just treat the binary data as a string of a particular encoding, but I don't believe that's what you're asking.

If you're asking if there's a library designed to do pattern matching on an array of bytes, then the .NET regex system will not do this and there isn't such a library that I'm aware of.

Adam Robinson
I don't want to treat this data as a string. But is there any other method I can use to achieve this without using regex?
Karim
@Karim: Not that I am aware of, but there are plenty of tutorials and explanations online about writing your own implementation of regular expressions. I wouldn't imagine it would be incredibly difficult to adapt one of these to work on binary data rather than text.
Adam Robinson
@Adam Robinson Thanks, but I don't think implementing my own regex for binary data would be feasible. I will try to convert the binary data to a string using ASCII encoding and hope this works fine :)
Karim
Just do a one-to-one mapping of the elements of your binary data to Unicode characters, run regular expressions on the resulting string, then invert the mapping to get back to bytes. Your idea seems fine, Adam. There is nothing special about strings of readable characters; regular expressions are far more general than that.
Joren
What do you mean by one-to-one mapping? Is this not the same as System.Text.Encoding.Unicode.GetString(binary[])?
Karim
@Karim: If you're going to use string manipulation, I would suggest ASCII, as it's a single-byte encoding (so you'll get one character for every byte). Using a variable-byte encoding can/will result in the length of the string not necessarily corresponding to the length of your byte array.
Adam Robinson
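The length point above can be sketched in a few lines. This is Python rather than .NET, chosen only for illustration; the codec names (`latin-1`, `utf-16-le`, `utf-8`) are Python's, and the four sample bytes are made up. A single-byte codec is used for the one-to-one case.

```python
# Four arbitrary bytes standing in for binary data (hypothetical sample).
data = bytes([0x41, 0x00, 0xC3, 0xA9])

# Single-byte mapping: every byte becomes exactly one character.
one_to_one = data.decode("latin-1")

# Multi-byte decoders group bytes together, so positions in the string
# no longer correspond to positions in the byte array.
as_utf16 = data.decode("utf-16-le")  # two bytes per code unit
as_utf8 = data.decode("utf-8")       # 0xC3 0xA9 collapses into one character

print(len(data), len(one_to_one), len(as_utf16), len(as_utf8))
```

Only the single-byte decoding keeps the string the same length as the byte array, which is what makes match offsets meaningful later on.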
@Adam Robinson Thanks, I wrote an implementation using ASCII; we'll see how that turns out in the real world :)
Karim
@Adam: "so you'll get one character for every byte" That'd be true if ASCII had more than 128 characters.
Joren
@Joren I think the ASCII encoding in .NET includes 256 characters, not 128; not sure, though. Of course not all of them will be printable, but that is not important. Regex will work on it anyway.
Karim
@Karim: I'm not sure, I think I remember ASCIIEncoding complaining if you want it to convert byte values over 127. But it's worth trying at least. :) If it doesn't work, you can always manually convert byte values to Unicode code points.
Joren
Are you sure? http://msdn.microsoft.com/en-us/library/system.text.asciiencoding.aspx says otherwise: "Since ASCIIEncoding supports only the Unicode character values between U+0000 and U+007F (...)"
Joren
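To illustrate why a 128-character encoding cannot carry arbitrary bytes, here is a rough sketch using Python's `ascii` codec. This shows Python's behaviour only; the .NET class may treat out-of-range bytes differently (for example by substituting a placeholder instead of raising), so take it as an analogy, not a description of `ASCIIEncoding`. The sample bytes are hypothetical.

```python
# Two ASCII bytes followed by two bytes above 127 (hypothetical sample).
data = bytes([0x48, 0x69, 0x80, 0xFF])

# Strict decoding rejects anything outside 0..127.
try:
    data.decode("ascii")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lossy decoding substitutes a placeholder character, so distinct
# byte values become indistinguishable and cannot be recovered.
lossy = data.decode("ascii", errors="replace")
round_trip = lossy.encode("ascii", errors="replace")

print(strict_ok, lossy[2] == lossy[3], round_trip == data)
```

Either way the bytes above 127 are lost, so a pattern that depends on them can no longer be matched reliably.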
@Joren: Interesting, I was always under the impression that `ASCIIEncoding` took advantage of the extended ASCII character set, but evidently not. In order to use that, you'll probably need to use the Western European ISO standard. Instead of using `Encoding.ASCII`, use `Encoding.GetEncoding(28591)`. The various Unicode encodings will not work for this technique, as you can end up having the decoder interpret multiple bytes as a single character.
Adam Robinson
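A minimal sketch of the whole technique, again in Python for illustration only: Python's `latin-1` codec plays the role that `Encoding.GetEncoding(28591)` plays in the comment above. The payload and the forbidden pattern are invented for the example.

```python
import re

# Hypothetical payload: mostly non-text bytes with a forbidden marker inside.
data = bytes([0x00, 0xFF, 0x80]) + b"DROP TABLE" + bytes([0x9D, 0x01])

# latin-1 maps byte n to code point U+00nn, so nothing is lost.
text = data.decode("latin-1")
assert text.encode("latin-1") == data  # lossless round trip

# re.ASCII keeps classes such as \s from matching bytes like 0x85 or 0xA0,
# which would otherwise count as Unicode whitespace after the mapping.
match = re.search(r"DROP\s+TABLE", text, re.ASCII)
is_invalid = match is not None

# One character per byte, so match offsets are also byte offsets.
start, end = match.span()
print(is_invalid, start, end, data[start:end])
```

Because the mapping is one character per byte, the span reported by the regex can be used directly to slice the original byte array.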
@Adam Robinson: Isn't the Windows-1252 encoding better suited? I see on Wikipedia that all 256 characters are represented in Windows-1252, but in ISO 8859-1 some are blank. http://en.wikipedia.org/wiki/Windows-1252 http://en.wikipedia.org/wiki/Iso-8859-1
Karim
@Karim: Either one should work as long as they map to individual characters, even if the character isn't printable. In any case, it's probably worth trying both.
Adam Robinson
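One way to "try both" is to push all 256 byte values through each candidate and see which ones fail. The sketch below does that with Python's `latin-1` and `cp1252` codecs; these are Python's tables, and the corresponding .NET encodings may well treat the undefined positions differently, so the same check would need to be repeated there.

```python
all_bytes = bytes(range(256))

# Python's latin-1 (ISO 8859-1) codec assigns a code point to every byte value.
latin1_text = all_bytes.decode("latin-1")

# Collect the byte values that Python's cp1252 codec refuses to decode.
unmapped = []
for value in all_bytes:
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        unmapped.append(value)

print(len(latin1_text), [hex(v) for v in unmapped])
```

In Python's tables it is Windows-1252 that has a handful of unassigned positions, while latin-1 round-trips every byte; whichever encoding is chosen, a full 0..255 round-trip test like this is a cheap sanity check.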
A: 

I haven't tried this, but I'll bet you could convert your binary data to a base64 string, then use a regex to find your search string - of course, you would have to encode your search string in base64 as well.

Ray
I don't think this will work, because the byte offsets will be different in the base64 string. I mean the data won't be byte (8-bit) aligned but 6-bit aligned instead.
Karim
I thought if you converted the search string as well it might work out - but now that you mention it, I can see that it would work maybe only 1 out of 3 times, when the relevant bytes just happen to line up right. Oh well, sounded good when I wrote it... :(
Ray
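The alignment problem is easy to reproduce. In this small Python sketch (the 3-byte needle and the zero padding are made up for the demonstration), the needle is placed at byte offsets 0 through 5 and the base64 form of the needle is searched for in the base64 form of the data.

```python
import base64

needle = b"BAD"  # 3 bytes, so its base64 form carries no padding
encoded_needle = base64.b64encode(needle)

results = []
for offset in range(6):
    # Zero bytes before and after; the needle starts at byte `offset`.
    data = bytes(offset) + needle + bytes(3)
    results.append(encoded_needle in base64.b64encode(data))

print(results)
```

The search succeeds only at offsets 0 and 3, where the needle starts on a 3-byte group boundary, which matches the "1 out of 3 times" estimate above.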
Well, thanks for your input. All ideas are good, since they can lead to other ideas even if the first idea doesn't work. :)
Karim