I want to scrape string data from some binary files that contain embedded SQL statements. I don't need any fancy cleanup, just some way to extract the readable text. I'm using VB.NET, but a call to an external utility would work too.
This is less trivial than it may seem at first, because a string can be encoded in many ways. What exactly do you consider "readable text", and what do the unreadable parts look like? Say the data looks like this:
&8)JÓxZZ`\■£ÌS?E?L?E?C?T?*?F?R?O?M?m?y?T?b?l?§ıÍ4¢
then you are lucky, because the text is most likely encoded in UTF-16 or another fixed-width multi-byte encoding. These are fairly easy to recognize: for mostly ASCII text stored as UTF-16, every other byte is zero. But in just about all other cases (UTF-8, ISO-8859-1, Windows-1252) it is next to impossible to classify an individual byte as text or non-text, unless you know a fair amount about how a piece of "readable text" starts and how it ends.
The point is: almost anything is allowed and can count as readable text. UTF-8, ASCII and Windows-1252 even allow NUL characters (while some programming languages don't). Here's a thread that gives a VB example of how you can proceed; it might give you some hints.
PS: Analyzing this type of data can be hard. It would help a great deal if you could upload your file somewhere so we can have a look.
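To make the UTF-16 case concrete, here is a minimal sketch of how such runs could be picked out, assuming the embedded text is plain ASCII stored as little-endian UTF-16 (printable byte, then a zero byte). The module name, the function name ExtractUtf16LeRuns and the minChars threshold are made up for illustration; this is not taken from the thread mentioned above.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module Utf16Sketch
    ' Finds runs of at least minChars two-byte units whose first byte is
    ' printable ASCII (32 to 126) and whose second byte is zero.
    ' Assumption: the text is ASCII-only and stored as UTF-16LE.
    Function ExtractUtf16LeRuns(data As Byte(), minChars As Integer) As List(Of String)
        Dim results As New List(Of String)
        Dim current As New StringBuilder()
        Dim i As Integer = 0
        While i < data.Length - 1
            If data(i) >= 32 AndAlso data(i) <= 126 AndAlso data(i + 1) = 0 Then
                current.Append(ChrW(data(i)))
                i += 2
            Else
                ' The pattern broke: keep the run if it is long enough.
                If current.Length >= minChars Then results.Add(current.ToString())
                current.Clear()
                i += 1 ' advance one byte so runs at odd offsets are found too
            End If
        End While
        ' Flush a run that reaches the end of the data.
        If current.Length >= minChars Then results.Add(current.ToString())
        Return results
    End Function
End Module
```

Calling ExtractUtf16LeRuns(File.ReadAllBytes(path), 4) would return the candidate strings. It ignores non-ASCII characters and big-endian data, so treat it as a starting point rather than a complete detector.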
The GNU strings utility has been around forever and does more-or-less exactly this by using a heuristic to yank any data that "looks like a string" from a binary.
Grab the GNU binutils (including strings) for Win32 from MinGW: http://sourceforge.net/projects/mingw/files/.
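If you would rather stay inside VB.NET, the same kind of heuristic is easy to approximate. The sketch below is not the strings source code, only an imitation of its default behaviour as I understand it: report every run of at least four printable ASCII bytes. The names StringsSketch and ExtractAsciiRuns are hypothetical, and the file path is expected as the first command-line argument.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text

Module StringsSketch
    ' Returns every run of at least minLength printable ASCII bytes
    ' (tab and 32 to 126), roughly what "strings" reports by default.
    Function ExtractAsciiRuns(data As Byte(), minLength As Integer) As List(Of String)
        Dim results As New List(Of String)
        Dim current As New StringBuilder()
        For Each b As Byte In data
            If b = 9 OrElse (b >= 32 AndAlso b <= 126) Then
                current.Append(ChrW(b))
            Else
                ' A non-printable byte ends the current run.
                If current.Length >= minLength Then results.Add(current.ToString())
                current.Clear()
            End If
        Next
        ' Flush a run that reaches the end of the data.
        If current.Length >= minLength Then results.Add(current.ToString())
        Return results
    End Function

    Sub Main(args As String())
        ' args(0) is the path of the binary file to scan (assumed input).
        For Each s As String In ExtractAsciiRuns(File.ReadAllBytes(args(0)), 4)
            Console.WriteLine(s)
        Next
    End Sub
End Module
```

Raising the minimum length cuts down on false positives from random bytes that happen to be printable, at the cost of missing short strings.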
Thanks all. Great ideas. Really helped me think. Upvotes all around. It turned out I didn't need to be very sure that they were strings, so I went with a quick, sloppy, ugly hack.
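'strip out non-string characters
Dim newByteArray(byteArray.Length - 1) As Byte
Dim i As Integer = 0
For Each b As Byte In byteArray
    'keep tab, LF, CR and printable ASCII (32 to 126)
    If b = 9 OrElse b = 10 OrElse b = 13 OrElse (b > 31 AndAlso b < 127) Then
        newByteArray(i) = b
        i += 1
    End If
Next
'move the kept bytes (the first i) into a string
resultString = System.Text.Encoding.ASCII.GetString(newByteArray, 0, i)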