I want to scrape string data from some binary files that contain embedded SQL statements. I don't need any fancy cleanup, just some way to extract the readable text. I'm using VB.NET, but a call to an external utility would work too.

+1  A: 

This is not as trivial as it may seem at first. A string can be encoded in many ways. What do you consider "readable text", and what do the unreadable parts look like? Say it looks like this:

 &8)JÓxZZ`\■£ÌS?E?L?E?C?T?*?F?R?O?M?m?y?T?b?l?§ıÍ4¢

In that case you are lucky, because it is likely encoded using UTF-16 or another multibyte encoding. These are fairly easy to recognize. But in just about all other cases (UTF-8, ISO-8859-1, Windows-1252) it is next to impossible to tell whether an individual byte is text or non-text, unless you know a fair amount about how a given piece of "readable text" starts and how it ends.
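To illustrate the "easy to recognize" case: in the sample above, each letter is followed by a byte that shows up as `?`, which is what ASCII text stored as UTF-16LE tends to look like (every character followed by a zero byte). A minimal sketch of how one might pick out such runs, assuming little-endian data and plain ASCII content; the names `ExtractUtf16Runs` and `minChars` are made up for this example:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module Utf16Runs
    ' Hypothetical helper: returns runs of at least minChars printable ASCII
    ' characters stored as UTF-16LE (each character followed by a zero byte).
    Function ExtractUtf16Runs(data As Byte(), minChars As Integer) As List(Of String)
        Dim results As New List(Of String)
        Dim current As New StringBuilder()
        Dim pos As Integer = 0
        While pos + 1 < data.Length
            Dim lo As Byte = data(pos)
            If data(pos + 1) = 0 AndAlso lo >= 32 AndAlso lo <= 126 Then
                ' Printable byte followed by a zero byte: treat as one character.
                current.Append(Convert.ToChar(lo))
                pos += 2
            Else
                ' Run ended: keep it only if it is long enough, then resync.
                If current.Length >= minChars Then results.Add(current.ToString())
                current.Clear()
                pos += 1
            End If
        End While
        If current.Length >= minChars Then results.Add(current.ToString())
        Return results
    End Function

    Sub Main()
        ' Made-up test data: junk bytes around a UTF-16LE encoded statement.
        Dim junk As Byte() = {&HD3, &H1, &HFF}
        Dim data As New List(Of Byte)
        data.AddRange(junk)
        data.AddRange(Encoding.Unicode.GetBytes("SELECT * FROM myTbl"))
        data.AddRange(junk)
        For Each s As String In ExtractUtf16Runs(data.ToArray(), 4)
            Console.WriteLine(s)
        Next
    End Sub
End Module
```

This only handles characters in the ASCII range; anything else in UTF-16 (accented letters, surrogate pairs, big-endian data) would need a proper decoder.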

The point is: anything is allowed and can count as readable text. UTF-8, ASCII and Windows-1252 even allow NUL characters (while some programming languages don't). Here's a thread that gives a VB example of how you can proceed; it might give you some hints.

PS: analyzing this type of data can be hard; it would help a great deal if you could upload your file somewhere so we can have a look.

Abel
When I open the text files I want to look at in Notepad, the string portions I care about are clearly visible. I assumed all I need to do is strip out anything non-string and I'd be set?
Jeff
I wish you were correct, but it isn't that easy. Notepad doesn't *know* these strings, it just displays them. Look at any binary file (e.g. an image) and you will find "readable" parts. Suppose you look at it character by character: can you positively select a range of characters that is always "string", throughout the whole file?
Abel
+2  A: 

For reference, the Sysinternals Strings utility: http://technet.microsoft.com/en-us/sysinternals/bb897439.aspx

Stu
+3  A: 

The GNU strings utility has been around forever and does more-or-less exactly this by using a heuristic to yank any data that "looks like a string" from a binary.

Grab the GNU binutils (including strings) for Win32 from MinGW: http://sourceforge.net/projects/mingw/files/.
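If you'd rather stay inside VB.NET, the basic idea behind that heuristic is simple enough to sketch: keep any run of printable bytes that is at least some minimum length (GNU strings defaults to 4). This is not the GNU implementation, just a rough approximation assuming single-byte ASCII text; `ExtractRuns` and `minLength` are names invented for the example:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module PrintableRuns
    ' Hypothetical helper, not the GNU code: collects runs of at least
    ' minLength printable ASCII bytes (tab and 32-126).
    Function ExtractRuns(data As Byte(), minLength As Integer) As List(Of String)
        Dim results As New List(Of String)
        Dim runStart As Integer = -1
        ' Loop one step past the end so a run touching the end is flushed too.
        For pos As Integer = 0 To data.Length
            Dim printable As Boolean = pos < data.Length AndAlso
                (data(pos) = 9 OrElse (data(pos) >= 32 AndAlso data(pos) <= 126))
            If printable Then
                If runStart < 0 Then runStart = pos
            ElseIf runStart >= 0 Then
                ' Run ended: keep it only if it is long enough.
                If pos - runStart >= minLength Then
                    results.Add(Encoding.ASCII.GetString(data, runStart, pos - runStart))
                End If
                runStart = -1
            End If
        Next
        Return results
    End Function

    Sub Main()
        ' Made-up test data: "AB" is too short to keep, "SELECT 1" is kept.
        Dim data As Byte() = {1, 2, 65, 66, 3, 83, 69, 76, 69, 67, 84, 32, 49, 0, 255}
        For Each s As String In ExtractRuns(data, 4)
            Console.WriteLine(s)
        Next
    End Sub
End Module
```

The minimum length is what cuts down the noise: random binary data produces plenty of isolated printable bytes, but far fewer long runs of them.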

Derrick Turk
FYI, the output of strings will contain a lot of false positives, but given that you know the grammar of the strings you're looking for (SQL statements), it won't be hard to filter only what you're looking for.
Derrick Turk
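A rough sketch of that kind of filtering, assuming the candidate strings have already been extracted and that the statements of interest start with one of a handful of SQL keywords. The keyword list and the names `SqlStart` and `KeepSqlLike` are assumptions made for this example, not anything strings itself provides:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module SqlFilter
    ' Assumed keyword list; extend it to match the statements you expect.
    Private ReadOnly SqlStart As New Regex(
        "^\s*(SELECT|INSERT|UPDATE|DELETE|CREATE|ALTER|DROP)\b",
        RegexOptions.IgnoreCase)

    ' Hypothetical helper: keeps only candidates that start like a SQL statement.
    Function KeepSqlLike(candidates As IEnumerable(Of String)) As List(Of String)
        Dim kept As New List(Of String)
        For Each s As String In candidates
            If SqlStart.IsMatch(s) Then kept.Add(s)
        Next
        Return kept
    End Function

    Sub Main()
        ' Made-up candidates, as a strings-like tool might emit them.
        Dim candidates As String() = {"GCC: (GNU)", "SELECT * FROM myTbl", ".text", "update t set x = 1"}
        For Each s As String In KeepSqlLike(candidates)
            Console.WriteLine(s)
        Next
    End Sub
End Module
```

Anchoring at the start of the string will miss statements that have a stray printable byte glued on in front; dropping the `^\s*` makes the match more forgiving at the cost of more false positives.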
+1, esp because I looked at it as well and considered it unsuitable. Was I wrong! ;-). Note that using heuristics is not a Rosetta stone...
Abel
A: 

Thanks all. Great ideas. Really helped me think. Upvotes all around. It turned out I didn't need to be very sure that they were strings, so I went with a quick, sloppy, ugly hack.

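 'strip out non-string characters
 Dim newByteArray(byteArray.Length - 1) As Byte
 Dim i As Integer = 0
 For Each b As Byte In byteArray
     'keep tab, LF, CR and printable ASCII (32-126)
     If b = 9 OrElse b = 10 OrElse b = 13 OrElse (b > 31 AndAlso b < 127) Then
         newByteArray(i) = b
         i += 1
     End If
 Next

 'move it into a string, using only the i bytes that were kept
 Dim resultString As String = System.Text.Encoding.ASCII.GetString(newByteArray, 0, i)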
Jeff