Using only pure Ruby (or justifiably commonplace gems), is there an efficient way to search a large binary document for a specific string of bytes?


Deeper context: the MPEG-4 container format is a serialised data structure indexed by 4-byte tags. Without having to parse the structure fully (I can assume it is valid), I want to pull out specific tags.

For those of you that haven't come across this 'dmap' serialization before, it works something like this:

<4-byte length><4-byte tag><4-byte length><4-byte type definition><8 bytes of something I can't remember><data>

eg, this defines the 'tvsh' (or TV Show) tag as being 'Futurama'

00 00 00 20  ... 
74 76 73 68  tvsh
00 00 00 18  ....
64 61 74 61  data
00 00 00 01  ....
00 00 00 00  ....
46 75 74 75  Futu
72 61 6D 61  rama

The exact structure isn't really important; I'd like to write a method which can pull out the show name when I give it 'tvsh', or tell me that it's season 2 when I give it 'tvsn'.

My first plan would be to use Regular Expressions, but I get the (unjustified) feeling that this would be slow.

Let me know your thoughts! Thanks in advance

+1  A: 

If I understand your description correctly, the whole file consists of a number of such "blocks" of a fixed structure?

In that case, I suggest scanning them one by one, and skipping the ones not of interest to you. Each step should do the following:

  1. Read 8 bytes (using IO#read or a similar method)
  2. From the read header, extract the size (first 4 bytes), and the tag (second 4)
    1. If the tag is the one you need, skip the following 16 bytes and read size-24 bytes.
    2. If the tag is not of interest, skip the following size-8 bytes (the 8-byte header has already been read).
  3. Repeat.

For skipping bytes, you can use IO#seek.
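
A minimal sketch of that loop, assuming a flat (non-nested) run of blocks laid out as in the question's example: an 8-byte header, a 16-byte inner 'data' header, then the payload. The method name find_tag is hypothetical, and a block that is not wanted is skipped with size - 8 because its 8 header bytes have already been consumed.

```ruby
require 'stringio'

# Hypothetical helper: return the payload of the first block whose 4-byte
# tag equals `wanted`, or nil if no such block is found.
# Assumes: flat blocks, size field counts the whole block including header.
def find_tag(io, wanted)
  until io.eof?
    header = io.read(8)
    break if header.nil? || header.bytesize < 8
    size, tag = header.unpack('Na4')   # big-endian uint32, then 4 raw bytes
    break if size < 8                  # malformed size; stop instead of looping forever
    if tag == wanted
      io.seek(16, IO::SEEK_CUR)        # skip the inner 'data' header
      return io.read(size - 24)
    end
    io.seek(size - 8, IO::SEEK_CUR)    # jump to the start of the next block
  end
  nil
end

# The 32-byte 'tvsh' block from the question, rebuilt with Array#pack.
sample = [32, 'tvsh', 24, 'data', 1, 0, 'Futurama'].pack('Na4Na4NNa8')
puts find_tag(StringIO.new(sample), 'tvsh')   # prints Futurama
```

For a real file, the same method should work with File.open(path, 'rb') in place of the StringIO, since only read, seek and eof? are used.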

Mladen Jablanović
One annoying aspect of the format is that atoms(blocks) can be nested.
BaroqueBobcat
Nothing a nice piece of recursion couldn't solve! ;)
Mladen Jablanović
There is this library, haven't used it though: http://github.com/arbarlow/ruby-mp4info
BaroqueBobcat
This has to be the right way forward, this way I don't have to scan through all the atoms I don't actually need (seems most mov files have the metadata at the *end* of the file). Just gotta figure out which atoms to push into! Oh, and figure out why some don't fit the pattern…
JP
A: 

Theoretically you can use regexes against any arbitrary data, including binary strings. HTH.

rogerdpack
Nothing theoretical about it, as long as the regex engine you're using has an 8-bit mode where one byte equals one character.
Jan Goyvaerts
A: 

In Ruby you can use the /n flag when creating your regex to tell Ruby that your input is 8-bit data.

You could use /(.{4})tvsh(.{4})data(.{8})([\x20-\x7F]+)/n to match 4 bytes, tvsh, 4 bytes, data, 8 bytes, and any number of ASCII characters. I don't see any reason why this regex would be significantly slower to execute than hand-coding a similar search. If you don't care about the 4-byte and 8-byte blocks, /tvsh.{4}data.{8}([\x20-\x7F]+)/n should be nearly as fast as a literal text search for tvsh.
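
As a rough sketch, the shorter pattern could be wrapped like this. The method name extract_tag is hypothetical, and tags are assumed to be plain ASCII. One addition to the patterns above: the /m flag is included because in Ruby . does not match the byte 0x0A without it, and that byte can legitimately appear inside a length field.

```ruby
# Hypothetical helper: return the printable-ASCII payload that follows
# `tag` in a binary string, or nil if the pattern does not match.
def extract_tag(data, tag)
  # /n: treat the pattern as 8-bit data; /m: let '.' match 0x0A as well.
  pattern = /#{Regexp.escape(tag)}.{4}data.{8}([\x20-\x7F]+)/nm
  data[pattern, 1]
end

# The 32-byte 'tvsh' block from the question, rebuilt with Array#pack.
sample = [32, 'tvsh', 24, 'data', 1, 0, 'Futurama'].pack('Na4Na4NNa8')
puts extract_tag(sample, 'tvsh')   # prints Futurama
```

When reading from disk, File.binread(path) returns a binary string that suits this kind of match.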

Jan Goyvaerts