Matching regular expressions against non-Strings in Ruby without conversion

If a Ruby regular expression is matching against something that isn't a String, the to_str method is called on that object to get an actual String to match against. I want to avoid this behavior; I'd like to match regular expressions against objects that aren't Strings, but can be logically thought of as randomly accessible sequences of bytes, and all accesses to them are mediated through a byte_at() method (similar in spirit to Java's CharSequence.char_at() method).

For example, suppose I want to find the byte offset in an arbitrary file of an arbitrary regular expression; the expression might be multi-line, so I can't just read in a line at a time and look for a match in each line. If the file is very big, I can't fit it all in memory, so I can't just read it in as one big string. However, it would be simple enough to define a method that gets the nth byte of a file (with buffering and caching as needed for speed).

Eventually, I'd like to build a fully featured rope class, like in Ruby Quiz #137, and I'd like to be able to use regular expressions on them without the performance loss of converting them to strings.

I don't want to get up to my elbows in the innards of Ruby's regular expression implementation, so any insight would be appreciated.

You can't. This wasn't supported in Ruby 1.8.x, probably because it's such an edge case; and in 1.9 it wouldn't even make sense. Ruby 1.9 doesn't map its strings to bytes in any user-serviceable fashion; instead it uses character code points, so that it can support the multitude of encodings that it accepts. And 1.9's new optimized regex engine, Oniguruma, is also built around the same concept of encodings and code points. Bytes just don't enter into the picture at this level.

I have a suspicion that what you're asking for is a case of premature optimization. For any reasonable Ruby object, implementing to_str shouldn't be a huge performance hurdle. If it is, then Ruby's probably the wrong tool for you, as it abstracts and insulates you from your raw data in all sorts of ways.

Your example of looking for a byte sequence in a large binary file isn't an ideal use case for Ruby -- you'd be better off using grep or some other Unix tool. If you need the results in your Ruby program, run it as a system process using backticks and process the output.

ansaurus

tags:

views:

answers:

Matching regular expressions against non-Strings in Ruby without conversion

related questions