I'm working with UTF-8 strings. I need to get a slice using byte-based indexes, not char-based.

I found references on the web to String#subseq, which is supposed to be like String#[], but for bytes. Alas, it seems not to have made it to 1.9.1.

Now, why would I want to do that? There's a chance I'll end up with an invalid string should I slice in the middle of a multi-byte char. This sounds like a terrible idea.

Well, I'm working with StringScanner, and it turns out its internal pointers are byte-based. I'm open to other approaches.
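To illustrate the mismatch, here is a minimal sketch showing that StringScanner#pos counts bytes rather than characters. The sample string and variable names are made up for the example.

```ruby
require 'strscan'

# "é" and "ö" each occupy two bytes in UTF-8 (sample text is illustrative).
s = "h\u00e9llo w\u00f6rld"
scanner = StringScanner.new(s)

scanner.scan(/\S+/)                 # consumes "héllo"
byte_pos = scanner.pos              # offset in bytes
char_len = scanner.matched.length   # length in characters

puts byte_pos   # 6
puts char_len   # 5
```

Because the two numbers differ, passing scanner.pos to String#[] would select the wrong characters.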

Here's what I'm working with right now, but it's rather verbose:

s.dup.force_encoding("ASCII-8BIT")[ix...pos].force_encoding("UTF-8")

Both ix and pos come from StringScanner, so are byte-based.
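As a sketch of the risk mentioned above, String#valid_encoding? can tell whether a byte-based slice landed in the middle of a multi-byte character. The helper name byte_substring is hypothetical and only wraps the one-liner shown earlier.

```ruby
# Hypothetical helper wrapping the force_encoding one-liner from the question.
def byte_substring(str, range)
  str.dup.force_encoding("ASCII-8BIT")[range].force_encoding(str.encoding)
end

s = "h\u00e9llo"   # bytes: 68 C3 A9 6C 6C 6F

whole = byte_substring(s, 0...3)   # ends after both bytes of "é"
split = byte_substring(s, 0...2)   # ends between the two bytes of "é"

puts whole.valid_encoding?   # true
puts split.valid_encoding?   # false
```

Offsets that come from StringScanner should normally fall on character boundaries, so the check is mostly useful as a safety net.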

+1  A: 

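You can do this too: s.bytes.to_a[ix...pos].pack("C*").force_encoding("UTF-8"), but that looks even more esoteric to me. (Note that join("") would not work here, because String#bytes yields integers, not one-byte strings.)

If you're running that line in several places, a nicer way to do it could be this: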

class String
  def byteslice(*args)
    self.dup.force_encoding("ASCII-8BIT").slice(*args).force_encoding("UTF-8")
  end
end

s.byteslice(ix...pos)
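A note for newer interpreters: as far as I know, Ruby 1.9.3 and later ship a built-in String#byteslice that accepts the same index forms, so the monkey patch is only needed on older versions. A guarded variant of the patch, sketched below, avoids overriding the built-in; it also reuses the receiver's encoding instead of assuming UTF-8. The sample string and offsets are illustrative.

```ruby
class String
  # Define the fallback only when the interpreter lacks a native byteslice.
  unless method_defined?(:byteslice)
    def byteslice(*args)
      dup.force_encoding("ASCII-8BIT").slice(*args).force_encoding(encoding)
    end
  end
end

s = "h\u00e9llo w\u00f6rld"
ix, pos = 7, 13              # byte offsets, as StringScanner would report them
piece = s.byteslice(ix...pos)

puts piece            # wörld
puts piece.encoding   # UTF-8
```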
dvyjones
This is just stashing away the same code. I was wondering if there is not indeed a char slicer in ruby19.
kch
Sorry, but you seem to be a bit ambiguous here: do you want a char slicer or a byte slicer? As of Ruby 1.9.1, there is no byte slicer without a bit of hacking. I personally like the first code in my answer the best, but that's up to you to choose.
dvyjones
I guess there isn't. Sometimes you have to add your own special use case to the Ruby standard library, and that's why the whole idea of open classes was invented.
Ken Bloom
Oops, I meant byte slicer.
kch
As for your first solution, it reads prettier, and I'm generally all for that, but this particular line runs a million times, so I'll go with the one that performs best. That's also why I'm particularly interested in Ruby having its own internal implementation: it would probably be written in C in the base string code.
kch
+1  A: 

Doesn't String#bytes do what you want? It returns an enumerator over the bytes in a string (as numbers, since they might not be valid characters, as you pointed out).

str.bytes.to_a.slice(...)
Marc-André Lafortune
But I still need a substring, not an array of bytes. It seems your suggestion would only make the operation a lot more expensive, going from enum to array and then building a UTF-8 string out of that again.
kch
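Since performance is the deciding factor in this thread, a rough way to compare the two approaches is Ruby's Benchmark module. This is only a sketch: the sample string, offsets, and iteration count are assumptions, and the timings will vary by machine and Ruby version.

```ruby
require 'benchmark'

s = ("h\u00e9llo w\u00f6rld " * 100).freeze   # illustrative sample text
ix, pos = 7, 13                               # illustrative byte offsets
n = 1_000                                     # illustrative iteration count

# Approach from the question: reinterpret as binary, slice, reinterpret back.
via_force = lambda { s.dup.force_encoding("ASCII-8BIT")[ix...pos].force_encoding("UTF-8") }
# Approach via String#bytes: build an array of integers, slice, repack.
via_bytes = lambda { s.bytes.to_a[ix...pos].pack("C*").force_encoding("UTF-8") }

t_force = Benchmark.realtime { n.times { via_force.call } }
t_bytes = Benchmark.realtime { n.times { via_bytes.call } }

puts "force_encoding: #{t_force}s"
puts "bytes + pack:   #{t_bytes}s"
```

Both lambdas return the same substring, so whichever measures faster on the target interpreter can be swapped in without changing behavior.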