I'm working with UTF-8 strings. I need to get a slice using byte-based indexes, not char-based.
I found references on the web to String#subseq, which is supposed to be like String#[] but for bytes. Alas, it seems not to have made it into 1.9.1.
Now, why would I want to do that? There's a chance I'll end up with an invalid string should I slice in the middle of a multi-byte char. This sounds like a terrible idea.
Well, I'm working with StringScanner, and it turns out its internal pointers are byte-based. I'm open to other options here.
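To make the mismatch concrete, here's a minimal sketch. The sample string and the variable names are made up for illustration; the only real API used is StringScanner#scan, #pos and #matched.

```ruby
# encoding: utf-8
require "strscan"

s = "héllo wörld"                  # "é" and "ö" each take two bytes in UTF-8
scanner = StringScanner.new(s)
scanner.scan(/\S+/)                # consumes "héllo"

byte_pos = scanner.pos             # scan pointer, counted in bytes
char_len = scanner.matched.length  # length of the match, counted in chars

puts "pos = #{byte_pos}, chars matched = #{char_len}"
```

The two numbers differ by one because of the two-byte "é", so using pos as an index into String#[] would point one character too far.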
Here's what I'm working with right now, but it's rather verbose:
s.dup.force_encoding("ASCII-8BIT")[ix...pos].force_encoding("UTF-8")
Both ix and pos come from StringScanner, so they are byte-based.
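For reference, here's the same expression wrapped in a helper so the verbosity lives in one place. This is only a sketch: byte_slice is a name I made up, not a built-in, and it assumes the input is UTF-8 and that both offsets are within the string.

```ruby
# encoding: utf-8

# Hypothetical helper (not part of Ruby): returns the bytes in from...to
# of str, re-tagged as UTF-8. Works on a copy, so str itself is untouched.
def byte_slice(str, from, to)
  str.dup.force_encoding("ASCII-8BIT")[from...to].force_encoding("UTF-8")
end

s = "héllo wörld"

whole  = byte_slice(s, 0, 6)  # ends on a char boundary
broken = byte_slice(s, 0, 2)  # cuts the two-byte "é" in half

puts whole.valid_encoding?
puts broken.valid_encoding?
```

Checking valid_encoding? on the result is a cheap way to detect a slice that landed in the middle of a multi-byte char.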