I'm trying to search a text for a match and return it with a snippet of the surrounding text. To do this, I want to find the match with a regex, then cut the string using the match index ± the snippet radius (text.mb_chars[start..finish]).

However, I cannot get Ruby's (1.8) regex to return a match index that is multi-byte aware.

I understand that regex is the one place in 1.8 that is supposed to be UTF-8 aware, but it doesn't seem to work despite the /u switch:

"Résumé" =~ /s/u
=> 3

"Resume" =~ /s/u
=> 2

The two results should be the same if the regex were really working in multibyte mode (/u), but it returns a byte index.

How do you get the match index in characters, not bytes?

Or is there some other way to get a snippet around (each) match?

A: 

Not a real answer, but too long for a comment.

The code

print "Résumé" =~ /s/u
print "\n"
print "Resume" =~ /s/u

on Windows (Ruby 1.8.6, release 26.) prints:

2
2

And on Linux (ruby 1.8.7 (2009-06-12 patchlevel 174) [i486-linux]) it prints:

3
2
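
If the returned offset is a byte offset, as the Linux output suggests, one possible workaround is to count the characters in the text that precedes the match instead of using the offset directly. The following is only a minimal sketch, assuming the input is valid UTF-8; `char_index` and `snippets_around` are hypothetical helper names, not part of Ruby or ActiveSupport, and the sketch slices an array of code points rather than using mb_chars.

```ruby
# Hypothetical helper: character (not byte) index of the first match,
# or nil when there is no match. Assumes text is valid UTF-8.
def char_index(text, pattern)
  m = pattern.match(text)
  # unpack("U*") decodes UTF-8 into code points, so the length of the
  # decoded prefix is the number of characters before the match.
  m && m.pre_match.unpack("U*").length
end

# Hypothetical helper: for each match, return the match plus up to
# `radius` characters on either side.
def snippets_around(text, pattern, radius)
  chars = text.unpack("U*") # whole text as an array of code points
  results = []
  text.scan(pattern) do
    m = Regexp.last_match
    start = m.pre_match.unpack("U*").length # match start, in characters
    len   = m[0].unpack("U*").length        # match length, in characters
    from  = [start - radius, 0].max
    to    = [start + len + radius, chars.length].min - 1
    results << chars[from..to].pack("U*")   # back to a UTF-8 string
  end
  results
end
```

With these definitions, `char_index("Résumé", /s/u)` and `char_index("Resume", /s/u)` should both give 2, and `snippets_around("Résumé", /s/u, 1)` should give `["ésu"]`. Decoding the prefix for every match is wasteful on long texts, so this is meant as an illustration rather than a tuned solution.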
Bart Kiers
Hmm, on a Mac I have ruby 1.8.6 (2008-08-11 patchlevel 287) [universal-darwin9.0], and I just checked on EC2, where there is ruby 1.8.6 (2007-09-24 patchlevel 111) [i486-linux], with the same result (i.e. 3, 2).
Otigo
@Otigo, yes, it is odd...
Bart Kiers