ansaurus

Question

Ruby 1.9: Regular Expressions with unknown input encoding

Answer 1

+2 A:

As far as I am aware, there is no better method to use. However, might I suggest a slight alteration?

Rather than changing the encoding of the input, why not change the encoding of the regex? Translating one regex string every time you meet a new encoding is a lot less work than translating hundreds or thousands of lines of input to match the encoding of your regex.

# Utility function to make transcoding the regex simpler.
def get_regex(pattern, encoding='ASCII', options=0)
  Regexp.new(pattern.encode(encoding),options)
end



  # Inside code looping through lines of input.
  # The variables 'regex' and 'last_encoding' should be initialized before the
  # loop, so that they persist across iterations.
  if line.respond_to?(:encoding)  # Ruby 1.8 compatibility (strings have no #encoding there)
    if line.encoding != last_encoding
      regex = get_regex('<p>(.*)<\/p>',line.encoding,16) # 16 = the option bit that //u sets (00010000)
      last_encoding = line.encoding
    end
  end
  line.match(regex)

In the pathological case (where the input encoding changes on every line) this would be just as slow, since you're re-encoding the regex every single time through the loop. But in the common case, where the encoding is constant for an entire file of hundreds or thousands of lines, the regex is re-encoded only once per file, which is a vast reduction in re-encoding.
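As a variation on the loop above, here is a minimal sketch that keeps one compiled regex per encoding in a hash, so that even input that alternates between a few encodings only pays for each transcoding once. The names `PATTERN_SOURCE`, `regex_for` and `extract_paragraphs` are made up for illustration, and the sketch assumes Ruby 1.9 or later with ASCII-compatible, non-dummy input encodings.

```ruby
# Hypothetical names throughout; assumes Ruby 1.9+ (String#encode, String#encoding).
PATTERN_SOURCE = '<p>(.*)<\/p>'

# Returns the pattern compiled for the given encoding, building it on first use.
def regex_for(encoding, cache)
  cache[encoding] ||= Regexp.new(PATTERN_SOURCE.encode(encoding))
end

# Returns the captured paragraph text for each line, or nil where nothing matches.
def extract_paragraphs(lines)
  cache = {}
  lines.map do |line|
    match = line.match(regex_for(line.encoding, cache))
    match && match[1]
  end
end

lines = [
  "<p>caf\u00e9</p>",                       # UTF-8 input
  "<p>caf\u00e9</p>".encode('ISO-8859-1'),  # the same text as Latin-1
  "no paragraph here"
]
results = extract_paragraphs(lines)
```

Each captured string keeps the encoding of the line it came from, so the input itself is never transcoded.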

Myrddin Emrys 2009-12-22 00:26:05

Thanks! It had not occurred to me to do it the other way round and encode the Regexp. That's indeed a lot faster! For anyone else trying to do this: Beware of dummy encodings (`#dummy?`) when you try to test your code. Took me a while to figure out why it wasn't working.

DataWraith 2009-12-22 09:57:20
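The warning about dummy encodings can be turned into a small guard. This is only a sketch: `transcodable_regex` is a hypothetical helper, and returning `nil` for a dummy encoding is an assumed policy (the caller would then have to transcode the input instead). `Encoding#dummy?` itself is part of Ruby 1.9 and later, and ISO-2022-JP is one of the encodings it reports as dummy.

```ruby
# Hypothetical guard: only transcode the pattern when the target encoding
# is one Ruby can actually process character by character.
def transcodable_regex(pattern, encoding)
  return nil if encoding.dummy?  # placeholder encodings cannot be matched against
  Regexp.new(pattern.encode(encoding))
end

utf8_regex  = transcodable_regex('<p>(.*)<\/p>', Encoding::UTF_8)
dummy_regex = transcodable_regex('<p>(.*)<\/p>', Encoding::ISO_2022_JP)
```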

Answer 2

A:

Follow the advice of this page: http://gnuu.org/2009/02/02/ruby-19-common-problems-pt-1-encoding/ and add

# encoding: utf-8

to the top of your .rb file.

Sam 2010-06-29 12:20:35
