I have a list of catalog paths and need to filter out some of them. My match pattern is in a non-Unicode encoding.

I tried the following:

require 5.004;
use POSIX qw(locale_h);
my $old_locale = setlocale(LC_ALL);
setlocale(LC_ALL, "ru_RU.cp1251");

@{$data -> {doc_folder_rights}} = 
       grep {
              # catalog path pattern in $_REQUEST{q}
              $_->{doc_folder} =~/$_REQUEST{q}/i; 
            } 
            @{$data -> {doc_folder_rights}};

setlocale(LC_ALL, $old_locale);

What I need is case-insensitive regexp pattern matching when the pattern contains Russian letters.

+2  A: 

There are several (potential) issues with your code:

  1. Your code removes all doc_folders that do not match the regexp in $_REQUEST{q}, i.e. it keeps only the matching ones. The question, however, suggests that you want the opposite.

  2. You might have an encoding issue. Setting the locale (using setlocale) changes Perl's handling of upper- and lower-case conversions (and only in code where the use locale pragma is in effect), but it does not change any encoding. You need to ensure that $_REQUEST{q} is interpreted correctly.

For simplicity, you can assume that any Perl string contains Unicode data in some internal representation whose details you need not know. A conversion, implicit or explicit, happens only when Perl does I/O. When reading from stdin, ARGV, or the environment, Perl assumes that the bytes are encoded according to the current locale and converts them implicitly.
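To make the octets-versus-characters distinction concrete, here is a minimal sketch using the core Encode module. The sample letter and the variable names are illustrative assumptions; the input is built from a code point so the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# cp1251 octets for the upper-case letter U+0414 (Cyrillic "De"),
# standing in for data that arrives from outside the program
my $octets = encode('cp1251', "\x{414}");

# Decoding turns the octets into a Perl character string
my $chars = decode('cp1251', $octets);

# On a decoded string, lc() follows Unicode case rules
my $lower = lc $chars;
printf "U+%04X -> U+%04X\n", ord($chars), ord($lower);

# Encode again only when the result leaves the program
my $lower_octets = encode('cp1251', $lower);
```

Once the data is decoded, case handling no longer depends on the locale of the process.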

If you have an encoding issue, there are several ways to fix it:

  1. Fix the environment in which Perl runs so that it knows the correct locale from the very start. That will fix the implicit conversion.
  2. In the unlikely case that $_REQUEST is loaded from a filehandle, you can explicitly tell Perl to convert using binmode($fh, ":encoding(cp1251)");. Do that before reading $_REQUEST.
  3. The function $string = Encode::decode(Encoding, $octets) tells Perl to drop its assumption about the encoding of $octets and instead treat the contents of $octets as a byte stream that needs to be converted to Unicode using Encoding. You need to do that before touching the contents of $octets, or strange things may happen.
  4. Since $_REQUEST was probably loaded by some CGI module, and was probably URL-encoded in transit, you could just tell the CGI module how to do the decoding correctly.
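As a rough illustration of option 3 applied to the filter from the question, the sketch below decodes both the pattern and each folder name before matching. The helper name filter_folders and the sample data are hypothetical, and the sketch assumes that both the query and the folder names arrive as cp1251 octets. \Q...\E treats the query as literal text; drop it if the query is meant to be a regexp.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: keep only the entries whose doc_folder contains the
# query, ignoring case. Both inputs are assumed to hold cp1251 octets.
sub filter_folders {
    my ($query_octets, @rights) = @_;

    # Decode the pattern once, before it is used in any match
    my $query = decode('cp1251', $query_octets);

    return grep {
        my $octets = $_->{doc_folder};
        my $folder = decode('cp1251', $octets);
        $folder =~ /\Q$query\E/i;
    } @rights;
}

# Sample data built from code points, so the sketch does not depend on
# the encoding of this source file
my @rights = map { +{ doc_folder => encode('cp1251', $_) } } (
    "/\x{414}\x{43e}\x{43a}\x{443}\x{43c}\x{435}\x{43d}\x{442}\x{44b}/2009",
    "/\x{410}\x{440}\x{445}\x{438}\x{432}",
);

# Lower-case query that should match the first (capitalised) folder only
my $q = encode('cp1251', "\x{434}\x{43e}\x{43a}");

my @kept = filter_folders($q, @rights);
print scalar(@kept), " folder(s) kept\n";
```

To drop the matching folders instead of keeping them, negate the match with !~ in the grep block.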
edgar.holleis