ansaurus

Question

How do I match a Russian word in Unicode text using Perl?

Answer 1

A:

Okay, then try this:

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $response = $ua->get("http://ru.wikipedia.org/wiki/Perl");

die $response->status_line unless $response->is_success;

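# decoded_content returns character data, decoded using the response's charset
my $content = $response->decoded_content;

# Capture whitespace-delimited runs of characters from the Cyrillic
# (U+0400-U+04FF) and Cyrillic Supplement (U+0500-U+052F) blocks
my @russian = $content =~ /\s([\x{0400}-\x{052F}]+)\s/g;

# Encode output as UTF-8 to avoid "Wide character in print" warnings
binmode STDOUT, ':encoding(UTF-8)';
print map { "$_\n" } @russian;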

I believe that the Cyrillic block starts at U+0400 and the Cyrillic Supplement block ends at U+052F, so this should get many of the words.
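
As an alternative to hard-coding code point ranges, Perl's Unicode script property can express the same idea. The following is only a minimal sketch and not part of the original answer: the helper name cyrillic_words and the sample string are made up for illustration, and it assumes the text has already been decoded into characters.

```perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 string literals

# Hypothetical helper: return every maximal run of Cyrillic-script
# characters found in an already decoded string.
sub cyrillic_words {
    my ($text) = @_;
    return $text =~ /(\p{Cyrillic}+)/g;
}

# Encode characters as UTF-8 when printing
binmode STDOUT, ':encoding(UTF-8)';

my @sample = cyrillic_words("Perl это язык программирования, created by Larry Wall");
print "$_\n" for @sample;
```

Because the pattern does not require whitespace on either side, words next to punctuation or separated by a single space are still captured.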

Chas. Owens 2009-05-01 02:38:01

Thanks, that throws a warning: "Malformed UTF-8 character (overflow at 0x4043e433, byte 0xd1, after start byte 0xbf) in pattern match (m//) at retrieveFromWiki.pl line 45." and retrieves everything, including English terms (in English characters).

2009-05-01 02:48:09

Answer 2

+2 A:

perl -MLWP::Simple -e 'getprint "http://ru.wikipedia.org/wiki/Perl"'
403 Forbidden <URL:http://ru.wikipedia.org/wiki/Perl>

Well, that doesn't help!

Downloading a copy first, this seems to work:

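use strict;
use warnings;
use Encode;

# Slurp the whole input and decode it from UTF-8 bytes into characters
local $/ = undef;
my $text = decode_utf8(<>);

# Collect every run of characters in the Cyrillic block (U+0400-U+04FF)
my @words = ($text =~ /([\x{0400}-\x{04ff}]+)/gs);

# Re-encode each word as UTF-8 bytes for output
foreach my $word (@words) {
  print encode_utf8($word) . "\n";
}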
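
The decode-on-input, encode-on-output steps can also be delegated to PerlIO encoding layers. This is a hedged sketch rather than the answer's own method: the helper name cyrillic_words_from_fh is hypothetical, and an in-memory byte string stands in for the downloaded page so the example runs without a network connection or a file on disk.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: slurp a filehandle that already decodes its input
# and return every run of characters in the Cyrillic block (U+0400-U+04FF).
sub cyrillic_words_from_fh {
    my ($fh) = @_;
    local $/ = undef;    # read the whole stream in one go
    my $text = <$fh>;
    return $text =~ /([\x{0400}-\x{04FF}]+)/g;
}

# Stand-in for a downloaded page: UTF-8 bytes held in memory
my $bytes = encode_utf8("Perl \x{044F}\x{0437}\x{044B}\x{043A} page");

# The :encoding(UTF-8) layer decodes bytes into characters while reading
open my $in, '<:encoding(UTF-8)', \$bytes or die "Cannot open: $!";
my @found = cyrillic_words_from_fh($in);
close $in;

# The same kind of layer on STDOUT encodes characters on the way out
binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @found;
```

With a real file, the reference to $bytes would be replaced by the file's path; the layers then take the place of the explicit decode_utf8 and encode_utf8 calls.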

Bron Gondwana 2009-05-01 03:08:19
