I have a web page I want to run a regular expression over, say http://www.ru.wikipedia.org/wiki/perl . The page is in Russian and I want to pull out all the Russian words. Matching with \w+ doesn't work, and matching with \p{L}+ retrieves everything.

How do I do it?

A: 

Okay, then try this:

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $response = $ua->get("http://ru.wikipedia.org/wiki/Perl");

die $response->status_line unless $response->is_success;

my $content = $response->decoded_content;

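# Emit UTF-8 so printing the decoded text does not warn about wide characters.
binmode STDOUT, ':encoding(UTF-8)';

# Lookarounds instead of \s...\s: a consumed trailing space would make
# /g skip every other word in a run of space-separated words.
my @russian = $content =~ /(?<!\S)([\x{0400}-\x{052F}]+)(?!\S)/g;

print map { "$_\n" } @russian;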

I believe that the Cyrillic block starts at U+0400 and the Cyrillic Supplement block ends at U+052F, so this should get many of the words.
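
As an alternative to hard-coding a code point range, Perl's Unicode script property can be used to match Cyrillic letters. The following is only a minimal sketch, assuming the text has already been decoded into characters; the sub name cyrillic_words and the sample string are made up for illustration.

```perl
use strict;
use warnings;
use utf8;    # the sample string below is written as UTF-8 source text

# Hypothetical helper: return every run of Cyrillic-script characters.
# Expects decoded character data, not raw UTF-8 bytes.
sub cyrillic_words {
    my ($text) = @_;
    return $text =~ /(\p{Cyrillic}+)/g;
}

# Encode output as UTF-8 so printing wide characters does not warn.
binmode STDOUT, ':encoding(UTF-8)';

my @words = cyrillic_words("Perl это язык программирования, created by Larry Wall");
print "$_\n" for @words;
```

Unlike \p{L}, which matches letters of any script, \p{Cyrillic} skips the Latin words in the sample string.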

Chas. Owens
Thanks, that throws a warning: "Malformed UTF-8 character (overflow at 0x4043e433, byte 0xd1, after start byte 0xbf) in pattern match (m//) at retrieveFromWiki.pl line 45." and retrieves everything, including English terms (in English characters).
A: 
perl -MLWP::Simple -e 'getprint "http://ru.wikipedia.org/wiki/Perl"'
403 Forbidden <URL:http://ru.wikipedia.org/wiki/Perl>

Well, that doesn't help!

Downloading a copy first, this seems to work:

use Encode;

local $/ = undef;
my $text = decode_utf8(<>);

my @words = ($text =~ /([\x{0400}-\x{04ff}]+)/gs);

foreach my $word (@words) {
  print encode_utf8($word) . "\n";
}
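
The manual decode_utf8 and encode_utf8 calls can also be replaced by PerlIO encoding layers, which decode on read and encode on write. This is a rough sketch under that assumption, not a drop-in replacement; the sub name cyrillic_words_from_file is hypothetical, and the file path is taken from the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: read a UTF-8 file through an encoding layer and
# return the runs of characters in the Cyrillic block (U+0400 to U+04FF).
sub cyrillic_words_from_file {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    local $/ = undef;              # slurp the whole file at once
    my $text = <$fh> // '';        # an empty file yields an empty string
    close $fh;
    return $text =~ /([\x{0400}-\x{04FF}]+)/g;
}

# Encode everything printed to STDOUT as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

if (@ARGV) {
    print "$_\n" for cyrillic_words_from_file($ARGV[0]);
}
```

Run it with the name of the downloaded copy as the only argument.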
Bron Gondwana