I have a website I want to run a regexp on, say http://www.ru.wikipedia.org/wiki/perl . The site is in Russian and I want to pull out all the Russian words. Matching with \w+ doesn't work, and matching with \p{L}+ retrieves everything.
How do I do it?
Okay, then try this:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
my $response = $ua->get("http://ru.wikipedia.org/wiki/Perl");
die $response->status_line unless $response->is_success;
my $content = $response->decoded_content;
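# Match runs of Cyrillic letters directly. Anchoring on \s at both ends
# would consume the separating space and skip adjacent words.
my @russian = $content =~ /([\x{0400}-\x{052F}]+)/g;
binmode STDOUT, ':encoding(UTF-8)';   # avoid "Wide character" warnings
print map { "$_\n" } @russian;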
I believe that the Cyrillic block starts at 0x0400 and the Cyrillic Supplement block ends at 0x052F, so this should get many of the words.
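As an alternative to spelling out the code point range by hand, Perl's Unicode script property can express the same idea. This is only a sketch: the sample string is made up for illustration and is not taken from the page, and \p{Cyrillic} covers every character with the Cyrillic script, which is a somewhat wider set than the two blocks named above.

```perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 literals

# Illustrative sample text (an assumption, not content from the page).
my $sample = 'Perl это высокоуровневый язык программирования (Larry Wall, 1987).';

# \p{Cyrillic} matches any character whose Unicode script is Cyrillic,
# so Latin words, digits and punctuation are skipped.
my @words = $sample =~ /(\p{Cyrillic}+)/g;

binmode STDOUT, ':encoding(UTF-8)';    # encode wide characters on output
print "$_\n" for @words;
```

This also suggests why \p{L}+ "retrieves everything": it matches letters of any script, Latin included.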
perl -MLWP::Simple -e 'getprint "http://ru.wikipedia.org/wiki/Perl"'
403 Forbidden <URL:http://ru.wikipedia.org/wiki/Perl>
Well, that doesn't help!
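The 403 is probably a reaction to the default client identification that LWP sends, since Wikimedia asks clients to send a descriptive User-Agent. Here is a minimal sketch, assuming that is the cause; the agent string is a made-up placeholder and should be replaced with something that identifies your own script.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical identifying string; substitute your own tool name and contact.
my $ua = LWP::UserAgent->new(
    agent   => 'RussianWordExtractor/0.1 (contact: you@example.org)',
    timeout => 10,
);

# Fetch only when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $response = $ua->get($url);
    die $response->status_line unless $response->is_success;
    binmode STDOUT, ':encoding(UTF-8)';
    print $response->decoded_content;
}
```

Run it with the page URL as the only argument. Whether the request is then accepted depends on the server's policy.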
Downloading a copy first, this seems to work:
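use strict;
use warnings;
use Encode;

local $/ = undef;              # slurp mode: read the whole input at once
my $text = decode_utf8(<>);    # bytes -> characters
# Capture every run of characters in the Cyrillic block (U+0400-U+04FF).
my @words = $text =~ /([\x{0400}-\x{04FF}]+)/g;
foreach my $word (@words) {
    print encode_utf8($word), "\n";    # characters -> bytes for output
}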
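The decode step is what makes the character range usable, and it may also be why \w+ appeared not to work on the raw data. The following sketch shows the difference on a single hard-coded word (the bytes are the UTF-8 encoding of "язык", chosen here purely for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# UTF-8 bytes for one Cyrillic word, as they would arrive from a file or socket.
my $bytes = "\xD1\x8F\xD0\xB7\xD1\x8B\xD0\xBA";
my $chars = decode_utf8($bytes);

# Count how many characters fall in the Cyrillic block before and after decoding.
my $raw_hits     = () = $bytes =~ /[\x{0400}-\x{04FF}]/g;
my $decoded_hits = () = $chars =~ /[\x{0400}-\x{04FF}]/g;

printf "bytes: length %d, Cyrillic matches %d\n", length($bytes), $raw_hits;
printf "chars: length %d, Cyrillic matches %d\n", length($chars), $decoded_hits;
# prints:
# bytes: length 8, Cyrillic matches 0
# chars: length 4, Cyrillic matches 4
```

Before decoding, Perl sees eight separate bytes, none of which is in the Cyrillic range; after decoding it sees four Cyrillic characters.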