I want to post Russian text to a CP1251 site using LWP::UserAgent, and I get the following results:

# $text="Русский текст"; obtained from command line
FIELD_NAME => $text                                # result: Г?в г'В?г'В?г'В?г?вєг?вёг?в? Г'В'Г?вчг?вєг'В?г'В'
$text=Encode::decode_utf8($text);
FIELD_NAME => $text                                # result: Р с?с?с?рєрёр? С'Рчрєс?с'
FIELD_NAME => Encode::encode("cp1251", $text)     # result: Г?гіг+г+гЄгёгЏ ГІгҐгЄг+гІ
FIELD_NAME => URI::Escape::uri_escape_utf8($text) # result: D0%a0%d1%83%d1%81%d1%81%d0%ba%d0%b8%d0%b9%20%d1%82%d0%b5%d0%ba%d1%81%d1%82

How can I do this? The Content-Type must be application/x-www-form-urlencoded. You can find a similar form here, but there you can simply escape any non-Latin character using the &#...; form. Trying to escape the text that way in FIELD_NAME results in either 10561091108910891 10901077108210891 (every &, # and ; is stripped out of the string) or 1056;усский текст (punctuation characters at the beginning of the string are stripped out), depending on what FIELD_NAME actually is.
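For reference, here is a minimal sketch of what the two target forms look like when produced with Encode alone. It assumes the text is already a decoded character string (hence `use utf8` for the literal); the variable names are only illustrative. `Encode::FB_HTMLCREF` is what produces the &#...; references, and it only kicks in for characters the target encoding cannot represent.

```perl
use strict;
use warnings;
use utf8;                      # the string literal below is written in UTF-8
use Encode ();

my $text = "Русский текст";    # a character string, not bytes

# LEAVE_SRC keeps encode() from modifying $text in place when a CHECK value is given.
my $check = Encode::FB_HTMLCREF | Encode::LEAVE_SRC;

# ASCII cannot represent Cyrillic, so every letter becomes a decimal &#...; reference.
my $as_refs = Encode::encode("ascii", $text, $check);

# Every letter here exists in cp1251, so no references are needed:
# the result is one raw byte per character.
my $as_cp1251 = Encode::encode("cp1251", $text, $check);

print "$as_refs\n";
print uc(unpack("H*", $as_cp1251)), "\n";   # the cp1251 bytes as hex
```

The hex dump makes it easy to compare against whatever the server finally receives.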

UPDATE: Does anybody know how to convert the following code so that it uses the LWP::UserAgent::post function?

my $url=shift;
my $fields=shift;
my $request=HTTP::Request->new(POST => absURL($url));
$request->content_type('application/x-www-form-urlencoded');
$request->content_encoding("UTF-8");
$ua->prepare_request($request);
my $content="";
for my $k (keys %$fields) {
    $content.="&" if($content ne "");
    my $c=$fields->{$k};
    eval {$c=Encode::decode_utf8($c)};
    $c=Encode::encode("cp1251", $c, Encode::FB_HTMLCREF);
    $content.="$k=".URI::Escape::uri_escape($c);
}
$request->content($content);
my $response=$ua->simple_request($request);

This code actually solves the problem, but I do not want to add a third request wrapper function (alongside get and post).
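One possible direction, sketched under assumptions: `post` builds its request with `POST()` from HTTP::Request::Common, and that function accepts a ready-made body through `Content => $string`, which it uses verbatim. So the encoding loop can live in a small helper and the result can be handed to `post` directly. The helper name `cp1251_form_body` and the example.com URL are hypothetical, and the field values are assumed to be decoded character strings already.

```perl
use strict;
use warnings;
use utf8;
use Encode ();
use URI::Escape ();
use HTTP::Request::Common qw(POST);

# Hypothetical helper: turn a hash of character strings into a
# cp1251-encoded, percent-escaped request body.
sub cp1251_form_body {
    my ($fields) = @_;
    my @pairs;
    for my $name (sort keys %$fields) {      # sorted only to make the output stable
        my $value = $fields->{$name};        # copy, because encode() may clobber its input
        my $bytes = Encode::encode("cp1251", $value, Encode::FB_HTMLCREF);
        push @pairs,
            URI::Escape::uri_escape($name) . "=" . URI::Escape::uri_escape($bytes);
    }
    return join "&", @pairs;
}

my $body = cp1251_form_body({ FIELD_NAME => "Русский текст" });

# POST() builds the request without sending it, which makes the body easy to inspect.
my $request = POST(
    "http://example.com/form",               # placeholder URL
    Content_Type => "application/x-www-form-urlencoded",
    Content      => $body,
);
print $request->content, "\n";

# With a real user agent the same arguments would go to post():
#   my $response = $ua->post($url,
#       Content_Type => "application/x-www-form-urlencoded",
#       Content      => $body);
```

Because the body is passed as a plain string, the library does no further encoding of the field values.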

A: 

One way around it (far from the best, I think) appears to be to use the recode system command, if you have it available. From http://const.deribin.com/files/SignChanger.pl.txt:

my $boardEncoding="cp1251"; # encoding used by the board
$vals{'Post'} = `fortune $forunePath | recode utf8..$boardEncoding`;
$res = $ua->post($formURL,\%vals);
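The same conversion can probably be done in-process, without a shell pipeline. The sketch below uses `Encode::from_to`, which rewrites a byte string in place from one encoding to another; the sample text merely stands in for the output of the external command and is an assumption.

```perl
use strict;
use warnings;
use utf8;
use Encode ();

# Stand-in for the output of an external command: UTF-8 encoded bytes.
my $octets = Encode::encode("UTF-8", "Русский текст");

# Convert the bytes in place from UTF-8 to cp1251; no shell is involved.
my $length = Encode::from_to($octets, "UTF-8", "cp1251");

print "$length bytes: ", uc(unpack("H*", $octets)), "\n";
```

This also sidesteps interpolating variables into a command line.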

Another approach seems to be in http://mail2lj.nichego.net/lj.txt

my        $formdata = $1 ;
my        $hr = ljcomment_string2form($formdata) ;
my        $req = new HTTP::Request('POST' => $ljcomment_action)
        or die "new HTTP::Request(): $!\n" ;

$hr->{usertype} = 'user' ;
$hr->{encoding} = $mh->mime_attr('content-type.charset') ||
                  "cp1251" ;
$hr->{subject}  = decode_mimewords($mh->get('Subject'));
$hr->{body} = $me->bodyhandle->as_string() ;

$req->content_type('application/x-www-form-urlencoded');
$req->content(href2string($hr)) ;

my      $ljres = submit_request($req, "comment") ;

if ($ljres->{'success'} eq "OK") {
    print STDERR "journal updated successfully\n" ;
} else {
    print STDERR "error updating journal: $ljres->{errmsg}\n" ;
    send_bounce($ljres->{errmsg}, $me, $mh->mime_attr("content-type.charset")) ;
}
DVK
Remember, don't use variables in a command like this unless you're sure it's secure. Instead, open two pipes, possibly one of them two-way.
sreservoir
@sreservoir correct... the code is probably fairly old... this is merely an example of encoding, not a gold copy :)
DVK
What is the difference between `recode` and using `Encode::encode`? And why not `iconv`?
ZyX
I have another suggestion: is there any way to dump the exact contents of whatever the browser is sending? If I know this, it will be easier to construct a similar request.
ZyX
Dare I ask if Data::Dumper barfs on printing `$ua` object with weird russkie encodings?
DVK
@DVK The $ua object does not contain any Russian text; it appears only in some request objects and in almost all response objects.
ZyX
DVK, the WTF to code ratio is alarmingly high for those two snippets. I wish you wouldn't post garbage you come across on Google. A user with your reputation should know that CPAN's the gold standard.
daxim
A: 

Use WWW::Mechanize, it takes care of encoding (both character encoding and form encoding) automatically and does the right thing if a form element's accept-charset attribute is set appropriately. If it's missing, the form defaults to UTF-8 and thus needs correction. You seem to be in this situation. By the way, your example site's encoding is KOI8-R, not Windows-1251. Working example:

use utf8;
use WWW::Mechanize qw();
my $message = 'Русский текст';
my $mech = WWW::Mechanize->new(
    cookie_jar => {},
    agent => 'Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/533.9 SUSE/6.0.401.0-2.1 (KHTML, like Gecko)',
);
$mech->get('http://zhurnal.lib.ru/cgi-bin/comment?COMMENT=/z/zyx/index_4-1');
$mech->current_form->accept_charset(scalar $mech->response->content_type_charset);
$mech->submit_form(with_fields => { TEXT => $message });

HTTP dump (essential parts only):

POST /cgi-bin/comment HTTP/1.1
Content-Length: 115
Content-Type: application/x-www-form-urlencoded

FILE=%2Fz%2Fzyx%2Findex_4-1&MSGID=&OPERATION=store_new&NAME=&EMAIL=&URL=&TEXT=%F2%D5%D3%D3%CB%C9%CA+%D4%C5%CB%D3%D
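To double-check which encoding the dump above uses, the escapes can be reversed by hand. This is only a sketch: it takes the complete escapes from the start of the TEXT field (the dump is cut off after them), undoes the form encoding with core Perl, and decodes the bytes as KOI8-R.

```perl
use strict;
use warnings;
use Encode ();

binmode STDOUT, ":encoding(UTF-8)";

# The complete escapes from the start of the TEXT field in the dump above.
my $escaped = "%F2%D5%D3%D3%CB%C9%CA+%D4%C5%CB%D3";

# Undo the form encoding: "+" means space, %XX is one raw byte.
(my $bytes = $escaped) =~ tr/+/ /;
$bytes =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;

# Interpreting those bytes as KOI8-R recovers the Cyrillic text.
my $decoded = Encode::decode("koi8-r", $bytes);
print "$decoded\n";
```

Decoding the same bytes as cp1251 instead would give unrelated letters, which is a quick way to tell the two encodings apart.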
daxim
WWW::Mechanize successfully failed when I tried it: the result was the same as on the third line.
ZyX
I have edited the answer to rectify this problem.
daxim
A: 

These functions solve the issue (the first is for posting application/x-www-form-urlencoded data and the second is for multipart/form-data):

#{{{2 postue
sub postue($$;$) {
    my $url=shift;
    my $fields=shift;
    my $referer=shift;
    if(defined $referer and $referer eq "" and defined $fields->{"DIR"}) {
        $referer=absURL($url."?DIR=".$fields->{"DIR"}); }
    else {
        $referer=absURL($referer); }
    my $request=HTTP::Request->new(POST => absURL($url));
    $request->content_type('application/x-www-form-urlencoded');
    $request->content_encoding("UTF-8");
    $ua->prepare_request($request);
    my $content="";
    for my $k (keys %$fields) {
        $content.="&" if($content ne "");
        my $c=$fields->{$k};
        if(not ref $c) {
            $c=Encode::decode_utf8($c) unless Encode::is_utf8($c);
            $c=Encode::encode("cp1251", $c, Encode::FB_HTMLCREF);
            $c=URI::Escape::uri_escape($c);
        }
        elsif(ref $c eq "URI::URL") {
            $c=$c->canonical();
            $c=URI::Escape::uri_escape($c);
        }
        $content.="$k=$c";
    }
    $request->content($content);
    $request->referer($referer) if(defined $referer);
    my $i=0;
    print STDERR "Doing POST request to url $url".
        (($::o_verbose>2)?(" with fields:\n".
                ::YAML::dump($fields)):("\n"))
        if($::o_verbose>1);
  REQUEST:
    my $response=$ua->simple_request($request);
    $i++;
    my $code=$response->code;
    if($i<=$o_maxtries and 500<=$code and $code<600) {
        print STDERR "Failed to request $url with code $code... retrying\n"
            if($::o_verbose>2);
        sleep $o_retryafter;
        goto REQUEST;
    }
    return $response;
}
#{{{2 postfd
sub postfd($$;$) {
    my $url=absURL(shift);
    my $content=shift;
    my $referer=shift;
    $referer=absURL($referer) if(defined $referer);
    my $i=0;
    print STDERR "Doing POST request (form-data) to url $url".
        (($::o_verbose>2)?(" with fields:\n".
                ::YAML::dump($content)):("\n"))
        if($::o_verbose>1);
    my $newcontent=[];
    while(my ($f, $c)=splice @$content, 0, 2) {
        if(not ref $c) {
            $c=Encode::decode_utf8($c) unless Encode::is_utf8($c);
            $c=Encode::encode("cp1251", $c, Encode::FB_HTMLCREF);
        }
        push @$newcontent, $f, $c;
    }
  POST:
    my $response=$ua->post($url, $newcontent,
                           Content_type => "form-data",
                           ((defined $referer)?(referer => $referer):()));
    $i++;
    my $code=$response->code;
    if($i<=$o_maxtries and 500<=$code and $code<600) {
        print STDERR "Failed to download $url with code $code... retrying\n"
            if($::o_verbose>2);
        sleep $o_retryafter;
        goto POST;
    }
    return $response;
}
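Both functions repeat the same retry-on-5xx logic with `goto`. As a rough sketch, that part could be factored into one helper that takes the request as a code reference. The name `request_with_retries` and the `FakeResponse` class are hypothetical; the fake class only stands in for an HTTP::Response so the example runs without a network.

```perl
use strict;
use warnings;

# Hypothetical helper: call $do_request->() and repeat it while the server
# answers 5xx, at most $max_retries extra times, sleeping $delay seconds
# before each retry.
sub request_with_retries {
    my ($do_request, $max_retries, $delay) = @_;
    my $response;
    for my $attempt (0 .. $max_retries) {
        $response = $do_request->();
        my $code = $response->code;
        last unless $code >= 500 && $code < 600;
        sleep $delay if $delay && $attempt < $max_retries;
    }
    return $response;
}

# Minimal stand-in for a response object; only code() is needed here.
{
    package FakeResponse;
    sub new  { my ($class, $code) = @_; return bless { code => $code }, $class }
    sub code { return $_[0]{code} }
}

my @codes = (503, 502, 200);   # two server errors, then success
my $calls = 0;
my $response = request_with_retries(
    sub { $calls++; return FakeResponse->new(shift @codes) },
    5,                         # up to five retries
    0,                         # no delay in this demonstration
);
print "final code ", $response->code, " after $calls calls\n";

# With LWP the callback would wrap the real call, for example:
#   request_with_retries(sub { $ua->simple_request($request) },
#                        $o_maxtries, $o_retryafter);
```

Like the original loops, this makes at most one initial attempt plus the configured number of retries and returns the last response either way.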
ZyX
I pity the person who is going to maintain this ersatz CGI library.
daxim
@daxim 1. Why? 2. It is not a CGI library.
ZyX