I'm trying to write a Perl client program to connect to a Java server application (JDuplicate). I see that the Java server uses the DataInput.readUTF and DataOutput.writeUTF methods, which the JDuplicate website lists as "Java's modified UTF-8 protocol".

My test program is pretty simple. I'm trying to send client type data, which should invoke a response from the server, but it just times out:

#!/usr/bin/perl

use strict;
use Encode;
use IO::Socket;

my $remote = IO::Socket::INET->new(
  Proto => 'tcp',
  PeerAddr => 'localhost',
  PeerPort => '10421'
) or die "Cannot connect to server\n";

$|++;

$remote->send(encode_utf8("CLIENTTYPE|JDSC#0.5.9#0.2"));
while (<$remote>) {
  print $_,"\n";
}
close($remote);

exit(0);

I've tried $remote->send(pack("U","...")), "use utf8;", binmode($remote, ":utf8"), and sending plain ASCII text; the server never responds to any of them.

I can see the data being sent with tcpdump, all in one packet, but the server itself does nothing with it (other than ACK the packet).

Is there something additional I need to do to satisfy Java's "modified" UTF-8 implementation?

Thanks.

+3  A: 

This is unrelated to the main part of your question, but I thought I would explain what the "modified UTF-8" that the API expects is. It's UTF-8, except that characters outside the BMP are first split into a UTF-16 surrogate pair and each surrogate is then encoded as if it were its own codepoint, instead of the character being encoded directly in UTF-8. For instance, take the character U+1D11E MUSICAL SYMBOL G CLEF.

  • In UTF-8 it's encoded as the four bytes F0 9D 84 9E.
  • In UTF-16, because it's beyond U+FFFF, it's encoded using the surrogate pair 0xD834 0xDD1E.
  • In "modified UTF-8", it's given the UTF-8 encoding of the surrogate pair codepoints: that is, you encode "\uD834\uDD1E" into UTF-8, giving ED A0 B4 ED B4 9E, which is six bytes long.

When using this format, Java will also encode any embedded null (U+0000) using the illegal overlong form C0 80 instead of a single 00 byte, ensuring that a "modified UTF-8" string never contains a zero byte.

If you're not sending any characters outside of the BMP or any nulls, though, there's no difference from the real thing ;)
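
The rules above can be sketched in Perl roughly as follows. The name encode_modified_utf8 is made up for illustration (it is not part of Encode or any other module), and the sketch assumes its input is a Perl character string rather than already-encoded bytes.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): encode a Perl character
# string as Java-style "modified UTF-8" bytes.
sub encode_modified_utf8 {
    my ($str) = @_;
    my $out = '';
    for my $cp (map { ord } split //, $str) {
        # Characters beyond the BMP are first split into a UTF-16 surrogate pair.
        my @units = $cp > 0xFFFF
            ? (0xD800 + (($cp - 0x10000) >> 10), 0xDC00 + (($cp - 0x10000) & 0x3FF))
            : ($cp);
        for my $u (@units) {
            if ($u >= 0x01 && $u <= 0x7F) {
                # Plain ASCII (except NUL) is a single byte.
                $out .= chr($u);
            }
            elsif ($u <= 0x7FF) {
                # Two-byte form; U+0000 lands here and becomes C0 80.
                $out .= pack('C2', 0xC0 | ($u >> 6), 0x80 | ($u & 0x3F));
            }
            else {
                # Three-byte form, also used for each surrogate half.
                $out .= pack('C3',
                    0xE0 | ($u >> 12),
                    0x80 | (($u >> 6) & 0x3F),
                    0x80 | ($u & 0x3F));
            }
        }
    }
    return $out;
}

# The G clef example from above: prints ED A0 B4 ED B4 9E
print join(' ', map { sprintf '%02X', $_ } unpack 'C*', encode_modified_utf8("\x{1D11E}")), "\n";
```

For strings with no nulls and nothing outside the BMP, the result should match what encode_utf8 produces.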

Here's some documentation courtesy of Sun.

hobbs
Good info, thank you for clarifying.
Bryan Bueter
+4  A: 

You have to implement the protocol correctly:

First, the total number of bytes needed to represent all the characters of s is calculated. If this number is larger than 65535, then a UTFDataFormatException is thrown. Otherwise, this length is written to the output stream in exactly the manner of the writeShort method; after this, the one-, two-, or three-byte representation of each character in the string s is written.

As indicated in the docs for writeShort, it sends a 16-bit quantity in network (big-endian) byte order, high byte first.
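
To make the byte layout concrete, here is a small sketch that builds the frame for the message from the question without touching a socket. The variable names are only illustrative; it assumes the payload is ASCII, so its byte length equals its character length.

```perl
use strict;
use warnings;

my $msg = "CLIENTTYPE|JDSC#0.5.9#0.2";   # 25 ASCII bytes

# Explicit form: 16-bit big-endian length, then the payload bytes.
my $frame = pack('n', length $msg) . $msg;

# pack's "length/item" template builds the same frame in one step.
my $same = pack('n/a*', $msg);

# Show the two-byte length prefix in hex: prints 0019
print unpack('H*', substr($frame, 0, 2)), "\n";

my $verdict = $frame eq $same ? 'identical' : 'different';
print "$verdict\n";                       # prints identical
```

The same two-byte prefix is what readUTF expects before it will consume any payload, which is why a bare string without it gets no reply.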

In Perl, that resembles

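sub sendmsg {
  my($s,$msg) = @_;

  # $msg must already be a byte string: the length prefix counts bytes, not characters.
  die "message too long" if length($msg) > 0xffff;

  # 16-bit big-endian length (writeShort), followed by the payload.
  my $sent = $s->send(
    pack(n => length($msg)) .
    $msg
  );

  die "send: $!"    unless defined $sent;
  die "short write" unless $sent == length($msg) + 2;
}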

sub readmsg {
  my($s) = @_;
  my $buf;
  my $nread;

  # Read the 2-byte length prefix first.
  $nread = $s->read($buf, 2);
  die "read: $!"   unless defined $nread;
  die "short read" unless $nread == 2;

  my $len = unpack n => $buf;

  # Then read exactly that many payload bytes.
  $nread = $s->read($buf, $len);
  die "read: $!"   unless defined $nread;
  die "short read" unless $nread == $len;

  # Raw bytes; not decoded from modified UTF-8.
  $buf;
}

Although the code above doesn't perform modified UTF-8 encoding, the message here is plain ASCII, for which the bytes are the same either way, so it elicits a response:

my $remote = IO::Socket::INET->new(
  Proto => 'tcp',
  PeerAddr => 'localhost',
  PeerPort => '10421'
) or die "Cannot connect to server: $@\n";

my $msg = "CLIENTTYPE|JDSC#0.5.9#0.2";

sendmsg $remote, $msg;

my $buf = readmsg $remote;
print "[$buf]\n";

Output:

[SERVERTYPE|JDuplicate#0.5.9 beta (build 584)#0.2]
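
readmsg returns the payload as raw bytes. If a reply ever contains non-ASCII text, something like the following sketch could turn those bytes back into a Perl character string. decode_modified_utf8 is a hypothetical name, not an existing module function, and the sketch assumes well-formed input with only minimal error checks.

```perl
use strict;
use warnings;

# Hypothetical helper: decode Java-style "modified UTF-8" bytes
# into a Perl character string.
sub decode_modified_utf8 {
    my ($bytes) = @_;
    my @b = unpack 'C*', $bytes;
    my @units;

    # Pass 1: decode one-, two-, and three-byte sequences into 16-bit units.
    while (@b) {
        my $c = shift @b;
        if ($c < 0x80) {
            push @units, $c;
        }
        elsif (($c & 0xE0) == 0xC0) {
            die "truncated sequence" unless @b >= 1;
            my $c2 = shift @b;
            push @units, (($c & 0x1F) << 6) | ($c2 & 0x3F);   # C0 80 gives 0
        }
        elsif (($c & 0xF0) == 0xE0) {
            die "truncated sequence" unless @b >= 2;
            my ($c2, $c3) = splice @b, 0, 2;
            push @units, (($c & 0x0F) << 12) | (($c2 & 0x3F) << 6) | ($c3 & 0x3F);
        }
        else {
            die sprintf "unexpected lead byte 0x%02X", $c;
        }
    }

    # Pass 2: merge high/low surrogate pairs into single codepoints.
    my $out = '';
    while (@units) {
        my $u = shift @units;
        if ($u >= 0xD800 && $u <= 0xDBFF
            && @units && $units[0] >= 0xDC00 && $units[0] <= 0xDFFF) {
            my $lo = shift @units;
            $u = 0x10000 + (($u - 0xD800) << 10) + ($lo - 0xDC00);
        }
        $out .= chr($u);
    }
    return $out;
}

# Six bytes in, one character out: prints U+1D11E
printf "U+%X\n", ord decode_modified_utf8("\xED\xA0\xB4\xED\xB4\x9E");
```

For the ASCII-only reply shown above, the decoded string is identical to the raw bytes.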
Greg Bacon
Perfect! This is exactly what I was looking for. I'm now able to communicate back and forth as expected. Thank you.
Bryan Bueter
You're welcome! I'm glad to help.
Greg Bacon