Hello

I am interfacing with a Java application via Python. I need to be able to construct byte sequences that contain UTF-8 strings. Java uses a modified UTF-8 encoding in DataInputStream.readUTF(), which is not supported by Python (yet, at least).

Can anybody point me in the right direction to construct Java modified UTF-8 strings in Python?

Update #1: To see a little more about the Java modified UTF-8, check out the readUTF method from the DataInput interface on line 550 here, or here in the Java SE docs.

Update #2: I am trying to interface with a third-party JBoss web app which uses this modified UTF-8 format to read in strings via POST requests by calling DataInputStream.readUTF (sorry for any confusion regarding normal Java UTF-8 string operation).

Thanks in advance.

+1  A: 

Okay, if you need to read the format of DataInput.readUTF, I suspect you'll just have to convert the (well-documented) format into Python.

It doesn't look like it would be particularly hard to do. After reading the length and then the binary data itself, I suggest you use a first pass to work out how many Unicode characters will be in the output, then construct a string accordingly in a second pass. Without knowing Python I don't know the ins and outs of how to efficiently construct a string, but given the linked specification I can't imagine it would be very hard. You might want to look at the source for the existing UTF-8 decoder as a starting point.
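For what it's worth, here is a minimal Python 3 sketch of that decoding approach, assuming the record layout from the DataInput docs (a 2-byte big-endian byte length followed by the modified UTF-8 bytes). The function name `read_java_utf` and its `offset` parameter are made up for illustration. It collects UTF-16 code units the way Java would and then lets Python's UTF-16 codec join any surrogate pairs, instead of doing a separate counting pass.

```python
import struct

def read_java_utf(buf, offset=0):
    """Decode one readUTF-style record from bytes `buf`; return (text, next_offset)."""
    (length,) = struct.unpack_from('>H', buf, offset)  # unsigned 16-bit byte count
    i = offset + 2
    end = i + length
    if end > len(buf):
        raise ValueError('record is truncated')
    units = []  # UTF-16 code units, one per encoded group
    while i < end:
        b = buf[i]
        if b < 0x80:              # 0xxxxxxx: one byte
            n, unit = 1, b
        elif b >> 5 == 0b110:     # 110xxxxx 10xxxxxx: two bytes (also C0 80 for U+0000)
            n, unit = 2, b & 0x1F
        elif b >> 4 == 0b1110:    # 1110xxxx 10xxxxxx 10xxxxxx: three bytes
            n, unit = 3, b & 0x0F
        else:                     # 10xxxxxx or 1111xxxx lead bytes are not allowed
            raise ValueError('invalid lead byte at offset %d' % i)
        if i + n > end:
            raise ValueError('group runs past the end of the record')
        for c in buf[i + 1:i + n]:
            if c & 0xC0 != 0x80:
                raise ValueError('invalid continuation byte near offset %d' % i)
            unit = (unit << 6) | (c & 0x3F)
        units.append(unit)
        i += n
    # Re-pack the code units as UTF-16 so surrogate pairs become single characters;
    # 'surrogatepass' lets an unpaired surrogate through instead of raising.
    raw = struct.pack('>%dH' % len(units), *units)
    return raw.decode('utf-16-be', 'surrogatepass'), end
```

Calling `read_java_utf(b'\x00\x02hi')` would give back the text `'hi'` and the offset of the next record, so several strings can be read from one buffer in a loop.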

Jon Skeet
A: 

Maybe this can help you, although it looks like it's the reverse of what you're doing:

Connecting a Java applet to a python SocketServer

Ölbaum
+2  A: 

You can usually ignore Modified UTF-8 Encoding (MUTF-8) and just treat it as UTF-8. On the Python side, you can handle it like this:

  1. Convert the string into normal UTF-8 and store the bytes in a buffer.
  2. Write the buffer length in bytes (not the string length) as a 2-byte big-endian binary value.
  3. Write the whole buffer.

I've done this in PHP and Java didn't complain about my encoding at all (at least in Java 5).

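MUTF-8 is mainly used for JNI and other systems with null-terminated strings. It differs from normal UTF-8 in two ways. First, U+0000 is encoded in 2 bytes (0xC0 0x80) instead of 1 byte (0x00). U+0000 is a valid code point, but it rarely shows up in ordinary text, and DataInputStream.readUTF() doesn't enforce the 2-byte form, so it happily accepts either one. Second, characters above U+FFFF are encoded as a surrogate pair of two 3-byte sequences instead of one 4-byte sequence, and readUTF() rejects the 4-byte form with a UTFDataFormatException. So plain UTF-8 works as long as your strings contain no characters above U+FFFF.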
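If you do need exact MUTF-8 output (for U+0000 or for characters above U+FFFF), something along these lines should work in Python 3. This is only a sketch: the names `encode_modified_utf8` and `write_java_utf` are made up for illustration, and it walks the string as UTF-16 code units because that is how Java sees it.

```python
import struct

def encode_modified_utf8(text):
    """Encode `text` the way Java's modified UTF-8 does, one UTF-16 code unit at a time."""
    out = bytearray()
    # 'surrogatepass' keeps unpaired surrogates instead of raising an error.
    utf16 = text.encode('utf-16-be', 'surrogatepass')
    for (unit,) in struct.iter_unpack('>H', utf16):
        if 0x0001 <= unit <= 0x007F:
            out.append(unit)                          # plain ASCII, one byte
        elif unit <= 0x07FF:
            out.append(0xC0 | (unit >> 6))            # two bytes; U+0000 becomes C0 80
            out.append(0x80 | (unit & 0x3F))
        else:
            out.append(0xE0 | (unit >> 12))           # three bytes, including each
            out.append(0x80 | ((unit >> 6) & 0x3F))   # half of a surrogate pair
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)

def write_java_utf(text):
    """Return the bytes readUTF expects: 2-byte big-endian byte length, then the payload."""
    payload = encode_modified_utf8(text)
    if len(payload) > 0xFFFF:
        raise ValueError('encoded string is longer than 65535 bytes')
    return struct.pack('>H', len(payload)) + payload
```

For text without U+0000 and without characters above U+FFFF, the payload is byte-for-byte the same as `text.encode('utf-8')`, which is why the simpler approach works in most cases.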

EDIT: The Python code should look like this,

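import struct

def writeUTF(data, text):
    # Parameter is named `text`, not `str`, so the built-in str() is not shadowed.
    utf8 = text.encode('utf-8')
    length = len(utf8)
    # 2-byte big-endian byte length (struct.error if over 65535), then the raw bytes.
    data.append(struct.pack('!H', length))
    data.append(utf8)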
ZZ Coder
Sounds good, thanks. Checking it out now.
QAZ
I am learning Python so I converted my PHP function for you.
ZZ Coder