In my application, I receive a URL-UTF8 encoded string of characters, which is split up by the sending client. After splitting, each message part includes some header information which is meant to be used to reconstruct the message.

With English characters, it's pretty straightforward:

String content = new String(request.getParameter("content").getBytes("UTF-8"));

I store this along with the header information in a buffer for each received part. When all parts have been received, I simply recompose the message by concatenating the individual parts according to the header information.

With languages whose characters take more than one byte in UTF-8 (two bytes each for Hebrew), this sometimes does not work as expected. Everything works fine if the split does NOT happen in the middle of a single character.

For instance here's a string of three Hebrew characters being sent by the client:

%D7%93%D7%99%D7%91

If this winds up split as follows: {%D7%93%D7%99} {%D7%91}, reconstruction isn't a problem.

However sometimes the client splits it up in the middle (example: {%D7%93%D7} {%99%D7%91})

When this happens, after reconstruction I get two � characters at the boundary point instead of the single correct Hebrew character.

I thought the inability to retain the single-byte information came from passing Strings around, so I tried passing the byte array from request.getParameter("content").getBytes("UTF-8") to the buffer instead, without wrapping it in a String. In the buffer I joined all these arrays BEFORE converting the final array to a String.

Even after doing this, it appears I still "lost" that information held by the single bytes. I'm guessing this is because the getBytes("UTF-8") method can't correctly resolve the single bytes since they are not valid characters. Is that right?

Is there any way I can get around this and preserve these tail/head bytes?

A: 

You must first collect all bytes and then convert them all at once into a string.
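As a rough illustration using the Hebrew example from the question, here is a minimal sketch. The class name and the `percentDecode` helper are hypothetical, made up for this example; the helper turns percent-escapes into raw bytes without ever interpreting them as characters. Decoding each fragment on its own produces U+FFFD replacement characters at the split point, while joining the raw bytes first and decoding once does not.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class FragmentDecodeDemo {

    // Hypothetical helper: turns "%D7%93" into the raw bytes {0xD7, 0x93}.
    // Anything that is not a percent-escape is assumed to be plain ASCII.
    static byte[] percentDecode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else if (c == '+') {
                out.write(' ');
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    // Broken approach: each fragment becomes a String before the parts are joined.
    static String decodeEachThenJoin(String... parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            sb.append(new String(percentDecode(p), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Working approach: join the raw bytes first, then decode exactly once.
    static String joinThenDecode(String... parts) {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (String p : parts) {
            byte[] b = percentDecode(p);
            all.write(b, 0, b.length);
        }
        return new String(all.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String[] parts = {"%D7%93%D7", "%99%D7%91"}; // split mid-character
        System.out.println(joinThenDecode(parts).equals("\u05D3\u05D9\u05D1"));
        System.out.println(decodeEachThenJoin(parts).contains("\uFFFD"));
    }
}
```

Note that this only helps if the server can actually get at the undecoded bytes; once a fragment has been turned into a String with replacement characters, calling getBytes() on it cannot bring the original bytes back.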

Aaron Digulla
That's what I did. Passed all byte arrays to buffer, and created a new array that joins all of these guys. Does not work.
bernie
No, that's not what you did. You converted the already broken Strings back to bytes.
Michael Borgwardt
A: 

You never need to convert a String to bytes and then back to a String in Java; it is completely pointless. Once a series of bytes has been decoded to a String, it is held in Java's internal String encoding (UTF-16, I think).

The problem you have is that the application server is making an assumption about the encoding of the incoming HTTP request, usually the platform encoding. You can give the application server a hint as to the expected encoding by calling ServletRequest.setCharacterEncoding(String) before anything else calls getParameter().

Browsers assume that form fields should be submitted back to the server using the same encoding that the page was served with. This is a general rule, as the HTTP spec doesn't have a way to specify the encoding of the incoming request, only the response.

Spring has a nice Filter to do this for you, CharacterEncodingFilter. If you define it as the very first filter in web.xml, most of your encoding issues will go away.

Gareth Davis
Setting the encoding won't enable the server to decode incomplete multibyte characters.
Michael Borgwardt
+2  A: 

Your client is the problem here. Apparently it treats the text data as a byte array for the purpose of splitting it up, and then sends the invalid fragments as text (HTTP request parameters are inherently textual). At that point, you have already lost.

You either have to change the client to split the data as text (i.e. along character boundaries), or change your protocol to send the fragments as binary data, i.e. not as a parameter but as the request body, to be retrieved via ServletRequest.getInputStream() - then, concatenating the data before decoding it should work.

(Caveat: the above assumes that you are indeed writing Servlet code, which I inferred from the request.getParameter() method; but even if that's a coincidence the same principles apply: either split the data as a String before any conversion to byte[] happens on the client side, or make sure you concatenate the byte arrays on the server before any conversion to String happens.)
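A minimal sketch of the second option (fragments sent as the request body) could look like the following. It assumes the fragments are appended in the correct order; the `BodyFragmentBuffer` class is a hypothetical name for illustration, and a plain `InputStream` stands in for what ServletRequest.getInputStream() would return so the example runs without a servlet container.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical buffer that collects raw fragment bytes and decodes them only at the end.
class BodyFragmentBuffer {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    // In a servlet, `in` would be request.getInputStream() for one fragment.
    void append(InputStream in) throws IOException {
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
    }

    // Decode once, after every fragment has been appended.
    String decode() {
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] whole = "\u05D3\u05D9\u05D1".getBytes(StandardCharsets.UTF_8); // 6 bytes
        BodyFragmentBuffer buf = new BodyFragmentBuffer();
        // Simulate a split in the middle of the second character (3 bytes + 3 bytes).
        buf.append(new ByteArrayInputStream(whole, 0, 3));
        buf.append(new ByteArrayInputStream(whole, 3, 3));
        System.out.println(buf.decode().equals("\u05D3\u05D9\u05D1"));
    }
}
```

Reordering by the header information is left out here; the point is only that no String is created until all bytes are present.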

Michael Borgwardt
Thanks Michael, it is indeed Servlet code. However, I have significantly less flexibility in modifying the client end (which happens to be written in C and spits out the URL-encoded bytes of the message). Do I have any other options on the server end? Some means of directly retrieving the bytes of the request parameter before it's string-ified?
bernie
A: 

The following scheme is a hack, but it should work in your case:

  • Set your server/page to Latin-1 mode. If this is a GET, the client has no way to set the encoding, so you have to do this on the server's end. For example, in Tomcat you need to add URIEncoding="iso-8859-1" to the connector.

  • Get the content as Latin-1. The value will be wrong at this point, but don't worry:

    String content = request.getParameter("content");

  • Concatenate the string as Latin-1.

    data = data + content;

  • When you have the whole thing, reinterpret the string's bytes as UTF-8 like this:

    String value = new String(data.getBytes("iso-8859-1"), "utf-8");

The value should contain the correct characters.
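Why this can work: ISO-8859-1 maps every byte value to exactly one character, so a Latin-1 string carries the original bytes through unchanged, even when a fragment ends mid-character. Here is a small self-contained sketch of the round trip. The class and method names are hypothetical, and `asLatin1` merely simulates what getParameter() would return if the container decodes the parameter as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

class Latin1RoundTripDemo {

    // Simulates the container decoding raw parameter bytes as ISO-8859-1.
    static String asLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Concatenate the Latin-1 fragments, then reinterpret the bytes as UTF-8.
    static String reassemble(String... latin1Parts) {
        StringBuilder data = new StringBuilder();
        for (String part : latin1Parts) {
            data.append(part);
        }
        byte[] original = data.toString().getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The fragments from the question: {%D7%93%D7} and {%99%D7%91}.
        byte[] first = {(byte) 0xD7, (byte) 0x93, (byte) 0xD7};
        byte[] second = {(byte) 0x99, (byte) 0xD7, (byte) 0x91};
        String value = reassemble(asLatin1(first), asLatin1(second));
        System.out.println(value.equals("\u05D3\u05D9\u05D1"));
    }
}
```

The intermediate strings are gibberish and should never be displayed or stored as text; they are only a container for bytes until the final conversion.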

ZZ Coder
Please, please don't use this horrible workaround when there is any chance at all of actually fixing the problem...
Michael Borgwardt
It will fix the problem. I had exactly the same problem before. If you have no way to change the client, this is the only way to go.
ZZ Coder