ansaurus

Question

Answer 1

+2 A:

See the HTTP spec for the rules, which say in section 2.2:

The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14].

The above code will not correctly decode an RFC 2047 encoded string, leading me to believe that the service doesn't correctly follow the spec and is just embedding raw UTF-8 data in the header.

superfell 2008-11-27 19:54:13

Answer 2

+3 A:

As mentioned already, the first look should always go to the HTTP 1.1 spec (RFC 2616). It says that text in header values must use the MIME encoding as defined in RFC 2047 if it contains characters from character sets other than ISO-8859-1.

So here's a plus for you. If your requirements are covered by the ISO-8859-1 charset then you just put your characters into your request/response messages. Otherwise MIME encoding is the only alternative.

As long as the user agent sends the values of your custom headers according to these rules, you won't have to worry about decoding them. That's what the Servlet API should do.

However, there's a more basic reason why your code snippet doesn't do what it's supposed to. The first line fetches the header value as a Java string. A Java string is a sequence of UTF-16 code units internally, not the raw bytes from the wire, so at this point the parsing and decoding of the HTTP request message is already done and finished.

The next line fetches the byte array of this string. Since no encoding was specified (IMHO this method with no argument should have been deprecated long ago), the current system default encoding is used, which is usually not UTF-8, and then the array is converted back to a string as if it were UTF-8 encoded. Ouch.

mkoeller 2008-11-27 20:30:07

You are right about getBytes(). This can be fixed using getBytes("iso-8859-1").
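As a rough sketch of that fix, assuming the container really did decode the raw header bytes as ISO-8859-1 and the client really sent UTF-8: the class and method names below are made up for illustration, and StandardCharsets needs Java 7 or later (on older JDKs the string names "ISO-8859-1" and "UTF-8" do the same job).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Servlet API.
final class HeaderCharsetFix {

    // Assumption: 'value' was produced by decoding the wire bytes as
    // ISO-8859-1, so every char maps back to exactly one original byte.
    static String reinterpretAsUtf8(String value) {
        if (value == null) {
            return null;
        }
        // Step 1: recover the original bytes (lossless for ISO-8859-1).
        byte[] raw = value.getBytes(StandardCharsets.ISO_8859_1);
        // Step 2: decode those bytes with the charset the sender used.
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

You would call it on the result of request.getHeader(...). If the container used some other charset to decode the header, the round trip is not guaranteed to be lossless.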

ebruchez 2008-12-01 15:30:53

Answer 3

A:

Thanks for the answers. It seems that the ideal would be to follow the proper HTTP header encoding as per RFC 2047. Header values in UTF-8 on the wire would look something like this:

=?UTF-8?Q?...?=

Now here is the funny thing: it seems that neither Tomcat 5.5 nor Tomcat 6 properly decodes HTTP headers as per RFC 2047! The Tomcat code assumes everywhere that header values use ISO-8859-1.

So for Tomcat, specifically, I will work around this by writing a filter which handles the proper decoding of the header values.
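The decoding step of such a filter could look roughly like the sketch below. It is not Tomcat code and not a complete RFC 2047 implementation: it handles a single encoded word of the form =?charset?encoding?text?= with Q or B encoding, and ignores things like multiple adjacent words and language suffixes. The class name Rfc2047Sketch is made up, and java.util.Base64 requires Java 8 or later.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes ONE RFC 2047 encoded word, nothing more.
final class Rfc2047Sketch {

    // =?charset?encoding?encoded-text?=
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?]+)\\?([bBqQ])\\?([^?]*)\\?=");

    static String decodeWord(String word) {
        Matcher m = ENCODED_WORD.matcher(word);
        if (!m.matches()) {
            return word; // not an encoded word: return it unchanged
        }
        // Throws an unchecked exception if the charset name is unknown.
        Charset charset = Charset.forName(m.group(1));
        String text = m.group(3);
        byte[] bytes = m.group(2).equalsIgnoreCase("B")
                ? Base64.getDecoder().decode(text)
                : decodeQ(text);
        return new String(bytes, charset);
    }

    // Q encoding: '_' means space, '=XX' is a hex-encoded byte,
    // everything else is taken as a literal ASCII character.
    private static byte[] decodeQ(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '_') {
                out.write(' ');
            } else if (c == '=' && i + 2 < text.length()) {
                // Malformed hex digits raise NumberFormatException here.
                out.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }
}
```

For example, decodeWord("=?UTF-8?Q?caf=C3=A9?=") gives "café", and a value that is not an encoded word comes back untouched.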

ebruchez 2008-12-01 15:34:46

Look at javax.mail.internet.MimeUtility for this support: http://java.sun.com/j2ee/sdk_1.3/techdocs/api/javax/mail/internet/MimeUtility.html#encodeWord(java.lang.String)

Kevin Hakanson 2009-04-29 16:41:11

Answer 4

+2 A:

The HTTPbis working group is aware of the issue, and the latest drafts get rid of all the language with respect to TEXT and RFC 2047 encoding -- it is not used in practice over HTTP.

See http://trac.tools.ietf.org/wg/httpbis/trac/ticket/74 for the whole story.

Julian Reschke 2008-12-31 17:16:56

Answer 5

+3 A:

Again: RFC 2047 is not implemented in practice. The next revision of HTTP/1.1 is going to remove any mention of it.

So, if you need to transport non-ASCII characters, the safest way is to encode them into a sequence of ASCII characters, as the "Slug" header in the Atom Publishing Protocol does.
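To illustrate the idea, here is a minimal sketch of a Slug-style scheme: take the UTF-8 bytes of the value and percent-encode every octet that is not printable ASCII, plus the percent sign itself. The class name PercentHeaderCodec is invented for this example, and both ends of the connection have to agree on the scheme for it to work.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical codec for custom header values, in the style of the Slug header.
final class PercentHeaderCodec {

    // Produces a pure-ASCII header value.
    static String encode(String value) {
        StringBuilder sb = new StringBuilder();
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            // Printable ASCII passes through, except '%' which must be escaped.
            if (octet >= 0x20 && octet <= 0x7E && octet != '%') {
                sb.append((char) octet);
            } else {
                sb.append('%').append(String.format("%02X", octet));
            }
        }
        return sb.toString();
    }

    // Assumes 'header' is ASCII, as produced by encode().
    static String decode(String header) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < header.length(); i++) {
            char c = header.charAt(i);
            if (c == '%' && i + 2 < header.length()) {
                // Malformed hex digits raise NumberFormatException here.
                out.write(Integer.parseInt(header.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With this, "café 100%" travels as "caf%C3%A9 100%25", which survives any server that treats header values as ISO-8859-1 or plain ASCII.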

2008-12-31 20:24:24

But if the choice of encoding for custom HTTP headers is implementation specific, choosing RFC 2047 encoding is just as valid as any other encoding (such as the one from Atom which you mention). So there is no reason *not* to use RFC 2047 encoding.

Todd Owen 2010-08-21 23:55:53


HTTP headers encoding/decoding in Java
