Parsing HTTP - Bytes.length != String.length

+2 Q:

Hello,

I consume HTTP via nio.SocketChannel, so I get chunks of data as Array[Byte]. I want to feed these chunks into a parser and continue parsing after each chunk has been added.

HTTP itself seems to use an ISO-8859 charset, but the payload/body may be arbitrarily encoded: if the HTTP Content-Length specifies X bytes, the UTF-8-decoded body may have far fewer characters (one character may be represented in UTF-8 by 2 or more bytes).
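To make the mismatch concrete, here is a small sketch (the object name LengthDemo and the sample word are only illustrative) comparing the character count of a string with its byte count in two encodings:

```scala
import java.nio.charset.StandardCharsets

object LengthDemo {
  // "Grüße", written with escapes so the source file encoding does not matter.
  val text: String = "Gr\u00fc\u00dfe"

  // In UTF-8, ü and ß take 2 bytes each, so the 5 characters become 7 bytes.
  val utf8Bytes: Array[Byte] = text.getBytes(StandardCharsets.UTF_8)

  // In ISO-8859-1 every character is exactly 1 byte, so lengths match.
  val latin1Bytes: Array[Byte] = text.getBytes(StandardCharsets.ISO_8859_1)
}
```

A Content-Length of 7 would therefore describe a body of only 5 characters here, which is why the length has to be checked against bytes, not against the decoded String.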

So what is a good parsing strategy that honors an explicitly specified Content-Length and/or a Transfer-Encoding: chunked, which specifies a chunk length to be honored? The options I see:

1. Append each data chunk to a mutable.ArrayBuffer[Byte], search for CRLF in the bytes, decode everything from 0 until CRLF to a String and match it with regular expressions like StatusRegex, HeaderRegex, etc.?
2. Decode each data chunk with the proper charset (e.g. ISO-8859, UTF-8, etc.) and add it to a StringBuilder. With this solution I am not able to honor any Content-Length or chunk size, but do I have to care about that?
3. Any other solution?
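For the Transfer-Encoding: chunked case, each chunk is preceded by a line that gives the chunk size in bytes as a hexadecimal number, optionally followed by extensions after a semicolon. A minimal sketch of reading that size, assuming the size line has already been decoded to a String (ChunkSizeSketch and parseChunkSize are made-up names):

```scala
object ChunkSizeSketch {
  /** Parses a chunk-size line such as "1a" or "1A;name=value" into a byte count. */
  def parseChunkSize(line: String): Option[Int] = {
    // Ignore any chunk extension after ';' and surrounding whitespace.
    val hex = line.takeWhile(_ != ';').trim
    val parsed =
      try Some(Integer.parseInt(hex, 16)) // the size is hexadecimal
      catch { case _: NumberFormatException => None }
    parsed.filter(_ >= 0) // reject negative values such as "-1"
  }
}
```

The result is a number of bytes, so it can only be honored while the body is still held as bytes.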

+1 A:

You could use UTF-16, which is Java's internal String representation anyway. It's 2 bytes for each character, except when there's a surrogate. So you could scan the string for surrogate characters up to the length allowed, account for them as appropriate, and just copy the substrings.

Daniel 2010-06-10 20:29:56

Thanks for the hint, I will need to look into these surrogates. Currently I have severe problems using the CharsetDecoder properly; it reports MALFORMED[1] from time to time. My attempt is here: http://github.com/hotzen/Thesis/blob/master/src/dataflow/io/http/Parser.scala#L416 Appreciate any comments.

hotzen 2010-06-11 10:03:41

I accumulate all Array[Byte] chunks in an ArrayBuffer, which allows me to count bytes. HTTP protocol decoding (status line + headers) is done by searching for the CRLF position and then decoding the bytes from 0 until CRLF as ISO-8859-1.
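A minimal sketch of that status/header step, not the code from the linked Parser.scala: HeadLineSketch, indexOfCrlf and takeLine are hypothetical names, and the buffer is assumed to be a plain mutable.ArrayBuffer[Byte] that incoming chunks are appended to.

```scala
import java.nio.charset.StandardCharsets
import scala.collection.mutable.ArrayBuffer

object HeadLineSketch {
  private val CR: Byte = 13
  private val LF: Byte = 10

  /** Index of the first CRLF at or after `from`, or -1 if none is buffered yet. */
  def indexOfCrlf(buf: ArrayBuffer[Byte], from: Int = 0): Int = {
    var i = from
    while (i + 1 < buf.length) {
      if (buf(i) == CR && buf(i + 1) == LF) return i
      i += 1
    }
    -1
  }

  /** Removes and returns the next complete line (without CRLF), if one is buffered. */
  def takeLine(buf: ArrayBuffer[Byte]): Option[String] = {
    val end = indexOfCrlf(buf)
    if (end < 0) None // line not complete yet, wait for more data
    else {
      val line = new String(buf.slice(0, end).toArray, StandardCharsets.ISO_8859_1)
      buf.remove(0, end + 2) // drop the line and its CRLF
      Some(line)
    }
  }
}
```

Calling takeLine after every received chunk yields one status or header line at a time and returns None while a line is still incomplete.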

Chunked bodies are accumulated in the ArrayBuffer and decoded with the specified charset only once the chunk has been fully stored there. This avoids MALFORMED errors from the CharsetDecoder when UTF-8 data is split right in the middle of a multi-byte character.
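Roughly, the "decode only when complete" idea could look like the following sketch. BodySketch and takeBody are hypothetical names; `length` stands for the byte count taken from Content-Length or from the chunk-size line.

```scala
import java.nio.charset.{Charset, StandardCharsets}
import scala.collection.mutable.ArrayBuffer

object BodySketch {
  /** Returns the decoded body once `length` bytes are buffered, else None. */
  def takeBody(buf: ArrayBuffer[Byte], length: Int, charset: Charset): Option[String] =
    if (buf.length < length) None // still waiting for bytes, nothing is decoded yet
    else {
      // Decoding the complete byte range at once never cuts a character in half.
      val body = new String(buf.slice(0, length).toArray, charset)
      buf.remove(0, length)
      Some(body)
    }
}
```

Because the count is done on bytes before any decoding happens, a multi-byte character that arrives split across two reads is simply completed by the next append.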

For streaming HTML I have no good solution yet; normal HTML is buffered in the ArrayBuffer and decoded after the whole document has been received (like the chunks).
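One possible approach for the streaming case, which is not part of the setup described above: java.nio's CharsetDecoder can be called with endOfInput = false, in which case a trailing incomplete character is left in the input buffer instead of being reported as malformed. A sketch under that assumption (StreamingDecoder and feed are made-up names):

```scala
import java.nio.{ByteBuffer, CharBuffer}
import java.nio.charset.{Charset, CharsetDecoder, StandardCharsets}

final class StreamingDecoder(charset: Charset) {
  private val decoder: CharsetDecoder = charset.newDecoder()
  // Bytes of a character that was cut off at the end of the previous chunk.
  private var leftover: Array[Byte] = new Array[Byte](0)

  /** Decodes as much as possible; a trailing partial character is kept for the next call. */
  def feed(chunk: Array[Byte]): String = {
    val in = ByteBuffer.wrap(leftover ++ chunk)
    val out = CharBuffer.allocate(math.ceil(in.remaining * decoder.maxCharsPerByte).toInt + 1)
    val result = decoder.decode(in, out, false) // false: more input may follow
    if (result.isError) result.throwException() // genuinely malformed input
    leftover = new Array[Byte](in.remaining)    // keep the undecoded tail
    in.get(leftover)
    out.flip()
    out.toString
  }
}
```

Each call returns only the characters that are complete so far, so the output can be passed on immediately. A full implementation would also make a final decode call with endOfInput = true followed by flush once the stream ends.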

hotzen 2010-06-27 10:56:27
