views:

91

answers:

5

Hello everyone!

We're trying to implement a proxy proof of concept, but we've encountered an interesting question: since a single HTTP connection can, and indeed should, carry multiple requests, and since HTTP transactions are sent via multiple packets (due to TCP's magic), is it possible for an HTTP request to begin in the middle of a packet?

Bear in mind that this is not a theoretical question about possible browser optimizations, but about whether it actually happens in real life. It would be even better if someone could point me to a written reference on whether this is possible and, if so, how often it occurs.

Clarification update: we know that if we worked at the HTTP layer alone we wouldn't need to bother with this question; however, we're trying to figure out whether some advanced technique could be applied by working at the TCP layer first.

Thanks ahead, Aviad.

+2  A: 

First of all, TCP is a stream-based protocol and has no concept of packets. HTTP itself might have some kind of message or record delimiter, but TCP doesn't.

This page might be helpful: Structure of HTTP Transactions

From your question it sounds like you think that each read from a TCP socket is a "packet" of data. In reality, each read simply reads as many bytes as are in the buffer up to the maximum that you requested, without any concept of records or packets.

So, for instance, let's say you read 2048 bytes from the socket: you could have the tail end of one transaction, followed by the beginning of a second response halfway through the data you read, and only get the remainder of that second response on your next read from the socket.
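
For illustration, here's a minimal Python sketch of exactly that situation (the host and paths are just placeholders, not anything specific to your setup): pipeline two requests on one connection and keep calling recv(); each call returns however many bytes happen to be buffered, with no regard for where one response ends and the next begins.

# Minimal sketch: two pipelined HTTP/1.1 requests on one TCP connection.
# "example.com" and the paths are placeholders; any keep-alive server will do.
import socket

sock = socket.create_connection(("example.com", 80))
requests = (
    "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: keep-alive\r\n\r\n"
    "GET /index.html HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"
)
sock.sendall(requests.encode("ascii"))

# Each recv() returns whatever bytes are buffered -- a single read may hold
# the tail of the first response followed by the start of the second.
chunks = []
while True:
    data = sock.recv(2048)
    if not data:
        break
    chunks.append(data)
sock.close()

stream = b"".join(chunks)
# The only reliable way to find response boundaries is to parse the stream
# (Content-Length / chunked encoding), not to count packets or reads.
print(stream.count(b"HTTP/1.1"), "responses in one byte stream")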

If you're here in Jerusalem or nearby, maybe I could help you out.

Robert S. Barnes
What you're saying is obviously theoretically correct, and that's what we thought as well. Also, no good HTTP documentation denies that this is an option. However, we haven't been able to produce such an event using normal browsers yet, and wondered if there are some 'dark arts' limitations that all browsers adhere to that we're not aware of.
Aviad Ben Dov
@Aviad: It's not clear to me exactly what kind of event you are referring to. If you mean the next transaction starting in the middle of a buffer you've read, what have you tried? If you write a short script to issue, say, 20 GET requests using HTTP 1.1 for a set of 1 KB files and do a 20 KB `recv` from the socket using the `MSG_WAITALL` flag, then I think you would almost be guaranteed to get multiple files back from separate transactions in one read. You should end up with [ack1 header1 file1 ack2 header2 file2, etc.] all in one read in your buffer (a rough sketch follows below).
Robert S. Barnes
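
A rough Python sketch of the experiment described in the comment above. The localhost:8080 server and the /fileN.txt paths are hypothetical, MSG_WAITALL is only honoured on some platforms, and the call will block until the full 20 KB has arrived.

# Hypothetical setup: a local server on localhost:8080 serving /file1.txt
# ... /file20.txt, each roughly 1 KB.
import socket

sock = socket.create_connection(("localhost", 8080))
pipeline = "".join(
    "GET /file%d.txt HTTP/1.1\r\nHost: localhost\r\n\r\n" % i
    for i in range(1, 21)
)
sock.sendall(pipeline.encode("ascii"))

# One large blocking read: the buffer should contain several complete
# responses back to back -- header1 body1 header2 body2 ...
data = sock.recv(20 * 1024, socket.MSG_WAITALL)
print("read %d bytes containing %d response headers"
      % (len(data), data.count(b"HTTP/1.1")))
sock.close()
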
That might be, but it wouldn't be a "real world situation", i.e. a browser... Or would it?
Aviad Ben Dov
Let's say you are a browser and you want to load a page with a bunch of images. You request the page, parse it and get the paths of 10 images. There are basically 3 ways to fetch them: 1. Serially, HTTP/1.0 style, getting one image per connection. 2. HTTP/1.1 style, reusing a persistent connection and issuing requests for all 10 images one after the other. This may cause a problem known as 'head of line blocking'. 3. Either in the same process or in separate threads, opening a separate connection for each image. This solves the head of line blocking problem. I think most modern browsers use method #3.
Robert S. Barnes
Cont: The problem is that method #2 is completely valid and has to be dealt with in a robust application. And in case #2 you can quite realistically get the beginning of a transaction in the middle of a read buffer. There's really no way around this, because HTTP/1.1 requires that you support chunked data with no Content-Length header (see the sketch after this comment).
Robert S. Barnes
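
A minimal sketch of decoding such a chunked body once the bytes are in hand; a real proxy would feed data in incrementally as reads complete rather than from a single in-memory string.

# Decode an HTTP/1.1 chunked body: each chunk is a hex size line, CRLF,
# the chunk data, CRLF; a zero-length chunk terminates the body.
def decode_chunked(body: bytes) -> bytes:
    out = b""
    while True:
        size_line, _, rest = body.partition(b"\r\n")
        size = int(size_line.split(b";")[0], 16)  # size may carry chunk extensions
        if size == 0:
            return out                            # terminating zero-length chunk
        out += rest[:size]
        body = rest[size + 2:]                    # skip chunk data and trailing CRLF

# b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n" decodes to b"Wikipedia"
print(decode_chunked(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"))
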
From SeaMonkey 2.0 Help:
* Use HTTP 1.1: Choose this to use the new version of HTTP, which offers performance enhancements, including more efficient use of HTTP connections, better support for client-side caching, multiple HTTP requests (pipelining), and more refined control over cache expiration and replacement policies.
* Enable Keep-Alive: Select this to keep a connection open to make additional HTTP requests, increasing speed.
* Enable Pipelining: Select this to enable pipelining, which allows more than one HTTP request to be sent to the server at once, reducing delays loading web pages.
Robert S. Barnes
A: 

It depends on which abstraction layer's "packet" you are talking about: there are many layers underneath HTTP.

HTTP --> TCP (byte stream) --> IP (packet) --> (possibly something else) Ethernet (frame) --> (possibly) some other transport

If you are talking about the IP layer, then yes, the HTTP data would start later on in the packet... Note that TCP presents a "byte stream" interface to its client layer, hence there is no concept of a packet there.

jldupont
+2  A: 

Assuming that you are talking about IP packets: yes, it is possible for an HTTP request to start in the middle of an IP packet.

When you are using persistent HTTP connections, that is, using the same TCP connection for several HTTP requests, it is entirely possible that a request boundary falls in the middle of an IP packet.

Also, there is the TCP protocol between IP and HTTP. TCP has its own headers, so an IP packet may start with TCP headers and the rest of the packet may consist of HTTP request data.

An HTTP request may also span several IP packets (in the case of file uploads, transmission errors and subsequent retransmissions, etc.).

However, I wonder why you are interested in packets if you are working at the HTTP level. TCP should hide the IP packet details.

Juha Syrjälä
A: 

Unless you are implementing your own TCP stack, you should not need to worry about the packets, but rather about the API that TCP provides; in the case of the POSIX interface, that would be recv() or read(). So I'd treat the question as "Can more than one HTTP request arrive in a single read(), and can an HTTP request be split across multiple read() calls?" -- the answer to both is "yes, it is possible".
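
For example, here is a sketch of the buffering this implies at the API level: append whatever recv() returns to a buffer and carve complete request headers out of it, however the bytes happened to be split across reads (body handling via Content-Length or chunked encoding is omitted).

import socket

def read_request_heads(conn: socket.socket):
    # Yield complete request header blocks from a connected socket.
    buf = b""
    while True:
        data = conn.recv(4096)       # may hold part of one request, or several
        if not data:
            return
        buf += data
        while b"\r\n\r\n" in buf:    # a complete header block is available
            head, buf = buf.split(b"\r\n\r\n", 1)
            yield head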

An example of where this can happen is HTTP pipelining. It's not frequent in real life (ironically, at least some of the browsers disable it by default because of "buggy proxies" :-), but when it happens it can be a bit of a problem for users to diagnose, especially if they have no access to the proxy.

One very notable place where it does happen by default is apt-get on Debian-derived Linux systems. Just install a Debian or Ubuntu server and try to use it through your proxy. You can do that by editing the /etc/apt/apt.conf.d/proxy file and placing the following there:

Acquire::http::Proxy "http://your.proxy.address:8080";
Andrew Y
A: 

I think I understand where you are trying to go with this question.

If you don't use persistent HTTP connections, the HTTP GET request header is always the very first thing sent over the TCP connection, so we can be sure that the start of the HTTP GET request header does "not start in the middle of some TCP packet". But keep in mind that there may be one or more TCP packets without any user data, e.g. only a SYN, which may precede the TCP packet with the start of the HTTP GET request header. And also keep in mind that the HTTP GET request header may not be contained in a single TCP packet.

If you do use persistent HTTP connections, the HTTP GET request header for request number N+1 can start in the middle of a TCP packet, namely right after the end of the HTTP GET request body of request number N.

If you are asking these questions, you are possibly "doing it wrong". As several other responders have already pointed out, in the vast majority of cases you should probably just be a TCP client, deal with a TCP stream of data, and let the TCP code worry about the TCP packets. (Unless, of course, you are working on some special hardware which looks at individual IP packets as they fly by and tries to do some processing at the HTTP layer.)

Cayle Spandon