Some web servers return content-length set to zero in the HTTP response headers. I'd like a deterministic and performant solution for receiving all the data in that situation.

URL known to exhibit this behavior (additional URLs below):

http://www.washingtonpost.com/wp-dyn/content/article/2010/02/12/AR2010021204894.html?hpid=topnews

headers:

Cache-control:no-cache
Connection:close
Content-Encoding:gzip
Content-type:text/html
Server:Web Server
Transfer-encoding:chunked

My current solution is not guaranteed to get all the data because of the MaxTries limit, and it is slow because of Thread.Sleep():

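// Catch-all for responses without usable length information: poll the socket
// for up to MaxTries rounds before concluding that no more data is coming.
private bool MoreDataIsAvailable()
{
    int avail = _socket.Available;
    if (avail == 0 &&
        _contentLength != null && _contentLength == 0)
    {
        int tries = 0;
        while (avail == 0 && tries < MaxTries)
        {
            Thread.Sleep(5);
            // Poll takes its timeout in microseconds, so this waits up to 1 ms more.
            _socket.Poll(1000, SelectMode.SelectRead);
            avail = _socket.Available;
            tries++;
            if (avail > 0)
            {
                // Diagnostic trace: socket handle, bytes waiting, bytes stored so far, rounds used.
                Console.WriteLine(_socket.Handle + " avail = " + avail + " received = " + _bytes.Length + " && tries = " + tries);
            }
        }
    }
    return avail > 0;
}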

Usage in context:

private void ReceiveCallback(object sender, SocketAsyncEventArgs e)
{
    if (ConnectionWasClosed(e) || HadSocketError(e))
    {
        _receiveDone.Set();
        return;
    }

    StoreReceivedBytes(e);

    if (AllBytesReceived())
    {
        _receiveDone.Set();
        return;
    }

    if (MoreDataIsExpected() || MoreDataIsAvailable())
    {
        WaitForBytes(e);
    }
    else
    {
        _receiveDone.Set();
    }
}

Sample output:

1436 avail = 3752 received = 1704 && tries = 9
1436 avail = 3752 received = 9208 && tries = 8
1436 avail = 3752 received = 12960 && tries = 9
1436 avail = 3752 received = 20464 && tries = 8
1436 avail = 3752 received = 27968 && tries = 7
1436 avail = 7504 received = 31720 && tries = 1
1436 avail = 3752 received = 39224 && tries = 6

edit:

Nikolai observed that responses with a Transfer-encoding: chunked header need special handling but their ends can be detected deterministically.

Excluding the chunked responses, however, there are still other URLs that end up in my catch-all method, examples:

http://www.biomedcentral.com/1471-2105/6/197

headers:

Cache-control:private
Connection:close
Content-Type:text/html
P3P:policyref="/w3c/p3p.xml", CP="NOI DSP COR CURa ADMa DEVa TAIa OUR BUS PHY ONL UNI COM NAV INT DEM PRE"
Server:Microsoft-IIS/5.0
X-Powered-By:ASP.NET

http://slampp.abangadek.com/info/

headers:

Connection:close
Content-Type:text/html
Server:Apache/2.2.8 (Ubuntu) DAV/2 PHP/5.2.4-2ubuntu5.3 with Suhosin-Patch mod_ruby/1.2.6 Ruby/1.8.6(2007-09-24) mod_ssl/2.2.8 OpenSSL/0.9.8g
X-Cache:MISS from server03.abangadek.com
X-Powered-By:PHP/5.2.4-2ubuntu5.3

http://video.forbes.com/embedvideo/?format=frame&height=515&width=336&mode=render&networklink=1

headers:

Connection:close
Content-Language:en-US
Content-Type:text/html;charset=ISO-8859-1
Server:Apache-Coyote/1.1

I would like to know what I can look for in these responses that, like the Transfer-encoding header did for the first URL, gives a clue to reading the entire response deterministically so that the call to my method can be avoided.

+1  A: 

From the URL given it seems you are looking at HTTP Chunked Transfer Encoding, which allows the server to start transmitting the response before total length is known while still allowing the client to reliably determine end of the response.

Also see RFC 2616, section 3.6.1.
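
To make the framing concrete, here is a minimal sketch of a chunked-body decoder. It is not taken from the question's code: the class name ChunkedBodyReader and its methods are hypothetical, and it assumes a blocking Stream (for example a NetworkStream wrapped around the socket) positioned just after the blank line that ends the response headers. The end of the response is the zero-size chunk, so no polling or retry count is involved.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class ChunkedBodyReader
{
    // Decodes a chunked message body and returns the concatenated chunk data.
    public static byte[] ReadChunkedBody(Stream input)
    {
        var body = new MemoryStream();
        while (true)
        {
            // Chunk-size line: hex digits, optionally followed by ";extension".
            string sizeLine = ReadLine(input);
            int semicolon = sizeLine.IndexOf(';');
            string hex = (semicolon >= 0 ? sizeLine.Substring(0, semicolon) : sizeLine).Trim();
            int size = int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);

            if (size == 0)
            {
                // Last chunk: skip optional trailer lines up to the blank line.
                while (ReadLine(input).Length > 0) { }
                return body.ToArray();
            }

            // Read exactly 'size' bytes; a single Read may return fewer.
            var chunk = new byte[size];
            int read = 0;
            while (read < size)
            {
                int n = input.Read(chunk, read, size - read);
                if (n == 0) throw new EndOfStreamException("Connection closed mid-chunk.");
                read += n;
            }
            body.Write(chunk, 0, size);

            ReadLine(input); // Consume the CRLF that terminates the chunk data.
        }
    }

    // Reads bytes up to the next LF and returns the line without CR or LF.
    private static string ReadLine(Stream input)
    {
        var line = new StringBuilder();
        int b;
        while ((b = input.ReadByte()) != -1)
        {
            if (b == '\n') return line.ToString();
            if (b != '\r') line.Append((char)b);
        }
        throw new EndOfStreamException("Connection closed before end of line.");
    }
}
```

For the Washington Post response above, which also carries Content-Encoding: gzip, the bytes returned by a decoder like this would still be gzip-compressed and would need to be decompressed as a separate step.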

Nikolai N Fetissov
Nikolai, the Wikipedia link was really useful. It helped to redirect a huge percentage of the responses that were passing through my method. However, this URL, http://www.biomedcentral.com/1471-2105/6/197, exhibits the same behavior but does not have a Transfer-encoding header. Do you have any insight into what may be happening with this link?
Handcraftsman
This one sends a **Connection: close** header. That means you just read until you hit EOF.
Nikolai N Fetissov
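
The two cases discussed so far suggest a simple rule for picking a read strategy from the response headers, sketched below. This follows the general order described in RFC 2616, section 4.4, but it is a simplification: it ignores responses that never have a body (HEAD requests, 1xx, 204 and 304 statuses). The names BodyFraming, ResponseFraming and Choose are hypothetical, and the header dictionary is assumed to have been built with a case-insensitive comparer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names, for illustration only.
public enum BodyFraming { Chunked, ContentLength, UntilClose }

public static class ResponseFraming
{
    // headers: response header names mapped to values. The caller is assumed
    // to have created the dictionary with StringComparer.OrdinalIgnoreCase.
    public static BodyFraming Choose(IDictionary<string, string> headers)
    {
        string value;

        // Transfer-Encoding: chunked takes precedence over any Content-Length.
        if (headers.TryGetValue("Transfer-Encoding", out value) &&
            value.IndexOf("chunked", StringComparison.OrdinalIgnoreCase) >= 0)
        {
            return BodyFraming.Chunked;
        }

        // A parseable, non-negative Content-Length means "count the bytes".
        long length;
        if (headers.TryGetValue("Content-Length", out value) &&
            long.TryParse(value.Trim(), out length) && length >= 0)
        {
            return BodyFraming.ContentLength;
        }

        // Neither is present: the body ends when the server closes the connection.
        return BodyFraming.UntilClose;
    }
}
```

Under this rule, the biomedcentral, abangadek and forbes responses listed in the question would all fall into the UntilClose case, since none of them sends either header.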
Nikolai, so far only trying and waiting in a loop for a specified number of attempts has a high probability of getting all the data when no/zero content length is given and it is not chunked. Is there something specific in the socket or elsewhere beyond what I'm already doing in order to know that I've reached EOF?
Handcraftsman
I'm not really familiar with the C# socket API. In normal BSD socket land, a blocking **read()** would return 0 on EOF.
Nikolai N Fetissov
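
For reference, the blocking read-until-EOF loop looks roughly like this in C#. It is a sketch under assumptions rather than code from the question: EofReader and ReadToEnd are hypothetical names, and the input is assumed to be a blocking Stream such as a NetworkStream over the connected socket. Stream.Read returns 0 only at end of stream, which for a network stream means the server has closed its side of the connection.

```csharp
using System.IO;

// Hypothetical helper, for illustration only.
public static class EofReader
{
    // Copies everything from input until Read returns 0 (end of stream).
    public static byte[] ReadToEnd(Stream input)
    {
        var body = new MemoryStream();
        var buffer = new byte[8192];
        int n;
        // Each Read blocks until some data arrives or the stream ends.
        while ((n = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            body.Write(buffer, 0, n);
        }
        return body.ToArray();
    }
}
```

The loop needs no retry count or sleep, but it finishes only when the server actually closes the connection, so it is suitable only for the responses that are neither chunked nor length-delimited.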
C# sockets are also supposed to return 0 if there is no more data. My usage shows, however, that if the current non-empty packet is actually the last packet, calling ReceiveAsync again will put the session into a terminal-wait state. So far I haven't found another way to detect that I have the final packet.
Handcraftsman
Hmm, this might be a problem. Can you try making a special case for **Connection: close** by doing blocking reads until EOF?
Nikolai N Fetissov
A blocking read for these cases definitely works but is deeply dissatisfying. One should be able to do whatever the synchronous call is doing to detect EOF without blocking.
Handcraftsman
It looks like you have lots of machinery there. I'd try to reproduce this particular problem with a very minimal code sample. Remove everything and just do an asynchronous receive chain on that URL. Does it get stuck at the end? Might it be that you are making an incorrect assumption about the API somewhere, or have a logic error? What, for example, does the **MoreDataIsExpected()** call do? If you insist on staying asynchronous, one last hack is to set the **ReceiveTimeout** on seeing **Connection: close**.
Nikolai N Fetissov
I started off with http://msdn.microsoft.com/en-us/library/bew39x2a.aspx as my sample and only added to it as I found limitations in that code. That sample does what you suggest: it keeps looping as long as non-zero bytes are returned. It locks up.
Handcraftsman
The HTTP spec does not explicitly say which side must close the connection, so it looks like the server can still keep it open.
Nikolai N Fetissov