Hello all,

I've created a C++ application using Winsock that includes a small HTTP server (it handles just the few features I need). It is used to communicate with the outside world via HTTP requests. It works, but sometimes requests are not handled correctly because the parsing fails. I'm quite sure the requests are well formed, since they are sent by major web browsers like Firefox/Chrome or by Perl/C# (which have HTTP modules/DLLs).

After some debugging I found out that the problem is in fact in receiving the message. When the message arrives in more than one part (it is not read in a single recv() call), the parsing sometimes fails. I have tried numerous ways to resolve this, but nothing seems to be reliable enough.

What I do now is read data until I find the "\r\n\r\n" sequence, which indicates the end of the header. If WSAGetLastError() reports anything other than 10035 (WSAEWOULDBLOCK) before that sequence is found, meaning the connection was closed or failed, I discard the message. Once I know I have the whole header, I parse it and look for information about the body length. However, I'm not sure whether this information is mandatory (I think not) and what I should do if it is missing: does that mean there will be no body? Another problem is that I do not know whether I should look for a "\r\n\r\n" after the body (if its length is greater than zero).

Does anybody know how to reliably parse an HTTP message?

Note: I know there are implementations of HTTP servers out there. I want my own for various reasons. And yes, reinventing the wheel is bad, I know that too.

+2  A: 

You could try looking at their code to see how they handle an HTTP message.

Or you could look at the spec; there are message length fields you should use. Only buggy browsers send additional CRLFs at the end, apparently.

gbjbaanb
The HTTPbis WG has clarified message parsing; see http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p1-messaging-11.html#message.body for the current draft text.
Julian Reschke
This looks good, thanks. If that helps I will gladly accept your answer.
PeterK
+2  A: 

If you're set on writing your own parser, I'd take the Zed Shaw approach: use the Ragel state machine compiler and build your parser based on that. Ragel can handle input arriving in chunks, if you're careful.

Honestly, though, I'd just use something like this.

Your go-to resource should be RFC 2616, which describes HTTP 1.1 and which you can use to construct a parser. Good luck!

Jack Kelly
+1 for the http-parser and definitive links. That source would generate ***FAST*** code, I'm really impressed. That's badass.
Matt Joiner
A: 

HTTP GET/HEAD requests have no body, and a POST request can have no body too. You have to check whether it's a GET/HEAD; if it is, then no content (body/message) was sent. If it was a POST, do as the spec says about parsing a message of known/unknown length, as @gbjbaanb said.

aularon
GET and HEAD request *can* have a body. So no, you don't check the method name.
Julian Reschke
@Julian, it's not exactly specified in the HTTP specification whether you can include a body in GET/HEAD requests. I tested it locally and it works with Apache, but I've never seen that before in a real-world implementation. I'm reading http://stackoverflow.com/questions/978061/ and http://stackoverflow.com/questions/1266596/ now, thanks for pointing that out.
aularon
@aularon Whether something is used in practice and whether it's allowed are separate questions. What's important is that request parsing is just the same for all methods. (Contrary to response parsing, where HEAD is special.) See also http://trac.tools.ietf.org/wg/httpbis/trac/ticket/19 -- that's why we're revising RFC 2616, after all.
Julian Reschke
@Julian sure thing.
aularon
A: 

Anyway, an HTTP request has "\r\n\r\n" at the end of the request headers and before the request data, if any, even if the request is "GET / HTTP/1.0\r\n\r\n".

If the method is "POST", you should read as many bytes after "\r\n\r\n" as specified in the Content-Length field.

So pseudocode is:

read_until(buf, "\r\n\r\n");
if (buf.starts_with("POST"))
{
   contentLength = regex("^Content-Length: (\d+)$").find(buf)[1];
   read_all(buf, contentLength);
}

There will be "\r\n\r\n" after the content only if the content itself includes it. The content may be binary data; it has no terminating sequence, and the only way to get its size is to use the Content-Length field.
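
Since the original problem was a message arriving over several recv() calls, here is a rough C++ sketch of the "is the message complete yet?" check. It is only an illustration, not a full parser: the function names (iequals, header_value, complete_message_size) are made up for this example, it looks at Content-Length regardless of the method (as the comments below suggest), it treats a request without Content-Length as having no body (which matches RFC 2616 section 4.3 for requests), and it does not handle Transfer-Encoding: chunked.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: case-insensitive comparison of two ASCII strings.
static bool iequals(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (std::tolower(static_cast<unsigned char>(a[i])) !=
            std::tolower(static_cast<unsigned char>(b[i])))
            return false;
    return true;
}

// Hypothetical helper: looks up header `name` in `head` (request line plus
// header lines, without the terminating blank line). Header names are
// case-insensitive. Returns false if the header is absent.
static bool header_value(const std::string& head, const std::string& name,
                         std::string& out) {
    std::size_t pos = head.find("\r\n");             // end of the request line
    while (pos != std::string::npos) {
        pos += 2;                                     // start of the next header line
        std::size_t end = head.find("\r\n", pos);
        std::string line =
            head.substr(pos, end == std::string::npos ? end : end - pos);
        std::size_t colon = line.find(':');
        if (colon != std::string::npos && iequals(line.substr(0, colon), name)) {
            std::size_t v = line.find_first_not_of(" \t", colon + 1);
            out = (v == std::string::npos) ? std::string() : line.substr(v);
            return true;
        }
        pos = end;
    }
    return false;
}

// Returns the size in bytes of the first complete message in `buf`
// (headers + blank line + body), or 0 if more data has to be received.
// Simplification: only the leading digits of Content-Length are used,
// and a malformed value is not reported as an error.
std::size_t complete_message_size(const std::string& buf) {
    std::size_t head_end = buf.find("\r\n\r\n");
    if (head_end == std::string::npos) return 0;     // header not complete yet
    std::size_t body_start = head_end + 4;
    std::size_t body_len = 0;
    std::string value;
    if (header_value(buf.substr(0, head_end), "Content-Length", value)) {
        for (char c : value) {
            if (!std::isdigit(static_cast<unsigned char>(c))) break;
            body_len = body_len * 10 + static_cast<std::size_t>(c - '0');
        }
    }
    return (buf.size() - body_start >= body_len) ? body_start + body_len : 0;
}
```

The idea is to append whatever each recv() returns to one buffer per connection and call complete_message_size() after every append. While it returns 0 you keep reading; once it returns n, the first n bytes are one whole request and anything after them belongs to the next one.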

Abyx
No, it does not depend on the method name. See http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p1-messaging-11.html#message.body for details.
Julian Reschke
Also, keep in mind that HTTP 1.1 requests do not need to use a `Content-Length` header, either. They can use `Transfer-Encoding: chunked` instead, in which case the message length is encoded inside the message data itself.
Remy Lebeau - TeamB
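
For the chunked case mentioned in the last comment, a rough C++ sketch of decoding a body sent with Transfer-Encoding: chunked could look like the following. The names ChunkResult and decode_chunked are invented for this example. It assumes the body starts at offset pos in the receive buffer, ignores chunk extensions and the contents of any trailer headers, and puts no upper limit on the chunk size, which a real server should do.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

enum class ChunkResult { Complete, NeedMore, Malformed };

// Tries to decode a chunked body that starts at buf[pos].
// On Complete, `body` holds the decoded data and `consumed` the number of
// bytes of `buf` used (including the last chunk and the final blank line).
// On NeedMore, the caller should receive more data and call again.
ChunkResult decode_chunked(const std::string& buf, std::size_t pos,
                           std::string& body, std::size_t& consumed) {
    std::string out;
    const std::size_t start = pos;
    for (;;) {
        std::size_t eol = buf.find("\r\n", pos);
        if (eol == std::string::npos) return ChunkResult::NeedMore;  // size line incomplete

        // The chunk size is hexadecimal; strtoul stops at ';', so chunk
        // extensions after the size are skipped.
        const std::string line = buf.substr(pos, eol - pos);
        char* endp = nullptr;
        unsigned long size = std::strtoul(line.c_str(), &endp, 16);
        if (endp == line.c_str()) return ChunkResult::Malformed;     // no hex digits at all

        pos = eol + 2;                                               // first byte of chunk data
        if (size == 0) {
            // Last chunk: optional trailer headers follow, ended by a blank line.
            std::size_t end = buf.find("\r\n\r\n", eol);
            if (end == std::string::npos) return ChunkResult::NeedMore;
            body = out;
            consumed = end + 4 - start;
            return ChunkResult::Complete;
        }
        if (buf.size() < pos + size + 2) return ChunkResult::NeedMore;  // data + CRLF incomplete
        if (buf.compare(pos + size, 2, "\r\n") != 0) return ChunkResult::Malformed;
        out.append(buf, pos, size);
        pos += size + 2;                                             // skip data and its CRLF
    }
}
```

This version re-parses from the start of the body on every call, which keeps the sketch short; for large uploads you would probably want to remember the position between calls instead.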