Hi all.

I'm parsing HTTP data directly from packets (you can assume the TCP stream has already been reassembled).

I'm looking for the best way to parse HTTP as accurately as possible.

The main issue here is the HTTP header.

Looking at the base HTTP/1.1 RFC (RFC 2616), it seems that HTTP header parsing would be complex. The RFC defines fairly intricate grammar rules (in augmented BNF) for the different parts of the header.

Should I write a regular expression for each of these rules to parse the different parts of the HTTP header?

The only parsing I've written so far covers the generic HTTP header form:

message-header = field-name ":" [ field-value ]

I also replace inner LWS with a single SP and combine repeated headers that share a field-name into one comma-separated value, as described in section 4.2.
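To make the generic step concrete, here is a minimal sketch in Python. The function name `parse_generic_headers` is hypothetical, and the sketch assumes the request or status line has already been removed and that lines end in CRLF. It collapses whitespace everywhere in the value, including inside quoted strings, which is a simplification.

```python
import re

def parse_generic_headers(raw: str) -> dict:
    """Hypothetical helper: applies only the generic rule
    message-header = field-name ":" [ field-value ]."""
    # Unfold continuation lines: CRLF followed by SP/HT continues the previous line.
    unfolded = re.sub(r"\r\n[ \t]+", " ", raw)
    headers = {}
    for line in unfolded.split("\r\n"):
        if not line:
            break  # an empty line ends the header section
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no colon: not a message-header; a real parser may report this
        key = name.strip().lower()  # field names are case-insensitive
        # Replace runs of SP/HT with a single SP, then trim both ends.
        value = re.sub(r"[ \t]+", " ", value).strip()
        if key in headers:
            # Repeated field-name: combine into one comma-separated value.
            headers[key] += ", " + value
        else:
            headers[key] = value
    return headers

raw = (
    "Host: example.com\r\n"
    "Accept: text/html\r\n"
    "Accept: application/json\r\n"
    "X-Long: a\r\n   b\r\n"
    "\r\n"
    "body"
)
result = parse_generic_headers(raw)
```

Everything after the colon is kept as an opaque string here; nothing in this step understands the structure of any particular field-value.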

However, section 14.9 (Cache-Control), for example, shows that parsing the individual parts of a field-value needs a much more complex scheme.

How would you suggest handling the complex parts of HTTP parsing (specifically the field-value), assuming I want to give the parser's users the full capabilities of HTTP and parse every part of it?

Design suggestions for this would also be appreciated.

Thanks.

+2  A: 

I would follow the Single Responsibility Principle. Rather than trying to create a single monolithic parser that knows every detail of every HTTP header known to man, go simpler. Write a simple, extensible parser that is itself responsible only for parsing the field name and associating that name with the raw value. Then make use of pluggable extensions, each responsible for parsing a single kind of header. When you create an instance of your parser, inject a collection of extensions, and map each extension to the set of field names that it knows how to parse.

You kill two birds with one stone with this approach. Your core parser remains simple and targeted. You also gain the ability to extend your parser without having to mess around with its guts, which results in more robust code.
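A rough sketch of that design in Python might look like the following. All names (`HeaderParser`, `parse_cache_control`, `parse_content_length`) are hypothetical, and the Cache-Control extension is deliberately naive: it assumes no commas appear inside quoted strings, which the RFC grammar does permit. The input is assumed to be a mapping of field names to raw values produced by the core parser.

```python
from typing import Any, Callable, Dict

# A field parser (extension) turns one raw field-value into a structured value.
FieldParser = Callable[[str], Any]

def parse_cache_control(value: str) -> Dict[str, Any]:
    """Naive extension: split directives on commas (assumes no quoted commas)."""
    directives: Dict[str, Any] = {}
    for part in value.split(","):
        name, sep, arg = part.strip().partition("=")
        if name:
            # Directives without "=" become flags; others keep their argument.
            directives[name.lower()] = arg.strip('"') if sep else True
    return directives

def parse_content_length(value: str) -> int:
    return int(value.strip())

class HeaderParser:
    """Core parser: knows nothing about individual headers, only dispatches."""

    def __init__(self, extensions: Dict[str, FieldParser]):
        # Field names are case-insensitive, so normalize the keys once.
        self._extensions = {name.lower(): fn for name, fn in extensions.items()}

    def parse(self, raw_headers: Dict[str, str]) -> Dict[str, Any]:
        parsed: Dict[str, Any] = {}
        for name, raw in raw_headers.items():
            ext = self._extensions.get(name.lower())
            # Headers without an extension keep their raw string value.
            parsed[name.lower()] = ext(raw) if ext else raw
        return parsed

parser = HeaderParser({
    "Cache-Control": parse_cache_control,
    "Content-Length": parse_content_length,
})
out = parser.parse({
    "cache-control": 'no-cache, max-age=3600, private="Set-Cookie"',
    "content-length": "42",
    "x-custom": "raw",
})
```

Supporting a new header then means writing one more small function and registering it, without touching `HeaderParser` itself.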

jrista