After searching here, I found next to no questions on developing a web server.

I'm mainly going to be doing this for two reasons: as a side project, and to learn more about developing a server program. This is not going to turn into a usable application; it's more of a learning tool.

So the questions are simple.

  • Have you developed a web server? (no matter what language)
  • What gotchas should I watch for, and what other tips can you offer?

Links to helpful sites are welcome, but don't link to a working project that is open source, since this is about the process of learning.

A: 

I was thinking of starting the same project as a way to learn Python better. There's a BaseHTTPServer module (http.server in Python 3) that's a pretty good starting point.

Here are some tutorial-style articles: 1 & 2
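For anyone taking the BaseHTTPServer route, here is a minimal sketch using the Python 3 module name, http.server. The handler name, the response body, and the port default are placeholders made up for illustration, not anything from the linked articles.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a short plain-text body."""

    def do_GET(self):
        body = b"Hello from a toy server\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def make_server(port=8000):
    # Port 0 asks the OS for any free port; 8000 is an arbitrary default.
    return HTTPServer(("127.0.0.1", port), HelloHandler)
```

Calling make_server().serve_forever() runs it until interrupted. The library does the socket handling and request parsing for you, which is exactly the part a from-scratch project would replace.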

Evan Meagher
Yes, that is a good starting point, but what I would like to do is do everything from scratch... creating sockets and listeners. Thanks though.
Ólafur Waage
+1  A: 

The networking and so on is pretty standard fare, so don't worry so much about that. (There are several "instant" sample network servers in almost any language.)

Instead, focus on actually implementing the HTTP specification. You'll be amazed at a) what you don't know and b) how many things that are supposed to be HTTP compliant really aren't, but fake it well.

Then you'll marvel that the web works at all.

When you're done with HTTP, enjoy trying to implement IMAP.

Will Hartung
+1  A: 

I wrote a light webserver in Python a few years back, also as a learning project.

The simplest piece of advice I can give, especially for a learning project, is to build a core that works, then iterate on top of that. Don't aim for the moon right off the hop: start very small, then add features, refine, and continue. I would recommend using a tool that encourages experimentation, like Python, where you can literally type and test code at the same time.

Serapth
+4  A: 

Firstly, please don't let this become a usable project - getting security right for web servers is really hard.

Ok, here are things to keep in mind:

  1. The thread that accepts connections needs to hand off to background threads as soon as possible.
  2. You can't have a thread for every single connection - with large volumes you'll run out of your thread limit.
  3. Use some kind of a worker thread pool to handle your requests.
  4. Ensure that you scrub the URL when you get an HTTP GET request, so that nobody can request something like http://localhost/../../Users/blah/ to reach files outside the directory you meant to serve.
  5. Ensure you always set the relevant content and mime types.
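Points 4 and 5 can be sketched in a few lines of Python. This is one possible approach, not a complete defence: resolve_path and content_type_for are hypothetical helper names, and the code assumes the served files live under a single document root.

```python
import mimetypes
import os
from urllib.parse import unquote, urlsplit


def resolve_path(doc_root, url):
    """Map a request URL to a file under doc_root, or None if it escapes."""
    root = os.path.realpath(doc_root)
    # Drop any query string and decode %xx escapes such as %2e%2e ("..").
    rel = unquote(urlsplit(url).path).lstrip("/")
    candidate = os.path.realpath(os.path.join(root, rel))
    # realpath collapses ".." and symlinks, so the containment check
    # below sees the real target rather than the spelling in the URL.
    if os.path.commonpath([root, candidate]) != root:
        return None
    return candidate


def content_type_for(path):
    """Guess a Content-Type from the file extension, with a generic fallback."""
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or "application/octet-stream"
```

A request for /../../Users/blah/ resolves to a path outside the root and comes back as None, which the server would turn into a 403 or 404.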

Good luck - this is a hell of a job.

rein
After working on the web as a site developer, I know well about the security of the matter :) Nice answer though.
Ólafur Waage
Knowing about the risks of exposing executable code to the internet puts you way ahead of most developers. :)
rein
+5  A: 

A web server starts out as being an extremely simple piece of code:

  • open a TCP/IP socket on port 80
  • while not terminated
    • wait for connections on that socket
    • when someone sends you HTTP headers
      • find the path to the file
      • copy the file to the socket

So the outline of the code is easy.
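As a rough Python sketch of that outline, using raw sockets only. Two simplifications are assumptions of this example rather than part of the outline: "files" are looked up in an in-memory dict instead of on disk, and the default port is 8080 because binding to port 80 usually needs elevated privileges. All function names are made up for illustration.

```python
import socket


def handle_connection(conn, files):
    """Read one request, send one response, close. `files` maps path -> bytes."""
    with conn:
        data = b""
        # Keep reading until the blank line that ends the request headers.
        while b"\r\n\r\n" not in data:
            chunk = conn.recv(4096)
            if not chunk:
                return  # client went away before finishing the request
            data += chunk
        request_line = data.split(b"\r\n", 1)[0].decode("latin-1")
        parts = request_line.split()
        path = parts[1] if len(parts) >= 2 else ""
        body = files.get(path)
        if body is None:
            status, body = "404 Not Found", b"not found\n"
        else:
            status = "200 OK"
        head = f"HTTP/1.0 {status}\r\nContent-Length: {len(body)}\r\n\r\n"
        conn.sendall(head.encode("ascii") + body)


def serve(listener, files, max_requests=None):
    """Accept connections one at a time; stop after max_requests if given."""
    served = 0
    while max_requests is None or served < max_requests:
        conn, _addr = listener.accept()
        handle_connection(conn, files)
        served += 1


def make_listener(port=8080):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", port))
    listener.listen(5)
    return listener
```

Something like serve(make_listener(), {"/": b"hello\n"}) is enough to point a browser at. Because serve handles one connection at a time, it shows the blocking problem directly: a slow client holds up everyone else.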

Now, you have some complexities to handle:

  • in the simplest version of the code, while you're talking to one browser, all the others can't connect. You need to come up with some way of handling multiple connections.
  • it's often convenient to be able to send out something more than just a static file (although the first HTTP servers did exactly that) so you need to be able to run other programs.

Handling the possibility of multiple connections is also relatively easy, with a number of possible choices.

  • the simplest version (again, this is the way it was done originally) is to have the code that listens to port 80 set up a specific socket for that connection, then fork a copy of itself to handle that one connection. That process runs until the socket is closed, and then terminates. However, that's relatively expensive: a fork takes tens of milliseconds in general, so that limits how fast you can run.
  • The second choice is to create a lightweight process — a/k/a a thread — to process the request.
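A sketch of the thread option in Python, using a fixed-size pool from the standard library rather than one new thread per connection (the pool is a variation on the second choice, not something the list above prescribes). The handler here is a hypothetical stand-in that sends a canned response.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def serve_with_pool(listener, handler, workers=8, max_requests=None):
    """Accept on the calling thread; run handler(conn) on a fixed pool of workers."""
    accepted = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while max_requests is None or accepted < max_requests:
            conn, _addr = listener.accept()
            # Hand off immediately so the accept loop can take the next client.
            pool.submit(handler, conn)
            accepted += 1
    # Leaving the with-block waits for in-flight handlers to finish.


def fixed_response_handler(conn):
    """Hypothetical handler: ignores the request and sends a fixed response."""
    with conn:
        conn.recv(4096)
        body = b"handled by a pool thread\n"
        conn.sendall(b"HTTP/1.0 200 OK\r\nContent-Length: %d\r\n\r\n" % len(body) + body)
```

With more simultaneous clients than workers, the extra connections wait in the pool's queue instead of each getting a thread, which bounds resource use at the cost of some latency.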

Running a program is actually fairly easy too. In general, you define a special path to a CGI directory; a URL that has a path through that directory then interprets the path name as the path to a program. The server would then create a subprocess using fork/exec, with STDOUT connected to the socket. The program then runs, sending output to STDOUT, and it is sent on to the client browser.
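In Python the same idea might look like the sketch below, where subprocess stands in for fork/exec. It assumes a POSIX system (a socket's file descriptor is passed directly as the child's stdout), and run_cgi is a hypothetical name. Mapping the URL to a program and validating it are left out entirely.

```python
import subprocess


def run_cgi(conn, argv):
    """Run argv as a child process with its stdout attached to the socket."""
    # The server supplies the status line; the program prints headers and body.
    conn.sendall(b"HTTP/1.0 200 OK\r\n")
    # Passing a file descriptor as stdout is the subprocess equivalent of
    # fork/exec with STDOUT redirected. Assumes a POSIX system.
    completed = subprocess.run(argv, stdout=conn.fileno(), stdin=subprocess.DEVNULL)
    return completed.returncode
```

The server never sees the program's output; it goes from the child straight to the client, so the server cannot fix up headers or report an error once the program has started writing.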

This is the basic pattern; everything else a web server does is just adding frills and additional functionality to this basic pattern.

Here are some other sources for example code:


It does pretty much nothing of what you really wanted, but for simplicity it's hard to beat this one from http://www.commandlinefu.com (on Python 3 the equivalent is python -m http.server):

$ python -m SimpleHTTPServer

Charlie Martin
I actually have an even simpler version in python already, amazed how easy it was. It's just sending a static header and content.
Ólafur Waage
Very nice addition to the answer.
Ólafur Waage
A: 

The course I TAed had a proxy assignment so I can kind of shed some light here, I think.

So, you're going to end up doing a lot of header changing just to make your life easier. Namely, HTTP/1.0 is wayyy easier to deal with than HTTP/1.1. You don't want to have to deal with managing timeouts and keep-alives and stuff like that. One connection per transaction is easiest.

You're going to be doing lots and lots of parsing. Parsing is hard in C. I'd advise you to write a function that is something like

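/* readNextCharFromSocket() and s are placeholders for your own socket code;
   the helper is assumed to return the next byte, or 0 at end of stream. */
int readline(char *buff, int maxLen) {
    int i = 0;
    char c;
    /* check the length first so a byte is never read and then dropped */
    while (i < maxLen - 1 && (c = readNextCharFromSocket(&s)) && c != '\n')
        buff[i++] = c;
    buff[i] = '\0';  /* terminate so the C string functions can be used */
    return i;
}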

and handle it one line at a time, solely because it's easiest to use the existing C string functions on one line at a time. Also, remember lines are \r\n separated and headers are terminated with a \r\n\r\n.
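For comparison, here is the same line-at-a-time idea sketched in Python, where splitting on \r\n does most of the work. parse_request_head is a hypothetical name, and the sketch assumes the whole header block has already been read into one bytes object.

```python
def parse_request_head(raw):
    """Split raw request bytes into (method, target, version, headers)."""
    # Everything before the first blank line is the request line plus headers.
    head, _sep, _rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")
    method, target, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _colon, value = line.partition(":")
        # Header names are case-insensitive, so normalise them for lookup.
        headers[name.strip().lower()] = value.strip()
    return method, target, version, headers
```

A malformed request line makes the three-way unpacking raise ValueError, which a server would catch and answer with 400 Bad Request.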

The main hard thing will be parsing. So long as you can read files, everything else will work as expected.

For debugging, you'll probably want to print out headers that are passed around to sanity test them when stuff breaks.

Alex Gartrell