I want to write a web proxy for exercise, and this is the code I have so far:


// returns a map that contains the port and the host
def parseHostAndPort(String data) {
    def objMap // this has host and port as keys
    data.eachLine { line ->
        if(line =~ /^(?i)get|put|post|head|trace|delete/) {
            println line
            def components = line.split(" ")
            def resource = components[1]
            def colon = resource.indexOf(":")
            if(colon != -1) {
                URL u = new URL(resource)
                def pHost = u.host
                def pPort = u.port
                return (objMap = [host:pHost,port:pPort])
            }
            else {
                return (objMap = [host:resource,port:80])
            }
        }
    }
    return objMap
}

// reads a http request from a client
def readClientData(Socket clientSocket) {
    def actualBuffer = new StringBuilder()
    InputStream inStream = clientSocket.inputStream
    while(true) {
        def available = inStream.available()
        if(available == 0)
        break;
        println "available data $available"
        def buffer = new byte[available]
        def bytesRead = inStream.read(buffer,0,available)
        actualBuffer << new String(buffer)
    }
    return actualBuffer.toString()
}

def sock = new ServerSocket(9000)
sock.reuseAddress = true
while(true) {
    sock.accept { cli ->
        println "got a client"
        def data = readClientData(cli)
        def parsed = parseHostAndPort(data)
        def host = parsed["host"]
        def port = parsed["port"]

        println "got from client $data"

        def nsock = new Socket(host,port)
        nsock << data // send data received from client to the socket
        nsock.outputStream.flush() 
        def datax = readClientData(nsock)
        println "got back $datax"
        cli << datax // send the client the response
        cli.outputStream.flush()
        cli.close()
    }
}

Right now, all it does is:

  • read the HTTP request my browser sends

  • parse the host and port

  • connect to that host, and write the data received from the client

  • send the client back the data received from the host

But ... it doesn't work all the time. Sometimes it will make a good request, sometimes not. I think it's a buffering issue, I'm not sure. The thing is, I added flush calls, and still nothing.

Can you spot what I'm doing wrong?

EDIT:

  • I noticed that if I add some sleep calls, the proxy seems to "work" on a higher number of requests, but not all of them.
  • to collect the bounty, help me find out what I'm doing wrong. What's the normal "algorithm" used for a web proxy? Where am I deviating from it? Thanks!
A: 

I suggest you familiarise yourself with the HTTP protocol specification. HTTP is more complicated than a single request-response over a separate TCP connection - i.e. your implementation will fail if either the client or the server tries to use persistent connections.

cmeerw
I understand what you're saying, but this is not the case.
Geo
The HTTP spec is still worth a read if you're trying to implement plumbing such as a proxy...
AviD
A: 

Could there be a race condition in readClientData(Socket)? It looks like you are immediately checking whether data is available, but it is possible that the data has not yet been received; you will simply drop out of the loop rather than wait for the first data to be received.

Jonathan
how would I wait until the data is available?
Geo
I'm not sure what the best method would be for HTTP 1.1 (which allows persistent connections), but for HTTP 1.0, you can just read until you hit the end of the stream.
Jonathan
A: 

Is the client socket blocking? If so, you may want to try non-blocking I/O or set a socket timeout.

coledot
how can you do non-blocking I/O in java?
Geo
Check out java.nio.channels.SocketChannel
Jonathan
+4  A: 

Jonathan was on the right track. The problem is partly your use of available(). The method available() doesn't say "is it done?"; it says "is there currently any data available?". Immediately after you've made your request there won't be any data available, and depending on network timing that can happen in the middle of a transfer too. It doesn't mean that no more is coming, so your break is premature.

Also, the InputStream.read(byte[] ...) family of methods is always allowed to return fewer bytes than you ask for. The array length or offset,length pair constrains the maximum, but you can always get less. So, this code of yours:

    def buffer = new byte[available]
    def bytesRead = inStream.read(buffer,0,available)
    actualBuffer << new String(buffer)

could create a big array, get it only half filled by the read, and still append the full buffer (including its trailing, never-written elements) onto the String.

Here's a revision that relies on the fact that InputStream.read(...) will never return unless it's end of stream or there's some data available (but not necessarily as much as you asked).

// reads a http request from a client
def readClientData(Socket clientSocket) {
    def actualBuffer = new StringBuilder()
    InputStream inStream = clientSocket.inputStream
    int bytesRead = 0;
    byte[] buffer = new byte[16 * 1024];
    while((bytesRead = inStream.read(buffer)) >= 0) { // -1 on EOF
        def bytesRead = inStream.read(buffer,0,bytesRead); // only want newly read bytes
        actualBuffer << new String(buffer)
    }
    return actualBuffer.toString()
}

That said, you've got a few other problems too:

  • you're pulling the whole response into memory, when you should be copying it in a byte-pump loop directly into the client's response output stream (what happens if it's a multi-gigabyte response?)
  • you're using Strings to store binary data, which assumes that all the bytes survive a round trip through the default character encoding. That might hold for text in UTF-8 or US-ASCII, but it isn't going to work with other locales or with binary content
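
The byte-pump loop mentioned above could look roughly like the sketch below. It is written in plain Java, using the same java.io classes the Groovy code relies on. BytePump and copy are hypothetical names, not part of the original code. Note that it copies until EOF, so it only suits a stream whose sender closes the connection when done.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper for illustration; not part of the original proxy.
class BytePump {
    /** Copies bytes from in to out until EOF and returns how many were copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[16 * 1024];
        long total = 0;
        int bytesRead;
        while ((bytesRead = in.read(buffer)) >= 0) { // read() returns -1 on EOF
            out.write(buffer, 0, bytesRead);          // forward only the bytes just read
            total += bytesRead;
        }
        out.flush();
        return total;
    }
}
```

Because the data stays as bytes and only bytesRead bytes are forwarded per iteration, nothing is held in memory beyond one buffer and no character encoding is involved.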
Ry4an
thanks for the tips. However,if I don't check for the available data, the `readClientData` blocks until I hit stop on my browser. And if I check for available data, I'm right back to where I started.
Geo
Checking for .available() is definitely incorrect -- it makes no guarantees and is not okay for loop control. @cmeerw was pointing out that your browser isn't closing the socket because it's leaving it open for connection re-use. Your proxy should detect end of request not by waiting for EOF (or .available() == 0) but by correctly parsing HTTP requests. Unbodied requests (GET, HEAD, etc.) end with "`\r\n\r\n`" and bodied requests (POST, PUT, etc.) end according to provided content lengths and boundaries. You need to actively read your client requests to know when they end.
Ry4an
Yikes! You're also entirely forgetting to send back the HTTP response headers received from the server. They're not contained in the `nSock.inputStream`. They provide the browser with both the status code (important) and the Content-Length which helps the browser know that no more data is coming.
Ry4an
Not to mention all the other important headers! For example, Set-Cookie, Location (for 302), WWW-Authenticate (for 401), and more...
AviD
Hrm, since you're using a raw Socket rather than the (better) URLConnection to the requested URL, you are in fact passing the headers back, so the problem is definitely in the requests whose reading you terminate too early.
Ry4an
+1  A: 

Ry4an makes some good points. If you want to see how a small but perfectly formed proxy is constructed, look at Tiny HTTP Proxy which is written in Python - you can see all the issues which need to be addressed, and it would be fairly straightforward to port the code to Groovy. I've used the proxy for test purposes and it works well.

Vinay Sajip
+3  A: 

First, it's really difficult to know what exactly is going wrong here - "Sometimes it will make a good request, sometimes not." doesn't really describe what's happening when the problem occurs!!

That said, I was still able to figure out what's going wrong for you.

As you've said already, you're looking for the most basic solution that'll work consistently, so I'll avoid anything unnecessary or getting into the efficiency or otherwise of your code. Also, I'll give you the answer first and then describe what's causing the problem (it's long, but worth reading :)

Solution

The simple answer to your problem is that you need to do some HTTP protocol parsing to figure out if all of the data has been sent by the client and not rely on what available() or read() return. How much of a PITA this is depends on how completely you wish to support the HTTP protocol. To support GET requests, it's pretty easy. It's a little harder to support POSTs that specify a content length. It's much harder to support "other" encoding types (e.g. chunked or multipart/byteranges; see http://tools.ietf.org/html/rfc2616#section-4.4).

Anyway, I assume you're just trying to get GETs working, so to do that, you have to know that HTTP headers and bodies are separated by an "empty line", that HTTP's line delimiter is \r\n and that GETs do not have a body. Therefore a client has finished sending a GET request when it transmits \r\n\r\n.

Some code like this should handle GETs consistently for you (code is untested but it should get you to at least 90%):

import java.nio.charset.Charset

def readClientData(Socket clientSocket) {

    def actualBuffer = new StringBuilder()
    def eof = false

    // CR LF CR LF as byte values, so they compare cleanly with the bytes read
    def emptyLine = [13, 10, 13, 10]
    def lastEmptyLineChar = 0

    InputStream inStream = clientSocket.inputStream
    def buffer = new byte[16 * 1024]
    while(!eof) {
        // read() blocks until at least one byte arrives, so there is no
        // need to poll available(); it returns -1 if the client closes the socket
        def bytesRead = inStream.read(buffer)
        if( bytesRead < 0 ) break
        println "read $bytesRead bytes"

        // check for empty line: 
        //    * iterate through the buffer until the first element of emptyLine is found
        //    * continue iterating through buffer checking subsequent elements of buffer with emptyLine while consecutive elements match
        //    * if any element in buffer and emptyLine do not match, start looking for the first element of emptyLine again as the iteration through buffer continues
        //    * if the end of emptyLine is reached and matches with buffer, then the emptyLine has been found
        for( int i=0; i < bytesRead && !eof; i++ ) {
            if( buffer[i] == emptyLine[lastEmptyLineChar] ){
                lastEmptyLineChar++
                eof = lastEmptyLineChar >= emptyLine.size()
            }
            else {
                // the mismatching byte may itself be the '\r' that starts a new match
                lastEmptyLineChar = (buffer[i] == emptyLine[0]) ? 1 : 0
            }

        }

        // changed this so that you avoid any encoding issues
        actualBuffer << new String(buffer, 0, bytesRead, Charset.forName("US-ASCII"))
    }
    return actualBuffer.toString()
}

For POSTs, you need to add to this by also looking for the "Content-Length: " header (header names are case-insensitive) and parsing the value after it. This value is the size of the HTTP body (i.e. the bit that comes after the \r\n\r\n end-of-header mark) in octets, i.e. bytes. So when you encounter the end of the header, you just need to count that number of octets and you know that the POST request has completed transmission.
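
As a rough illustration of that check, here is a minimal Java sketch (the same java.io and java.lang classes are available from Groovy). RequestFraming and its method names are made up for this example, and it assumes a body framed only by Content-Length; chunked or multipart bodies are not handled.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; assumes Content-Length framing only.
class RequestFraming {
    /** Index just past the first "\r\n\r\n" in data[0..length), or -1 if the headers are incomplete. */
    static int endOfHeaders(byte[] data, int length) {
        for (int i = 0; i + 3 < length; i++) {
            if (data[i] == '\r' && data[i + 1] == '\n' && data[i + 2] == '\r' && data[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    /** Value of the Content-Length header (name matched case-insensitively), or 0 if absent. */
    static int contentLength(String headers) {
        for (String line : headers.split("\r\n")) {
            int colon = line.indexOf(':');
            if (colon > 0 && line.substring(0, colon).trim().equalsIgnoreCase("Content-Length")) {
                return Integer.parseInt(line.substring(colon + 1).trim());
            }
        }
        return 0;
    }

    /** True once data[0..length) holds the complete headers plus Content-Length octets of body. */
    static boolean isComplete(byte[] data, int length) {
        int bodyStart = endOfHeaders(data, length);
        if (bodyStart < 0) {
            return false; // still waiting for the empty line
        }
        String headers = new String(data, 0, bodyStart, StandardCharsets.US_ASCII);
        return length - bodyStart >= contentLength(headers);
    }
}
```

A read loop would append each chunk to a byte buffer and keep reading until isComplete returns true. A request without a Content-Length header, such as a typical GET, is complete as soon as the empty line arrives.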

You'll also need to determine the type of request (GET, POST etc.) - you can do this by inspecting the characters transmitted before the first space.
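
Extracting the method can be sketched in a few lines of Java. RequestLine and both method names are hypothetical, and treating only POST and PUT as body-carrying is a simplifying assumption for a proof of concept.

```java
// Hypothetical helper for illustration; not part of the original proxy.
class RequestLine {
    /** Returns the characters before the first space of the request, e.g. "GET". */
    static String method(String request) {
        int space = request.indexOf(' ');
        if (space <= 0) {
            throw new IllegalArgumentException("no method found in request line");
        }
        return request.substring(0, space);
    }

    /** Simplifying assumption: only POST and PUT are expected to carry a body. */
    static boolean usuallyHasBody(String method) {
        return method.equals("POST") || method.equals("PUT");
    }
}
```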

Problem

Your problem is that your readClientData function doesn't always read all of the data sent by the client. As a result, you're sometimes sending a partial request to the server, and the server returns some kind of error. You should see incomplete requests printed to standard out if you replace

println "available data $available"

with

println(new String(buffer))

(placed after the read() call) in the readClientData function.

Why is this happening? It's because available() only tells you what's currently available to be read from the InputStream and not whether or not the client has sent all the data it's going to send. An InputStream, by its very nature, can never actually tell whether or not there will be more data (the exception to this is if there is no more underlying data to read - e.g. a socket is closed, the end of the array or file has been reached, etc. - this is the only time read() will return -1 (i.e. EOF)). Instead it's up to higher level code to decide whether it should read more data from the stream and it makes this decision based on application-specific rules that apply to the application-specific data being read by the InputStream.
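
The difference between "nothing available right now" and "end of stream" can be seen without any network at all. The sketch below is plain Java and uses an in-memory pipe as a stand-in for the socket; AvailableDemo and observe are made-up names for this illustration.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: a pipe stands in for the client socket.
class AvailableDemo {
    /** Returns { available() before any write, available() after a write, bytes read, read() after close }. */
    static int[] observe() throws IOException {
        PipedOutputStream sender = new PipedOutputStream();
        PipedInputStream receiver = new PipedInputStream(sender);

        int before = receiver.available();  // sender has not written yet, but the stream is still open
        sender.write("GET /".getBytes(StandardCharsets.US_ASCII));
        int after = receiver.available();   // now the five bytes are buffered

        sender.close();                     // only closing produces a real end of stream
        byte[] buffer = new byte[16];
        int bytesRead = receiver.read(buffer); // drains the buffered bytes
        int atEof = receiver.read(buffer);     // nothing left and sender closed
        return new int[] { before, after, bytesRead, atEof };
    }
}
```

The first available() call reports zero even though data is about to arrive, which is the situation the proxy misreads as "request finished". Only after the sender closes does read() report end of stream.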

In this case, the application is HTTP, so you need to understand the basics of the HTTP protocol before you'll get this working (cmeerw, you were on the right track).

When an HTTP request is made by a client, the client opens a socket to the server and sends a request. The client only closes the socket as a result of a timeout, or the underlying network connection being disconnected, or in response to user action that requires that the socket is closed (application is closed, page is refreshed, stop button pushed etc.). Otherwise, after sending the request, it just waits for the server to send a response. Once the server has sent the response, the server closes the connection [1].

Where your code succeeds, data is being provided by the client quickly and consistently enough so that the InputStream receives additional data between your invocation of read() and your subsequent invocation of available() on the next iteration of the loop (remember that InputStream is being provided with data "in parallel" to your code that's invoking its read() method). Now in the other case, where your code fails, no data has yet been provided to InputStream, so when your code invokes available(), InputStream correctly returns 0 since no further data has been provided to it since you invoked read() and therefore it has 0 bytes available for you to read(). This is the race condition that Jonathan's talking about.

Your code assumes that when available() returns 0 that all data has been sent by the client when, in fact, sometimes it has, and sometimes it has not (so sometimes you get a "good request" and other times not :).

So you need something better than available() to determine whether or not the client has sent all of the data.

Checking for EOF when you invoke read() (see Ry4an's answer [2]) isn't suitable either. It should be clear why this is the case - the only time read() is supposed to return EOF (-1) is when the socket is closed. This isn't supposed to happen until you've forwarded the request to the target server, received a response and sent that response to the client, but we know it can also exceptionally be closed by the client. In fact you're seeing this behaviour when you run the sample code - the proxy hangs until the stop button is clicked in the browser, causing the client to close the connection prematurely.

The correct answer, which you now know, is to do some parsing of the HTTP and use that to determine the state of the connection.

Notes
[1] It's beyond a proof of concept proxy, but since it was touched on already, if the HTTP connection is "keep-alive" the server will keep the connection open and wait on another request from the client
[2] There's an error in this code that causes readClientData to mangle the data:

byte[] buffer = new byte[16 * 1024];
while((bytesRead = inStream.read(buffer)) >= 0) { // -1 on EOF
    def bytesRead = inStream.read(buffer,0,bytesRead); 
    actualBuffer << new String(buffer)
}

The second inStream.read() invocation completely overwrites the data read by the first invocation of inStream.read(). Also bytesRead is being redefined here (not familiar enough with Groovy to know whether or not this would be an error). This line should either read:

bytesRead = bytesRead + inStream.read(buffer,bytesRead,buffer.length-bytesRead);

or be removed entirely. Either way, the append should then use only the bytes actually read, i.e. new String(buffer, 0, bytesRead), rather than the whole buffer.

Alan Donnelly
thanks for the very detailed answer.
Geo