views:

101

answers:

6

Hi there,

I'm receiving from socket A and writing that to socket B on the fly (like a proxy server might). I would like to inspect and possibly modify data passing through. My question is how to handle border cases, ie where the regular expression I'm searching for would match between two successive socket A read and socket B write iterations.

char buffer[4096]
int socket_A, socket_B

/* Setting up the connection goes here */

for(;;) {

    recv(socket_A, buffer, 4096, 0);

    /* Inspect, and possibly modify buffer */

    send(socket_B, buffer, 4096, 0);

    /* Oops, the matches I was looking for were at the end of buffer,
     * and will be at the beginning of buffer next iteration :( */

}
+1  A: 

My suggestion: have two buffers, and rotate between them:

  1. Recv buffer 1
  2. Recv buffer 2
  3. Process.
  4. Send buffer 1
  5. Recv buffer 1
  6. Process, but with buffer 2 before buffer 1.
  7. Send buffer 2
  8. Goto 2.

Or something like that?

Douglas Leeder
+1  A: 

Assuming you know the maximum length M of the possible regular expression matches (or can live with an arbitrary value - or just use the whole buffer), you could handle it by not passing on the full buffer but keep M-1 bytes back. In the next iteration put the new received data at the end of the M-1 bytes and apply the regular expression.

If you know the format of the data transmitted (e.g. http), you should be able to parse the contents to know when you reached the end of the communication and should send out the trailing bytes you may have cached. If you do not know the format, then you'd need to implement a timeout in the recv so that you do not hold on to the end of the communication for too long. What is too long is something that you will have to decide on your own,

dseifert
A: 

Basically, the problem with your code is that the recv/send loop is operating on a lower network layer than your modifications. How you solve this problem depends on what modifications you're making, but it probably involves buffering data until all local modifications can be made.

EDIT: I don't know of any regex library that can filter a stream like that. How hard this is going to be will depend on your regex and the protocol it's filtering.

Dave
+1  A: 

In that sense you're talking about (and all senses for, say, TCP) sockets are streams. It follows from your question that you have some structure in the data. So you must do something similar to the following:

  1. Buffer (hold) incoming data until a boundary is reached. The boundary might be end-of-line, end-of-record, or any other way that you know that your regex will match.
  2. When a "record" is ready, process it and place the results in an output buffer.
  3. Write anything accumulated in the output buffer.

That handles most cases. If you have one of the rare cases where there's really no "record" then you have to build some sort of state machine (DFA). By this I mean you must be able to accumulate data until either a) it can't possibly match your regex, or b) it's a completed match.

EDIT: If you're matching fixed strings instead of a true regex then you should be able to use the Boyer-Moore algorithm, which can actually run in sub-linear time (by skipping characters). If you do it right, as you move over the input you can throw previously seen data to the output buffer as you go, decreasing latency and increasing throughput significantly.

dwc
+1  A: 

You need to know and/or say something about your regular expression.

Depending on the regular expression, you might need to buffer a lot more than you are buffering now.

A worst case scenario might be something like a regular expression which says, "find everything, starting from the begining up until the first occurence of the word 'dog', and replace that with something else": if you have a regular expression like that, then you need to buffer (without forwarding) everything from the begining until the first occurence of the word 'dog': which might never happen, i.e. might be an infinite amount to buffer.

ChrisW
A: 

One alternative is to use poll(2)-like strategy with non-blocking sockets. On read event grab a buffer from the socket, push it onto incoming queue, call the lexer/parser/matcher that assembles the buffers into a stream, then pushes chunks onto the output queue. On write event, take a chunk from the output queue, if any, and write it into the socket. This sounds kind of complicated, but it's not really once you get used to the inverted control model.

Nikolai N Fetissov