I need to make a web proxy that adds annotations to the HTML of web pages that pass through.

I'm hoping some decent software exists that can handle the HTTP proxying part of the application, so that, for the most part, I only have to worry about making a function that sits on the stream with a signature similar in intent to

void process(InputStream incomingHTML, OutputStream outgoingHTML);
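To make the intent concrete, here is a minimal sketch of what such a function might look like. Everything beyond the signature is an assumption for illustration: the `HtmlAnnotator` class name, the placeholder `NOTE` comment standing in for the real annotations, UTF-8 input, and buffering the whole page in memory rather than truly streaming it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class HtmlAnnotator {
    // Hypothetical placeholder; the real annotations are not specified here.
    static final String NOTE = "<!-- annotated -->";

    // Reads the whole page (assumed UTF-8) and inserts NOTE just before the
    // last </body> tag, or appends it if no such tag is found.
    static void process(InputStream incomingHTML, OutputStream outgoingHTML) {
        try {
            String html = new String(incomingHTML.readAllBytes(), StandardCharsets.UTF_8);
            int at = lastIndexOfIgnoreCase(html, "</body>");
            String result = at >= 0
                    ? html.substring(0, at) + NOTE + html.substring(at)
                    : html + NOTE;
            outgoingHTML.write(result.getBytes(StandardCharsets.UTF_8));
            outgoingHTML.flush();
        } catch (IOException e) {
            // Wrapped so the method keeps the unchecked signature shown above.
            throw new UncheckedIOException(e);
        }
    }

    // Case-insensitive search that avoids lowercasing the whole document.
    static int lastIndexOfIgnoreCase(String text, String needle) {
        for (int i = text.length() - needle.length(); i >= 0; i--) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] page = "<html><body>Hello</body></html>".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        process(new ByteArrayInputStream(page), out);
        // Prints: <html><body>Hello<!-- annotated --></body></html>
        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

The proxy software would ideally hand `process` the response body of each HTML page and send whatever it writes back to the browser.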

The ability to redistribute the non-proxy portion of the code under a nonfree license is essential, and paying money for a commercial solution is possible. Also, the processing code needs to call some Java libraries.

So, do you know of anything that suits my needs?

@Stu Thompson: In fact, I am using Tomcat just to mock up behavior right now.

+2  A: 

I have varying familiarity with what I consider The Big Three open source proxy servers for *nix systems. Each has its own approach to the kind of functionality you are asking for, although I must say I've never done this myself.

  • Squid: Very mature and performant, although single-threaded
  • Apache httpd with mod_proxy: what I'm using now for reverse proxy work
  • Varnish Cache: The new kid on the block. Very cool and interesting, but arguably not stable enough for mission-critical production systems

BUT, they are each very C/*nix/systems-oriented. So creating custom directives or filters, or whatever each project calls its extension mechanism, is pretty straightforward, although detailed, work. But I don't think any of them would allow for decent, straightforward, fast Java integration. Perl? A C program? Sure...

If you are interested in having your proxy server only do this HTML work, and have no interest in the caching or authentication or whatever functionality that a proper caching server would provide, and your environment allows for it, you may want to consider a simple Java servlet approach:

  1. Your custom Java servlet in a servlet container, like Tomcat or Jetty or whatever, listens for requests,
  2. Uses a client library (like Jakarta Commons HttpClient) to pass the request on to the destination server,
  3. Receives the response from the destination server, and modifies it,
  4. And then the servlet returns the modified response to the client.
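
The four steps above can be sketched roughly as follows. To keep the example self-contained, it swaps in the JDK's built-in `com.sun.net.httpserver.HttpServer` and `java.net.http.HttpClient` (Java 11+) in place of a servlet container and Jakarta's client; the flow should carry over to a servlet's `doGet`. The `AnnotatingProxy` class, the fixed `UPSTREAM` host, port 8080, and the placeholder `annotate` function are all assumptions for illustration, and only GET requests with UTF-8 HTML are handled.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class AnnotatingProxy {
    // Hypothetical fixed destination; a real proxy would derive it per request.
    static final String UPSTREAM = "http://example.com";
    // Hypothetical placeholder for the real annotations.
    static final String NOTE = "<!-- annotated -->";
    static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Inserts NOTE before the first </body> tag, or appends it if none exists.
    static String annotate(String html) {
        String annotated = html.replaceFirst("(?i)(?=</body>)", NOTE);
        return annotated.equals(html) ? html + NOTE : annotated;
    }

    // Step 1: called by the server for each incoming request.
    static void handle(HttpExchange exchange) throws IOException {
        try {
            // Step 2: pass the request on to the destination server.
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create(UPSTREAM + exchange.getRequestURI())).GET().build();
            HttpResponse<byte[]> upstream =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofByteArray());

            // Step 3: modify the response, but only if it is HTML.
            String type = upstream.headers().firstValue("Content-Type")
                    .orElse("application/octet-stream");
            byte[] body = upstream.body();
            if (type.toLowerCase(Locale.ROOT).startsWith("text/html")) {
                String html = new String(body, StandardCharsets.UTF_8); // assumes UTF-8
                body = annotate(html).getBytes(StandardCharsets.UTF_8);
            }

            // Step 4: return the (possibly modified) response to the client.
            exchange.getResponseHeaders().set("Content-Type", type);
            if (body.length == 0) {
                exchange.sendResponseHeaders(upstream.statusCode(), -1); // no body
            } else {
                exchange.sendResponseHeaders(upstream.statusCode(), body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            exchange.sendResponseHeaders(502, -1);
        } finally {
            exchange.close();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/", AnnotatingProxy::handle);
        server.start();
    }
}
```

This sketch ignores request headers, cookies, redirects, and non-GET methods, all of which a usable proxy would have to forward.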

I sure hope you aren't doing anything evil with this system. :P

The first approach seems more 'correct' to me, even with the Java integration issues. The second seems easier, especially if the available skill sets and libraries tie you into a Java-centric approach. Anyway, that is my two cents.

Stu Thompson