I've been working on a solution for the financial industry. The main functionality of the application is the ability to load massive input files, digest them, update state in a persistent store, and generate extracts from that store on request. Pretty straightforward.
The input files are large, industry-standard XML messages (hundreds of megabytes or more) containing many repeated entries. The persistent store is a relational database. The engine is implemented as a POJO-based Java application (with the Spring Framework as its backbone), deployable on a J2EE application server.
The question is about the scalability and performance of the solution. If the application processes entries from the XML in sequence, scalability is rather poor: there is no way to engage more than one instance of the application in the processing of a single file. This is why I've introduced parallel processing of the entries from the input XML file.

Basically, the idea is to dispatch the processing of individual entries to workers from a pool, and I decided to use JMS for the dispatching. The component that loads the file reads the stream, extracts individual entries, and feeds the dispatching queue. On the other end of the queue there is a number of concurrent consumers; each picks one message off the queue, processes the entry, and is immediately available to process the next one. This is pretty similar to servlets within a web container.

What I find particularly powerful about this approach is that the workers can reside in separate instances of the application deployed on remote servers, as long as the queue is shared. Unfortunately, all workers connect to the same database that holds the persistent store, and this may become a bottleneck if the database server is not powerful enough to handle the load from the concurrent workers.
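To make the dispatching concrete, here is a minimal sketch of both ends, assuming a StAX-based loader and Spring's JmsTemplate over whatever JMS provider is in use; the Entry element name, the queue name, and the class names are placeholders, and the actual message format and persistence logic are omitted:

```java
// Sketch (placeholder names): the loader streams the file with StAX and pushes
// each <Entry> fragment onto the dispatch queue as a text message; a pool of
// JMS listeners consumes the fragments and updates the database.
import java.io.FileInputStream;
import java.io.StringWriter;

import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

import org.springframework.jms.core.JmsTemplate;

public class EntryDispatcher {

    private final JmsTemplate jmsTemplate;                 // wired by Spring
    private static final String QUEUE = "entry.dispatch";  // placeholder name

    public EntryDispatcher(JmsTemplate jmsTemplate) {
        this.jmsTemplate = jmsTemplate;
    }

    /** Streams the file and sends one JMS message per entry element. */
    public void dispatch(String path) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(path));
        Transformer copier = TransformerFactory.newInstance().newTransformer();
        copier.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        while (reader.hasNext()) {
            if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                    && "Entry".equals(reader.getLocalName())) {
                // Serialize just this entry's subtree; the identity transform
                // also advances the reader past it.
                StringWriter fragment = new StringWriter();
                copier.transform(new StAXSource(reader), new StreamResult(fragment));
                jmsTemplate.convertAndSend(QUEUE, fragment.toString());
            } else {
                reader.next();
            }
        }
        reader.close();
    }
}

/** One worker; the listener container runs several of these concurrently. */
class EntryWorker implements MessageListener {

    public void onMessage(Message message) {
        try {
            String entryXml = ((TextMessage) message).getText();
            // parse the entry and update the persistent store here
        } catch (Exception e) {
            throw new RuntimeException(e);  // let the broker redeliver the entry
        }
    }
}
```

The worker pool itself is just Spring's DefaultMessageListenerContainer (or a <jms:listener-container> definition) with concurrentConsumers set to the desired pool size, and the same listener bean can be deployed in several application instances against the shared queue.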
What is your opinion of this architecture? Have you had to design a similar application? What were your design choices then?