views:

867

answers:

6

I've been working on the solution for financial industry. The main functionality of the application is the ability to load massive input files, digest them, update state in persistent store and generate extracts from persistent store on request. Pretty straightforward.

The input files are industry standard formatted XML large (more that hundreds of megabytes) messages containing many repeated entries. The persistent storage is relational database. The engine has been implemented as POJO-based (Spring Framework as back-bone) Java application deployable on J2EE application server.

The question is about the scalability and performance of the solution. If the application processes entries from XML in sequence the scalability of the solution is rather poor. there is no way to engage more than one instance of the application into the processing of the single file. This is why I've introduced parallel processing for entries form input XML file. Basically the idea is to dispatch processing of individual entries for workers from the pool. I decided to use JMS for dispatching. The component that loads the file reads the stream and simply extracts single entries and feeds the dispatching queue. There is a number of concurrent consumers on the other end of the queue. Each picks one message of the queue and processes the entry and it's immediately available to process other entry. This is pretty similar to servlets within the web container. What I found particularly powerful about this approach is that the workers can reside within separate instances of the application deployed on remote servers as long as the queue is shared. Unfortunately all workers connect to the same database that maintains persistence storage and this might be a bottleneck if database server is not powerful enough to handle load from concurrent workers.

What is your opinion on this architecture? Did you have similar application to design? What was your design choice then?

+1  A: 

I think the architecture is generally sound. If the database is having trouble dealing with a high number of concurrent updates from the workers, you could introduce a 2nd queue on the other "side" of the app: as each worker completes their task, they add the results of that task to the queue. Then a single worker process periodically grabs the result objects from the 2nd queue and updates the database in a large batch operation? That would reduce database concurrency and might increase the efficiency of updates.

cliff.meyers
In a multi-tier system that you suggest, Pregzt will have to be careful about transactional integrity- if machine holding the queue crashes, for example, [s]he may lose some data. JMS does include transaction awareness, but the performance characteristics of that are implementation dependent.
joev
Actually I use XA glabal transactions that span across JMS Session and JDBC Connection. So, everything is transactional. Furthermore JMS messages are also flagged as persistent ones. Having this I can assume once and only once delivery characteristics.
pregzt
You could also use a tool like Terracotta which can transparently mirror the state of your JVM heaps to the hard disk and recover from system crashes.
cliff.meyers
+3  A: 

You can also have a look at Hadoop, a very handy platform for Map/Reduce jobs. The huge advantage is, that all infrastructure is provided by Hadoop, so you only apply new hardware nodes to scale. Implementing the Map and Reduce jobs should be only done once, after this, you can feed you cluster with massive load.

Mork0075
Maybe next time :) The application has already been implemented as described above. I don't want to re-implement the whole thing and introduce new programming model for developers. But Hadoop or GridGain are the frameworks I would definitely investigate.
pregzt
+1  A: 

Also, take a look at Terracota clustering solution.

Alexander Temerev
+1  A: 

For parallel processing, as Mork0075 said, hadoop is a great solution. Actually many companies are use it for very large log analysis. And an interesting project Hive have been build based on hadoop for data warehousing.

Anyway, I think your current design is quite scalable. As for your concern about all of workers hitting on the database, you can just put another messaging queue between workers and database. Workers put processing results in the queue, and you build another program to subscribe to the queue and update the database. The drawback is that two queues might make system too complicated. Of course you can just add another topic to the existing MQ system. That will make system more simpler. Another approach is use a shared file system, such as NFS, each worker machine mount the same directory on the shared file server, and each worker write its processing results into a separate file on the shared file server. Then you build a program to check new files to update database. In this approach you introduce another complexity: shared file server. You can judge which one is more simpler in your case.

yanky
+1  A: 

I recently spend some of my spare time investigating Spring Batch 2.0. This is new version of Java batching engine based on Spring framework. Guys who implemented Spring Batch concentrated on concurrency and parallelization of execution for this release. I must say it looks promising!

pregzt
A: 

If you already are using Spring/JEE, it is only natural to apply Spring Batch as a solution for your "concurrence architecture".

Two benefits right of the bat:

  1. Spring Batch (starting from 2.0) implements partitioning, that means that the framework will take care of partitioning data for you in separate partition steps ( StepExecution ), and delegating the actual execution of these steps to multiple threads or other distributed systems ( PartitionHandlers, e.g. TaskExecutorPartitionHandler or to be more distributed MessageChannelPartitionHandler, etc.. )

  2. Spring has a nice OXM package for dealing with XML + Spring Batch has a StaxEventItemReader that extracts fragments from the input XML document which would correspond to records for processing

Give Spring Batch a try. Let me know if you have any questions, I'll be glad to help out.

litius