I have a system that is supposed to take large files containing documents and process these to split up the individual documents and create document objects to be persisted with JPA (or at least it is assumed in this question).

Each file contains between 1 and 100 000 documents. The files come in various types:

  • Compressed
    • Zip
    • Tar + gzip
    • Gzip
  • Plain-text
  • XML
  • PDF

Now the biggest concern is that the Java EE (EJB) specification forbids accessing local files, at least in the way that I'm used to.

I could save the files to a database table, but is that really a good way to do it? The files can be up to 2GB and accessing the files from the database would require that you download the whole file, either into memory or onto disk.

My first thought was to separate this process from the application server and use a more traditional approach, but I've been thinking about how to keep it on the application server for future purposes such as clustering.

My questions are basically

  1. Is there a standard way or a recommended way of dealing with this in Java EE?
  2. Is there an application server specific way around this?
  3. Can you justify breaking this process out of the application server? And how would you design the communications channel between these two separate systems?
+1  A: 

accessing the files from the database would require that you download the whole file, either into memory or onto disk.

This is not entirely true. You are not forced to put the whole thing in an intermediate byte[] or so. You can just keep using streams. Get an InputStream of it using ResultSet#getBinaryStream() and immediately handle it the usual way, e.g. writing to HttpServletResponse#getOutputStream(). The cost is only the buffer size, which you can define yourself.
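A minimal sketch of that idea, using only the JDK. The class and method names (BlobStreaming, copy, streamColumn) are made up for illustration, and the column name is whatever your own schema uses. Whether the bytes are really streamed from the server or buffered first depends on the JDBC driver and its configuration, so treat this as an outline rather than a guarantee.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.sql.ResultSet;
import java.sql.SQLException;

class BlobStreaming {

    // Copies everything from in to out through a fixed-size buffer.
    // Memory use is bounded by bufferSize, not by the size of the data.
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    // Streams one binary column of the current row to out.
    // Returns the number of bytes written, or 0 if the column is SQL NULL.
    static long streamColumn(ResultSet rs, String column, OutputStream out)
            throws SQLException, IOException {
        try (InputStream in = rs.getBinaryStream(column)) {
            if (in == null) {
                return 0;
            }
            return copy(in, out, 8192);
        }
    }
}
```

In a servlet, the OutputStream passed to streamColumn would be the one from HttpServletResponse#getOutputStream(); for the processing case in the question it could just as well be a parser that consumes the stream.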

Is there a standard way or a recommended way of dealing with this in Java EE?

Either the database or a fixed disk file system path with r/w access for the appserver. E.g. /var/webapp/files on the root disk.

BalusC
Thank you for pointing out the obvious flaw in my question. I've not dealt much personally with ResultSet, but this was very interesting.
Peter Lindqvist
+1  A: 

I think the healthiest way to do it is to do without a Java application server.

Application servers like to manage resources (CPU, memory, threads) their own way. Performing long-running, I/O intensive batch processing is prone to distorting this kind of resource management.

I suggest using an external process to split up the files, with a periodic cleanup to keep disk usage under control, and using the AS for read access via the file system the way BalusC suggested.

I suppose concurrent access issues would be dealt with by the JPA layer -- which I admittedly don't know much about, but I think it is also available in a Java SE flavour.

AndreaG
Have you got personal experience with this 'distorting' you mention? I think I understand the theoretical concept, but I'm having a hard time translating it into some sort of practical scenario.
Peter Lindqvist
Actually I have used AS for quite standard stuff: light-weight beans, servlets etc. I don't have experience using very large MDBeans, JCA or TIBCO as the other posts suggest. Anyway, the ASs I've worked with start with a fixed maximum amount of memory. If you perform memory-hungry operations in your thread, I don't know what happens to the other threads (for example the garbage collector).
AndreaG
Furthermore, blocking I/O operations keep a thread (which in this model is a pooled, and hence limited, resource) busy: you are relying on the OS to be kind; if it isn't (it keeps lots of files open and waiting), your resources (threads) get congested. Of course, it also depends on what triggers the processing: Message-Driven Beans or some other elaborate mechanism constitute some kind of congestion control.
AndreaG
+1  A: 

Is there a standard way or a recommended way of dealing with this in Java EE?

I'd use a real integration layer (as in EAI) for this purpose, running as an external process. Integration tools (ETL, EAI, ESB) are specifically designed to deal with... integration and many of them provide everything required out of the box (simplified version: transport, connectors, transformation, routing, security).

Basically, when dealing with files, a file connector is used to monitor a directory for incoming files, which are then parsed/split into messages (optionally applying some transformations) and sent to an endpoint for business processing.

Have a look at Mule ESB for example (has a File Connector, supports many transports, can be run as a standalone process). Or maybe Spring Integration (coupled with Spring Batch?) which has File and JMS Adapters too. But I don't have much experience with it so I can't really say anything about it. Or, if you are rich, you could look at Tibco EMS, WebMethods, etc. Or build your own solution using some parsing library (e.g. jFFP or Flatworm).

Is there an application server specific way around this?

I'm not aware of anything like this.

Can you justify breaking this process out of the application server? And how would you design the communications channel between these two separate systems?

As I said, I'd use an external process for the file processing stuff (better suited) and send the content of the file as messages over JMS to the app server for the business processing (and thus benefit from JEE features such as load balancing and transaction management).
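To make the "split the file into messages" step a bit more concrete, here is a rough JDK-only sketch. It assumes a hypothetical plain-text format in which documents are separated by a line containing only "---"; the real formats in the question (XML, PDF, ...) would need their own parsers. The Consumer<String> is a stand-in for whatever actually sends the message (for instance a JMS producer), which is left out here because it needs the JMS API and a broker.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

class DocumentSplitter {

    // Hypothetical separator line between documents (an assumption, not from the question).
    static final String SEPARATOR = "---";

    // Reads the stream line by line and hands each complete document to the sink.
    // Only one document is held in memory at a time. Returns the number of documents sent.
    static int split(InputStream in, Consumer<String> sink) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        StringBuilder current = new StringBuilder();
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.equals(SEPARATOR)) {
                if (current.length() > 0) {
                    sink.accept(current.toString());
                    count++;
                    current.setLength(0);
                }
            } else {
                current.append(line).append('\n');
            }
        }
        // The last document has no trailing separator.
        if (current.length() > 0) {
            sink.accept(current.toString());
            count++;
        }
        return count;
    }
}
```

With one message per document, the app server side only ever sees small units of work, which is what makes retries and transactions per document practical.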

Pascal Thivent
The Spring Integration looks very interesting! Thanks!
Peter Lindqvist
+1  A: 

The specification forbids accessing files using java.io. There are other legal ways to access files, e.g. via a DataSource/JDBC driver, or via a resource connector.

See p. 545 of "JSR 220: Enterprise JavaBeans™, Version 3.0, EJB Core Contracts and Requirements".


... using JDBC for file access. Could you please explain it a bit more in detail?

A file is a data store in the same way that a database is. It's a pretty good data store for serially accessed, unstructured, character data, and not so great when you want transaction safety, multi-user access, writable random-access, or structured binary data. In an enterprise system you tend to have at least one of the latter set of requirements nearly all of the time.

Although it's not strictly true to say "In an enterprise system there are no files" (because there are log files and nearly all databases use files at a low level) it's a pretty good design rule-of-thumb, because of all of the problems that data files cause in a high performance, multi-user, transaction-safe, read-write, enterprise system.

Unfortunately the business world is full of business data stored in files. You have to deal with them. Some files (e.g. Excel spreadsheets) have enough in common with a simple database that they can be worth accessing through a JDBC driver. I've never heard of anyone accessing plain text files through a JDBC driver, but you could - or you could use a more generic resource adapter instead (according to the EJB3 specification, JDBC is a resource manager API).

richj
Yes, I've seen this come up. Perhaps I'm stupid, but I cannot grasp the concept of using JDBC for file access. Could you please explain it a bit more in detail?
Peter Lindqvist
Thanks for the reference!
Peter Lindqvist
+2  A: 

I sketch here a few more propositions and consider the following concerns:

  • scalability (file size, clustering, etc.)
  • batch architecture (job recovery, error handling, monitoring, etc.)
  • compliance with J2EE

With JCA

JCA connectors belong to the JEE stack and permit inbound/outbound connectivity from/to the EJB world. JDBC and JMS are usually implemented as JCA connectors. An inbound JCA connector can use threads (through the JCA work management abstraction) and transactions. It can then forward any processing to a message-driven bean (MDB).

  • write a JCA connector that polls for new files, then processes them and delegates further processing to a message-driven bean in a synchronous way
  • the MDB can then persist the information in the database with JPA
  • the JCA connector has control over the transaction, and several MDB invocations can be in the same transaction
  • the file system is not transactional, so you will somehow need to figure out how to deal with errors such as faulty input files
  • you can probably use streaming (InputStream) all along the pipeline
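As an illustration of the streaming point, the compressed inputs do not have to be unpacked to disk first. Below is a small sketch for the Zip case using java.util.zip.ZipInputStream; the names ArchiveStreaming, EntryHandler and forEachZipEntry are invented for this example. A plain Gzip file can be wrapped in java.util.zip.GZIPInputStream in the same spirit, while Tar is not covered by the JDK and would need a third-party library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class ArchiveStreaming {

    // Callback that receives each entry as a stream (hypothetical interface).
    // The handler must read the content but must not close the stream.
    interface EntryHandler {
        void handle(String name, InputStream content) throws IOException;
    }

    // Walks through a zip archive sequentially, entry by entry, without
    // extracting anything to disk. Returns the number of file entries handled.
    static int forEachZipEntry(InputStream raw, EntryHandler handler) throws IOException {
        int count = 0;
        try (ZipInputStream zip = new ZipInputStream(raw)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    // While positioned on an entry, reads on zip end at the end of that entry.
                    handler.handle(entry.getName(), zip);
                    count++;
                }
                zip.closeEntry();
            }
        }
        return count;
    }
}
```

The raw stream can come from anywhere, for example from ResultSet#getBinaryStream() as in BalusC's answer, so the archive itself never has to be held in memory as a whole.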

With plain threads

We can achieve more or less the same as the JCA way, using threads that are launched from a web servlet context listener (or possibly an EJB timer).

  • The thread polls for new files; if a file is found, it processes it and delegates further processing to a regular SLSB in a synchronous way.
  • Threads in the web container have access to UserTransaction and can control the transaction
  • The EJB can be local, so that the InputStream is passed by reference
  • Deployment of the web module + EJB can be done with an EAR
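A rough sketch of the polling part, JDK only. Everything named here (InboxPoller, FileProcessor, the inbox/done/failed directories) is hypothetical. The FileProcessor stands in for the local SLSB call, and transaction demarcation with UserTransaction is deliberately left out, since it needs the container. Moving a file to "done" or "failed" is one simple way to cope with faulty input, given that the file system is not transactional.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class InboxPoller {

    // Stand-in for the business call (for example a local SLSB method).
    interface FileProcessor {
        void process(String fileName, InputStream content) throws Exception;
    }

    private final Path inbox;
    private final Path done;
    private final Path failed;
    private final FileProcessor processor;

    InboxPoller(Path inbox, Path done, Path failed, FileProcessor processor) {
        this.inbox = inbox;
        this.done = done;
        this.failed = failed;
        this.processor = processor;
    }

    // One polling pass: every regular file in the inbox is streamed to the
    // processor, then moved to "done" on success or "failed" on any exception.
    // Returns the number of files handled in this pass.
    int pollOnce() throws IOException {
        int handled = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inbox)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                Path target = done;
                try (InputStream in = Files.newInputStream(file)) {
                    processor.process(file.getFileName().toString(), in);
                } catch (Exception e) {
                    target = failed; // keep the faulty file for later inspection
                }
                Files.move(file, target.resolve(file.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
                handled++;
            }
        }
        return handled;
    }
}
```

The thread started from the context listener would simply call pollOnce() in a loop with a sleep in between, and stop when the listener's contextDestroyed is invoked.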

With JMS

To avoid the need for several concurrent polling threads and the problem of job acquisition/locking, the actual processing can be done asynchronously using JMS. JMS can also be useful for splitting the processing into smaller tasks.

  • A periodic task polls for new files. If a file is found, a JMS message is queued.
  • When the JMS message is delivered, the file is read and processed and the information is persisted in the database with JPA
  • if JMS processing fails, the app server may retry automatically or put the message in the dead message queue
  • monitoring/error handling is more complicated
  • you can probably use streaming

With ESB

Many projects have emerged in the past years to deal with integration: JBI, ServiceMix, OpenESB, Mule, Spring Integration, Java CAPS, BPEL. Some are technologies, some are platforms, and there is some overlap between them. They all have a wealth of connectors to route, transform and orchestrate message flows. IMHO, messages are supposed to be small pieces of information, and it may be hard to use these technologies to process your large data files. The Enterprise Integration Patterns website is an excellent resource for more information.

IMO, the approach that best fits the JEE philosophy is JCA, but the effort to invest is relatively high. In your case, using plain threads that delegate further processing to an SLSB is maybe the easiest solution. The JMS approach (close to the proposition of P. Thivent) can be interesting if the processing pipeline gets more complicated. Using an ESB seems overkill to me.

ewernli
Do you have any good online resources or book tips on JCA? Very good summary btw! I do somehow like the idea of having multiple choices, but sometimes life is easier where there is one obvious better choice.
Peter Lindqvist
The best doc about JCA to start with is "Creating Resource Adapter with J2EE Connector Architecture 1.5". The corresponding code can be found in the J2EE samples which come with the SDK. http://developers.sun.com/appserver/reference/techart/resource_adapters.pdf
ewernli
I refined my answer. A plain thread in the web container + EJB is maybe just enough in your case and would be relatively easy. It is still an all-in-container solution.
ewernli