I have a system that takes large files containing documents, splits them into the individual documents, and creates document objects to be persisted with JPA (or at least that is what I assume for this question).
Each file contains anywhere from 1 to 100 000 documents. The files come in various types:
- Compressed
  - Zip
  - Tar + gzip
  - Gzip
- Plain-text
- XML
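To make the format handling concrete, here is a minimal sketch of how I imagine opening an upload as a stream, whatever its container. The class name `UploadStreams` and the method `open` are hypothetical names of my own. It sniffs the first two bytes (gzip starts with `0x1f 0x8b`, zip with `PK`) and only uses classes from `java.util.zip`. The JDK has no tar reader, so tar + gzip would need an extra library on top of the gzip stream, and this sketch only reads the first entry of a zip archive.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipInputStream;

public class UploadStreams {

    /**
     * Hypothetical helper: wraps a raw upload stream according to its
     * leading magic bytes, without reading the whole file into memory.
     */
    public static InputStream open(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);

        // Peek at the first two bytes, then rewind.
        in.mark(2);
        int b0 = in.read();
        int b1 = in.read();
        in.reset();

        if (b0 == 0x1f && b1 == 0x8b) {
            // Gzip. For tar + gzip a tar reader would have to wrap this stream.
            return new GZIPInputStream(in);
        }
        if (b0 == 'P' && b1 == 'K') {
            // Zip. Position the stream at the first entry; reads then
            // return that entry's bytes until it ends.
            ZipInputStream zip = new ZipInputStream(in);
            zip.getNextEntry();
            return zip;
        }
        // Assume plain text or XML.
        return in;
    }
}
```

The point of the sketch is that nothing here touches `java.io.File`; it only needs an `InputStream` from somewhere, which is exactly the part I am unsure how to provide.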
Now, the biggest concern is that the Java EE specification forbids accessing local files, at least in the way that I'm used to.
I could save the files to a database table, but is that really a good way to do it? The files can be up to 2 GB, and accessing a file from the database would require downloading the whole file, either into memory or onto disk.
My first thought was to separate this process from the application server and use a more traditional approach, but I've been thinking about how to keep it on the application server for future purposes such as clustering.
My questions are basically:
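For what it's worth, the splitting step itself does not seem to need the whole file at once. Below is a rough sketch using StAX (`javax.xml.stream`, part of the JDK) that pulls documents one at a time from any `InputStream`. The class name `DocumentSplitter`, the `split` method, and the `<document>` element name are assumptions for illustration only, and the sketch assumes each document element contains only text, since `getElementText()` fails on nested elements.

```java
import java.io.InputStream;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DocumentSplitter {

    /**
     * Hypothetical sketch: hands the text of each <document> element to
     * the sink as soon as it has been read, and returns how many were seen.
     */
    public static int split(InputStream in, Consumer<String> sink)
            throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
        int count = 0;
        try {
            while (reader.hasNext()) {
                // Assumed element name; the real files may use another one.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "document".equals(reader.getLocalName())) {
                    // Only works for text-only elements.
                    sink.accept(reader.getElementText());
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }
}
```

In the real system the sink would build and persist a JPA entity instead of receiving a plain string. What I still don't know is where the `InputStream` should come from without local file access.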
- Is there a standard way or a recommended way of dealing with this in Java EE?
- Is there an application-server-specific way around this?
- Can you justify breaking this process out of the application server? And how would you design the communications channel between these two separate systems?