I am starting a new open source project to develop an application that provides services for converting documents into other formats (e.g. doc -> html, pdf -> html, plain text -> html). It will rely on many other open source tools to perform the conversions.

I am looking for a framework that I can use for this purpose. The main requirements of the application are as follows:

  • Provide both a library for direct use, as well as a web service that exposes the underlying library.
  • Provide a plugin-oriented architecture. Clients should be able to plug conversion tools in and out, so that tools can be added and removed in the future.
  • Provide a fallback mechanism. If one tool fails to convert a document, the application should fall back to the other installed tools. For example: try tool A; if A fails, try tool B; if B also fails, try tool C; if C succeeds, stop and return the result.
  • Be robust. If a tool crashes, it should not take down the entire application.
  • Recover from failure. The application should be able to restart itself after a catastrophic event.

Does anyone have recommendations for existing Java frameworks that satisfy most (if not all) of the above requirements?

Thanks!

PS. I am currently investigating the UIMA (Unstructured Information Management Architecture) framework. I know that UIMA is normally used for natural language processing, to extract entities from text documents, but on the surface (from reading the manuals; I haven't tried anything further) it seems flexible enough to be adapted to my requirements above. Does anyone have experience with UIMA? Please share the pros and cons, and whether it is feasible as the framework for this application given the requirements listed above.

+1  A: 

Apache Cocoon sounds like the closest to what you're describing, but I have no idea of its failure characteristics. UIMA is most often used for building text-mining pipelines, which isn't quite what you're describing.

I suspect you'll need to write something yourself. For the plugin aspect you would define an interface and a central abstraction, and then use Spring, Guice, OSGi or similar to manage the implementations.
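
To make that concrete, here is a minimal sketch of the interface and central abstraction in plain Java, including the fallback behaviour from the question. The names DocumentConverter and ConverterRegistry are hypothetical, chosen only for illustration; in a real application the registration calls would probably be handled by Spring, Guice or OSGi rather than done by hand.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical plugin contract: one implementation per wrapped conversion tool. */
interface DocumentConverter {
    String name();

    /** Whether this tool can handle the given format pair, e.g. "pdf" -> "html". */
    boolean supports(String sourceFormat, String targetFormat);

    byte[] convert(byte[] input, String sourceFormat, String targetFormat) throws Exception;
}

/** Hypothetical central abstraction: holds the plugins and tries them in order. */
class ConverterRegistry {
    // Thread-safe list so tools can be plugged in and out while conversions run.
    private final List<DocumentConverter> converters = new CopyOnWriteArrayList<>();

    void register(DocumentConverter converter) {
        converters.add(converter);
    }

    void unregister(DocumentConverter converter) {
        converters.remove(converter);
    }

    /** Returns the first successful result, or empty if every suitable tool failed. */
    Optional<byte[]> convert(byte[] input, String from, String to) {
        for (DocumentConverter converter : converters) {
            if (!converter.supports(from, to)) {
                continue;
            }
            try {
                return Optional.of(converter.convert(input, from, to));
            } catch (Exception e) {
                // One failing tool must not stop the chain: log it and fall back.
                System.err.println(converter.name() + " failed: " + e.getMessage());
            }
        }
        return Optional.empty();
    }
}
```

Catching exceptions only isolates failures that surface as Java exceptions. A tool that hangs or crashes the JVM (native code, for example) would need stronger isolation, such as running it in a separate process with a timeout.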

You might find a format identification framework like JHOVE useful.

Jim Downing