views:

430

answers:

5

We have to convert EVERYTHING to images for archiving purposes: DOC, HTML, email, ZIP, PDF, TXT, and any other document you can read or view on a computer. In addition, the conversion must recurse into all embedded attachments and into the files inside ZIP archives.

I only know ImgMaker. Is it the best option, or is there something better? My boss asked me to find out whether there are any alternatives to ImgMaker.

Any open-source or commercial suggestions are welcome.
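To make the recursion requirement concrete, here is a minimal sketch of the "files in ZIP" part only, using Python's standard `zipfile` module. The function name `iter_leaf_files` and the `max_depth` guard are illustrative assumptions, not part of ImgMaker or any other product mentioned here. Embedded email attachments would need a similar walk with an email parser.

```python
import io
import zipfile


def iter_leaf_files(name, data, depth=0, max_depth=10):
    """Yield (path, bytes) for every non-archive file, descending into nested ZIPs.

    `name` is a display path, `data` is the raw file content.
    `max_depth` is an assumed safety limit against deeply nested archives;
    an archive found beyond that depth is yielded unopened.
    """
    if depth <= max_depth and zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue  # directory entries carry no content
                child_name = f"{name}/{info.filename}"
                yield from iter_leaf_files(
                    child_name, archive.read(info), depth + 1, max_depth
                )
    else:
        # Not a ZIP (or too deep): hand it to whatever renders single files.
        yield name, data
```

Note that formats such as DOCX are themselves ZIP containers, so a real pipeline would probably check for known document types before falling back to this generic expansion.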

A: 

Uh? How do you expect to convert a zip archive to an image? What should the pixels show? Should it be lossless, so you can convert back? If it's for archiving, I would guess that is a requirement, but it sounds weird.

unwind
For ZIP, we convert all the files inside the archive. Yes, it is a requirement. From my boss's point of view, an IT person's definition of a file is not important; the business information inside it is. No data loss: that's my boss's objective.
Dennis Cheung
But ... There will be huge amounts of data loss if you do this, surely? If you do it for a document, you will only get the "rendered" version of the document, which loses all the structure and so on, it won't be possible to go back. Insane.
unwind
A: 

What's going to happen to the tiff images afterwards? Assuming you want to manage them in some way, it seems to me you'd be better off looking for some complete documentation management product that can take these doc types as input and manage/archive the (presumably) large number of images that you'll have.

Otherwise you would seem to be re-inventing the wheel.

If you want open-source, something like Alfresco

Note the server based transformation feature below

Alfresco offers one integrated repository to manage all formats of content across image management, document management, web content management and email repositories. The repository is a modern platform with:

  • One Repository for any Digital Asset
  • The industry’s most scalable, standards-based, JSR-170 content repository
  • Standards support for JSR-170, Web Services and REST
  • High-Availability, Fault Tolerance and Scalability – Auto failover and clustering
  • Secure Distributed Capture over Web Services, HTTP and HTTPS
  • Reuse of Alfresco Business Policy Rules
  • Server-based transformation between many formats including TIFF, JPEG, GIF, PNG, MS-Office, PDF and FLASH
  • Metadata Extraction and Management
  • Automatic Classification Framework
Paul
We can't. For legal reasons, and because of the lifecycle of document formats (can anyone still open a WinWord 1.x file?), it must be standard images or PDF. (But PDF needs a reader; images do not.)
Dennis Cheung
So convert everything into images. That was my point in highlighting Alfresco's ability to convert formats.
Paul
Thanks for your suggestion. I'll take a look at Alfresco. The actual problem for me is that we already have our own dirty devil of a wheel, and I have to keep it running. None of us has the right or the role to replace it.
Dennis Cheung
I just checked Alfresco's docs and source. I'm sorry, but it cannot help me. It uses POI, HSSF and POIFS, and it supports only a very limited set of source and target MIME types (e.g. *.MSG can only be extracted to *.TXT). Too much information would be lost. Thanks anyway.
Dennis Cheung
Oh well :( There are other commercial suppliers (eCopy is one that I believe is strong in image handling), I don't know a lot about the specifics, though
Paul
A: 

The question as asked cannot be answered sensibly. One obvious solution is to simply rename each file by attaching .tiff. E.g. you could get ringtone.mp3.tiff. Insane as it is, there are not many better ways to convert an .mp3 to a .tiff.

Note that this is not an IT problem. The business is assuming everything is an image, and music is the trivial example of something that isn't.

( To clarify - this was assuming an automated setting, e.g. to archive incoming email for legal reasons. If that's required, you MUST archive incoming MP3's too. If you've got humans in the loop, this question would not belong on a programming forum. )

MSalters
I think it's clear what is meant. "any document you can read/view on computer" is in the question.
Paul
+1  A: 

I don't know if this will help, because it sounds like you want something totally automated, but there are many pseudo-printer drivers that can create TIFF images as output. For example:

http://sourceforge.net/projects/pdfcreator/

Mark Ransom
We've thought about it. It's not that difficult, and we would have full control and the ability to fine-tune. But then we would be building yet another wheel of our own for automation and for recursing into embedded attachments, and we would be trapped with new problems (e.g. unexpected popups).
Dennis Cheung
We do not want to focus on "how to make an image". We would rather leave those issues to the experts. What we want is: "I give you a file, and it returns some TIFF pages or the reason why the conversion failed".
Dennis Cheung
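The contract described in the comment above ("a file in, TIFF pages or a failure reason out") could look roughly like the following sketch. Everything here is hypothetical: `ConversionResult`, `convert`, and the `converters` registry are illustrative names, and the actual rendering is left to whichever toolkit gets plugged in.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List, Optional


@dataclass
class ConversionResult:
    source: str
    pages: List[bytes] = field(default_factory=list)  # one encoded TIFF page each
    error: Optional[str] = None  # set only when the conversion failed

    @property
    def ok(self) -> bool:
        return self.error is None


# A converter takes raw file bytes and returns encoded TIFF pages (assumed shape).
Converter = Callable[[bytes], List[bytes]]


def convert(name: str, data: bytes, converters: Dict[str, Converter]) -> ConversionResult:
    """Dispatch on file extension; never raise, always report a result."""
    ext = Path(name).suffix.lower()
    converter = converters.get(ext)
    if converter is None:
        return ConversionResult(name, error=f"no converter registered for '{ext}'")
    try:
        return ConversionResult(name, pages=converter(data))
    except Exception as exc:
        # Turn any renderer failure into a reported reason instead of a crash.
        return ConversionResult(name, error=f"{type(exc).__name__}: {exc}")
```

The point of the design is that the caller always gets a result object back, so a batch run can log failures per file and keep going.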
+1  A: 

There is a whole industry built around this type of function and numerous service providers that charge a fee per document to do this type of conversion. You are better off buying than building it on your own.

The idea of converting Everything is fundamentally a fool's errand, as you would need a single program that could render every file type ever created (in essence recreating every piece of software that ever wrote a data file AND recreating every version of each). Also, not every file format has a direct rendered form. For example, what do you do with a database file, a DLL, an XML file, or a WAV file?

If you are looking for something that does a reasonable job for a large number of formats, there are two main players with OEM toolkits, but both are extremely expensive and neither supports the Java platform directly. I use the first one (OutsideIn), if you have any additional questions.

Stellent (now Oracle) OutsideIn: http://www.oracle.com/technologies/embedded/outside-in.html

Autonomy KeyView: http://www.autonomy.com/content/Products/idol-modules-keyview-viewing/index.en.html

JohnFx