I need to host 80 million TIFF files (1000 KB each), somewhere around 10 terabytes in total. What would be the best document management solution? The files need to stay on a filesystem, but we want them indexed through the document management system (SharePoint, Documentum, FileNet, etc.). We already have indexes in CSV format and want to reuse them instead of crawling through the 80 million files and recreating the indexes.
I think it would be best to transfer the indexes to a database such as SQL Server and keep the files on the filesystem. A DMS (file upload, access, etc.) can then be built on top of these indexes.
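To make the idea concrete, here is a minimal sketch of loading a CSV index into a database table and looking a file up by its ID. It uses Python's built-in sqlite3 module as a stand-in for SQL Server so that it runs anywhere. The column names (doc_id, file_path, title), the table name and the sample rows are assumptions for illustration; the real CSV layout is not given in the question.

```python
import csv
import io
import sqlite3


def load_index(csv_file, conn):
    """Copy rows from a CSV index into a table keyed by document ID."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_index ("
        "doc_id TEXT PRIMARY KEY, file_path TEXT NOT NULL, title TEXT)"
    )
    reader = csv.DictReader(csv_file)
    # Stream rows lazily so a very large CSV is never held in memory at once.
    rows = ((r["doc_id"], r["file_path"], r["title"]) for r in reader)
    with conn:  # one transaction for the whole load
        conn.executemany(
            "INSERT OR REPLACE INTO doc_index VALUES (?, ?, ?)", rows
        )


def find_path(conn, doc_id):
    """Return the filesystem path stored for doc_id, or None if unknown."""
    row = conn.execute(
        "SELECT file_path FROM doc_index WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else None


# Hypothetical sample data standing in for the existing CSV index.
sample = io.StringIO(
    "doc_id,file_path,title\n"
    "A001,/tiff/00/A001.tif,Invoice 1\n"
    "A002,/tiff/00/A002.tif,Invoice 2\n"
)
conn = sqlite3.connect(":memory:")
load_index(sample, conn)
print(find_path(conn, "A002"))
```

The TIFF files themselves are never touched here: only the index rows move into the database, and the application resolves a document ID to a path on the filesystem when a file is requested.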
I would look at something like Hadoop. It is possible to run Hadoop on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). As an example, The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw TIFF image data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth).[14]
SharePoint 2010 can handle document libraries that large, and it can also be done under the WSS3/MOSS2007 editions with some careful planning and architecting.
I am not really familiar with Documentum, but in SharePoint land I would create a custom content type that maps the columns of your CSV to SharePoint fields, then provision one or more document libraries using the new type (break it up however makes sense). With that much data I would seriously consider breaking it up into multiple site collections and/or taking a look at the Remote Blob Storage API: http://technet.microsoft.com/en-us/magazine/2009.06.insidesharepoint.aspx
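One possible way to "break it up" is to assign each document to a library with a stable hash of its ID, so that the same document always lands in the same place and the libraries stay roughly even in size. The sketch below only shows that assignment step. The library naming scheme, the library count of 16 and the doc_id values are all assumptions, and the actual provisioning of libraries in SharePoint is not shown.

```python
import hashlib
from collections import Counter


def library_for(doc_id, library_count):
    """Map a document ID to one of library_count hypothetical library names."""
    # hashlib gives the same result on every run and machine, unlike hash().
    digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % library_count
    return f"TiffLibrary{bucket:03d}"


# Hypothetical document IDs; real ones would come from the CSV index.
doc_ids = [f"DOC{n:08d}" for n in range(10000)]
sizes = Counter(library_for(d, 16) for d in doc_ids)
for name in sorted(sizes):
    print(name, sizes[name])
```

A hash spreads files evenly but ignores meaning. If users browse by date, customer or document type, partitioning on one of those index columns may make more sense than a hash.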