Hello all,

This question is about data storage systems such as CouchDB, HDFS and HBase, and specifically which one is the right fit for my project.

I am looking at building a simple, customized Document Management System for my organization. Basically, we need the ability to store Word documents, PDFs and other similar files. I also want to store metadata about these files (e.g., author, dates, etc.). Usage permissions would also be handy, but those can probably be built on top of the metadata. I also need full-text indexing. Versioning, while not required, would be extremely useful.

I would like the ability to simply add hardware to expand the resources of the system and the system must support Network Attached Storage over the CIFS or NFS protocol(s).

I have read about CouchDB, HDFS and HBase. My preferred programming language is C#, as all of my end users will be running Windows machines and I want to build both web and WinForms clients.

My question is: which solution best fits my needs?

Based on my research, it appears that CouchDB (together with CouchDB-Lounge and CouchDB-Lucene) fits my needs perfectly. However, since I have already worked with CouchDB, I am worried that this bias may be causing me to overlook something useful in HDFS, HBase or a similar system.

Any and all opinions are welcome. I am looking for community input because I really do not want to make the wrong choice at the start of my project. Please ask if you need more information.

I thank you all for your time, input and assistance.

+3  A: 

Since you do not want an off-the-shelf tool such as SharePoint, you obviously have a lot of coding to do regardless of which database you choose. Full-text search of the content of your PDF, Word, etc. files will probably require some programming.

Of course, CouchDB documents are not the same as Microsoft Word "documents", but the ideas are similar. Generally, CouchDB is a good candidate for a CMS back-end: each JSON document stores the metadata and carries one or more Word, PDF, Excel, etc. files as attachments.
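To make the metadata-plus-attachment idea concrete, here is a minimal sketch of what such a document could look like. It is written in Python only to keep it short; the same JSON can be produced from C#. `_id` and `_attachments` (with `content_type` and base64 `data`) are CouchDB's own field names for inline attachments, while `author`, `created`, the document ID and the helper name `build_document` are assumptions made up for illustration.

```python
import base64
import json


def build_document(doc_id, author, created, filename, content_type, payload):
    """Build a CouchDB-style JSON document with one inline attachment.

    `author` and `created` are hypothetical metadata fields; use whatever
    your application needs. `_attachments` is CouchDB's reserved field.
    """
    return {
        "_id": doc_id,
        "author": author,
        "created": created,
        "_attachments": {
            filename: {
                "content_type": content_type,
                # Inline attachments carry the file bytes as base64 text.
                "data": base64.b64encode(payload).decode("ascii"),
            }
        },
    }


# Placeholder bytes stand in for a real PDF file read from disk.
doc = build_document(
    "report-2010",
    "A. Author",
    "2010-01-15",
    "report.pdf",
    "application/pdf",
    b"%PDF-1.4 placeholder",
)
print(json.dumps(doc, indent=2))
```

The resulting JSON would be sent to the database over HTTP (for example, a PUT to the document's URL). Large files are usually better uploaded as standalone attachments rather than inlined, so treat this as a shape illustration rather than a recommended upload path.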

(Remember, CouchDB's so-called "versioning" is an internal detail, and old revisions are discarded on compaction; it is not a general foundation for building application-level versioning features.)

jhs
couchdb-lucene can parse the contents of .doc and .pdf (and many more) files.
Jan Lehnardt
A: 

You may be interested in Lily, a CMS which is backed by HBase, among other technologies: architecture diagram at http://outerthought.org/lily/377-OTC/version/1/part/ImageData/data/lily-diagram-950px.png, project home at http://outerthought.org/lily/index.html.

Jeff Hammerbacher
A: 

There are some updates on the development of Lily: http://outerthought.org/blog/blog/395-OTC.html

Thomas Koch