A customer needs a document management system, and I'm gathering information about it.

I know about SharePoint and Alfresco, but in this case I'm evaluating what is needed to build it from scratch, so please refrain from suggesting either of these (we are evaluating them separately; this is all about developing, not deploying an existing solution).

These are the requirements:

  • There is one very specific requirement for the legal management of the documents, particular to our local government; apart from that:
  • Operation similar to Google Docs from the end user's point of view
  • Need to store info from 200+ end users (UPDATE: it is really 700+ end users)
  • Mainly Office documents, PDF, and text. I already have plain-text extraction from these binary files.
  • No wiki, no portal creation, barely any workflow and only a very simple one; it is only file management
  • Central repository, shared across the company, integrated with Active Directory
  • Fast searching
  • Transparent desktop integration
  • Web interface
  • Multiplatform, if possible

So, these are the things I have off the top of my head:

  • Storage: I know that SharePoint saves everything in the DB (Alfresco too?). That is a nightmare, IMHO. I prefer to put the metadata in a DB and the files on disk.

I'm thinking about requiring ZFS in this case and leveraging its capabilities for versioning, snapshots, and scaling. Or maybe using git as the storage backend (will git work fine?)

So, where can I learn more about how to handle a large pool of documents, in ZFS or any regular file system? For example, how to lay out the folder structure for easy management, fast responses, easy backup, etc.

  • Metadata: I'm thinking of a regular DB here, but I wonder if there is more merit in saving everything in Lucene (I have some experience with Lucene, but I worry because Lucene can't be federated, right?).

If I use a search engine as the metadata database I can save some work (no need for a second pass for indexing), but a regular database engine is more standard.

  • Tech: I will probably build this in Django, PyLucene, and PostgreSQL, and do the shell integration for Windows (I have no problem doing that).
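
To make the folder-layout question under Storage more concrete, here is a minimal sketch of one common approach: name each file by a hash of its content and fan the files out over a couple of directory levels so that no single folder grows huge. The function names (`sharded_path`, `store`), the choice of SHA-256, and the two-level/two-character fan-out are all assumptions for illustration, not something ZFS or any existing DMS prescribes.

```python
import hashlib
from pathlib import Path


def sharded_path(root: Path, content: bytes, levels: int = 2, width: int = 2) -> Path:
    """Return root/ab/cd/<full digest> for content whose SHA-256 starts with 'abcd'."""
    digest = hashlib.sha256(content).hexdigest()
    # Use the first `levels` chunks of the digest as nested directory names.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return root.joinpath(*parts, digest)


def store(root: Path, content: bytes) -> Path:
    """Write content under its sharded path (hypothetical helper); identical content is stored once."""
    path = sharded_path(root, content)
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        # Write to a temporary name first, then rename, so readers never see a partial file.
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(content)
        tmp.replace(path)
    return path
```

With this layout the metadata DB only needs to remember the digest (or the path) per document version, and backups can copy the tree incrementally because stored files never change once written.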

I will appreciate any hints or info on how to properly implement this solution.

+1  A: 
  1. SharePoint and Alfresco are platforms where you can do quite a bit of customization, so even using them really means you are building something.

  2. SharePoint stores blobs in the DB by default, but has ways to put them on a filesystem

  3. If you make it yourself, support the FrontPage extensions that Office apps use to communicate with SharePoint and Alfresco, and serve the documents with the right headers that tell IE to start the app. This way you get the same integration with Office apps that SharePoint has (users really love this feature) -- it's just a simple HTTP protocol

  4. If you go with SharePoint, my company has a free document previewer that can view PDFs and will soon support Office docs. We sell the underlying tech, but it's Windows only.

  5. I love Django, and use it for all personal projects, but I really think .NET and Java will have more third-party support for the things you need, and much of your code will be portable to SharePoint or Alfresco if you decide to go that way later.

EDIT: More info on #3 as requested

http://blogs.msdn.com/mikefitz/archive/2005/03/14/395112.aspx
http://blogs.msdn.com/stcheng/archive/2008/12/17/wss-use-rpc-protocol-to-access-wss-v3-site.aspx
Official docs: http://msdn.microsoft.com/en-us/library/ms442469.aspx

Lou Franco
Do you have any info on where I can learn more about point 3?
mamcx
+1  A: 

I find the "similar to Google Docs" and "Transparent desktop integration" requirements a bit vague. But judging from the question, you are more concerned about the backend and document storage, and are leaning toward a more open-source stack (with AD integration)?

Anyway, I'm using KnowledgeTree as our document management system. In its implementation, all files reside in a directory on the file system, and the database keeps track of the path, the corresponding metadata, access logs, and versioning information. It basically keeps several versions of the same file when a document is updated, which I think is a fair enough idea implementation-wise, considering Microsoft Office documents were mostly binary (up through Office 2003).
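
As a rough illustration of that split (files on disk, the database tracking path and version), here is a minimal sketch. It is not KnowledgeTree's actual schema: the table name `document_version`, the columns, the `doc_id/vN` file layout, and the helper names are all made up for the example, and SQLite stands in for whatever database you end up using.

```python
import sqlite3
from pathlib import Path

# Hypothetical schema: one row per stored version of a document.
SCHEMA = """
CREATE TABLE IF NOT EXISTS document_version (
    doc_id   INTEGER NOT NULL,
    version  INTEGER NOT NULL,
    name     TEXT    NOT NULL,
    path     TEXT    NOT NULL,
    author   TEXT    NOT NULL,
    created  TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (doc_id, version)
);
"""


def open_db(db_path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def add_version(conn, root: Path, doc_id: int, name: str, author: str, content: bytes) -> int:
    """Save content as a new file and record it as the next version of doc_id."""
    row = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM document_version WHERE doc_id = ?",
        (doc_id,),
    ).fetchone()
    version = row[0] + 1
    # Each version is a full copy on disk; earlier versions are never overwritten.
    path = root / str(doc_id) / f"v{version}"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO document_version (doc_id, version, name, path, author) "
            "VALUES (?, ?, ?, ?, ?)",
            (doc_id, version, name, str(path), author),
        )
    return version


def latest(conn, doc_id: int):
    """Return (version, path) of the newest version, or None if the document is unknown."""
    return conn.execute(
        "SELECT version, path FROM document_version "
        "WHERE doc_id = ? ORDER BY version DESC LIMIT 1",
        (doc_id,),
    ).fetchone()
```

Keeping whole copies per version is wasteful for large files, but it is simple and works for binary formats where diffs are not meaningful. A real system would also need locking around the version-number lookup if several users can save the same document at once.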

You may want to find out how many documents they currently have and how many they expect to flow into this system on a daily basis. (Or, from a different point of view, knowing what kind of documents they plan to store would give you hints about what kind of load your server is supposed to handle.)

My guess is that you could most likely get away with a setup of local filesystems plus a database storing the metadata, unless you are sure the system is expected to handle a massive load of documents on a daily basis (imagine being Flickr for documents ;) ).

Seh Hui 'Felix' Leong
OK, that is clear. Yes, it is for a departmental solution, not a web service. I know the requirements look vague, but in fact the expected end-user experience must be similar to Google Docs (and that is not very hard), and I have the solution for the desktop integration. It is the back-end implementation that I need more info on.
mamcx