I am at a new company, and one of our goals is to implement a document search portal for our team and our clients. I am a bit worried that if we use an external service provider like Salesforce or some other cloud ECM, there will be a lot of integration work in the future. From a client's perspective, these documents will also live in the same bucket as our structured content (which is stored in the DB, not in MS Word docs).

If you have implemented document searching, what languages, frameworks, and technologies have you used? Do you have any failure stories? I don't have a problem using something out of the box, but I think it is important that we have control over the documents and the API to access them. I would like to use Rails if we go fully custom.

+2  A: 

Depending on your licensing needs, Lucene (Apache License 2.0) and Xapian (GPL) are both great, mature, fast search engine APIs with bindings for a lot of languages. I've used both of them with great success.
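
To give a rough feel for what these engines do under the hood, here is a toy inverted index in plain Python. This is only a sketch of the general technique, not the Lucene or Xapian API; the `TinyIndex` class, the `tokenize` helper, and the sample file names are made up for illustration. Real engines add ranking, stemming, on-disk storage, and much more.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: maps each term to the set of document ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index one document under the given (hypothetical) id."""
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the sorted ids of documents that contain every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)


# Hypothetical sample documents, for illustration only.
index = TinyIndex()
index.add("contract.doc", "Master services contract for client Acme")
index.add("invoice.pdf", "Invoice for Acme, March")
index.add("notes.txt", "Meeting notes about the services roadmap")
```

Calling `index.search("acme services")` intersects the posting sets for both terms, which is the basic idea behind an AND query in a full-text engine.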

ChristopheD
Lucene is probably the OSS "standard" for document indexing.
David
Good point. But I was thinking about going a step further and using Nuxeo or Alfresco as our back-end public repository. I guess I am wondering whether that is overkill and Lucene is a more flexible way to go. I just don't want to reinvent the wheel...
Bill Brasky
+1  A: 

Hi Bill,

Lucene is probably the safest choice because it is widely used and quite good.

The easiest way to benefit from Lucene is probably with Alfresco, which is a breeze to install, and has Lucene by default. It means you just need to install Alfresco, put your documents in the repository, and you can search for your documents using the powerful web search interface.

If you need to search programmatically, my recommendation is to use Alfresco's CMIS interface, which lets you run searches over REST. The JCR API is also available.
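
As a rough sketch of what a programmatic CMIS search could look like, the snippet below builds a full-text query URL using only the Python standard library. It assumes the CMIS 1.1 browser binding (`cmisselector=query` with the statement in `q`); the base URL, the function names, and the naive quote escaping are assumptions for illustration, so check your own Alfresco version for the actual endpoint and bindings it exposes.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint: adjust host, port, and path to your own installation.
BASE_URL = "http://localhost:8080/alfresco/api/-default-/public/cmis/versions/1.1/browser"


def cmis_fulltext_query(term):
    """Build a CMIS query statement that full-text searches documents for `term`."""
    # Naive escaping of backslashes and single quotes (an assumption, not a
    # complete implementation of the CMIS text-search escaping rules).
    escaped = term.replace("\\", "\\\\").replace("'", "\\'")
    return (
        "SELECT cmis:objectId, cmis:name FROM cmis:document "
        f"WHERE CONTAINS('{escaped}')"
    )


def cmis_query_url(base_url, term, max_items=10):
    """Return a GET URL for the query, assuming the CMIS 1.1 browser binding."""
    params = {
        "cmisselector": "query",
        "q": cmis_fulltext_query(term),
        "maxItems": max_items,
    }
    return f"{base_url}?{urlencode(params)}"


url = cmis_query_url(BASE_URL, "annual report")
```

The resulting URL can then be fetched with any HTTP client, with whatever authentication your repository requires; the response should be JSON listing the matching documents.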

Nicolas Raoul