Any recommendations for a small, lightweight bag-of-words search engine?

I have a set of 'documents' that are each basically a small bag of arbitrary words. Given a new document, I need to get a list of 'similar' documents, along with a weight indicating how similar each one is. Documents are likely to be small: a couple of paragraphs at most.

  • Stemming would be great but is not strictly required.
  • Word expansion with WordNet-style word nets is not required.
  • Open source or freeware preferred, as this is a prototype, not a full-blown project.
  • Unix/Linux platform preferred.

I'd be using it as a subcomponent. I expect only to feed it documents, each with an ID, and later to search for documents 'similar' to one I currently have.
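
To make the requirement concrete, here is a minimal sketch of the interface described above, in plain Python with only the standard library. It is not taken from any of the engines suggested below. The names `BagOfWordsIndex`, `add`, and `similar` are hypothetical, and scoring by cosine similarity over raw word counts is an assumption about what 'similar' should mean. It does no stemming and scans every stored document on each query, so it only illustrates the shape of the problem.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BagOfWordsIndex:
    """Hypothetical in-memory index mapping a document ID to its word counts."""

    def __init__(self):
        self.docs = {}

    def add(self, doc_id, text):
        """Store (or overwrite) a document under the given ID."""
        self.docs[doc_id] = Counter(tokenize(text))

    def similar(self, text, top=5):
        """Return (doc_id, score) pairs, best first.

        The score is the cosine similarity of the word-count vectors,
        so it lies between 0 and 1. Documents sharing no words are skipped.
        """
        query = Counter(tokenize(text))
        q_norm = math.sqrt(sum(c * c for c in query.values()))
        results = []
        for doc_id, vec in self.docs.items():
            dot = sum(count * vec[word] for word, count in query.items())
            if dot == 0:
                continue  # no shared words (also covers an empty query)
            d_norm = math.sqrt(sum(c * c for c in vec.values()))
            results.append((doc_id, dot / (q_norm * d_norm)))
        results.sort(key=lambda pair: pair[1], reverse=True)
        return results[:top]


index = BagOfWordsIndex()
index.add("a", "the quick brown fox")
index.add("b", "the lazy brown dog")
index.add("c", "completely unrelated words")
print(index.similar("quick brown fox jumps"))
```

A real engine would typically add an inverted index, term weighting such as TF-IDF, and optional stemming on top of this idea.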

A: 

Solr or Sphinx. They aren't exactly lightweight, but I wouldn't recommend anything smaller: if the project turns out to be successful and needs to grow, switching search engines might be painful.

Mauricio Scheffer
Can you use Sphinx without a database (MySQL or PostgreSQL), i.e. feed it directly with files?
Pascal Thivent
Yes, using the xmlpipe2 source: http://www.sphinxsearch.com/docs/current.html#xmlpipe2
Mauricio Scheffer
Yeah, I saw that. But are all files XML-formatted? My point is that Sphinx is a solution made to index data from a table or from XML. It's not a solution for unstructured data outside a database.
Pascal Thivent
Just wrap your documents in the XML needed... it's the same with Solr (except that Solr has Tika for processing binary docs).
Mauricio Scheffer
If you have questions about Solr or Sphinx, I recommend that you post them as a real question instead of as comments...
Mauricio Scheffer
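
As a rough illustration of the "just wrap your documents" suggestion in the comments above, the sketch below renders plain-text documents as an xmlpipe2 stream using only the Python standard library. The function name `to_xmlpipe2` and the single full-text field name `content` are assumptions made for this example. The element names follow the xmlpipe2 format described at the link above, which also requires document IDs to be unique non-zero integers.

```python
from xml.sax.saxutils import escape, quoteattr


def to_xmlpipe2(docs):
    """Render (numeric_id, text) pairs as an xmlpipe2 document stream.

    The field name "content" is a hypothetical choice for this sketch.
    """
    lines = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        '<sphinx:field name="content"/>',
        "</sphinx:schema>",
    ]
    for doc_id, text in docs:
        # int() rejects non-numeric IDs early; quoteattr adds the quotes.
        lines.append("<sphinx:document id=%s>" % quoteattr(str(int(doc_id))))
        # escape() protects &, < and > inside the document text.
        lines.append("<content>%s</content>" % escape(text))
        lines.append("</sphinx:document>")
    lines.append("</sphinx:docset>")
    return "\n".join(lines)


print(to_xmlpipe2([(1, "the quick brown fox"), (2, "cats & dogs <together>")]))
```

A script like this would presumably be wired in as the command of an xmlpipe2 source in the Sphinx configuration; check the linked documentation for the exact settings.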
A: 

I think Lucene is an option. It should allow you to build a custom bag-of-words search engine.

Pascal Thivent
A: 

Whoosh is a pure Python (no C, no external database) indexer / search engine. Check out the documentation for more information. It does support stemming.

I tried it out on an XML dump of a MediaWiki instance and it seemed to work pretty well!

lost-theory
A: 

I wonder about MongoDB: http://www.mongodb.org/display/DOCS/Home

It seems like 'full-text search' may be what I'm after... and having additional fields to search on may be handy.

ericslaw