Any recommendations for a small, lightweight bag-of-words search engine?
I have a set of 'documents' that are each basically a small bag of arbitrary words. Given a new document, I need to get a list of 'similar' documents, along with a weight indicating how similar each one is. Documents are likely to be small: a couple of paragraphs at most.
- Stemming would be nice but is not strictly required.
- Word expansion (e.g. via WordNet) is not required.
- Open source or freeware preferred, as this is a prototype, not a full-blown project.
- Unix/Linux platform preferred.
I'd be using it as a subcomponent: I expect only to feed it documents, each with an ID, and later search for documents 'similar' to one I currently have.
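To make the interface I have in mind concrete, here is a rough sketch in Python using only the standard library. The class and method names (`BagOfWordsIndex`, `add`, `similar`) are made up for illustration, and cosine similarity over raw word counts is only my assumption about what 'similar' could mean; it does no stemming or term weighting.

```python
import math
import re
from collections import Counter


class BagOfWordsIndex:
    """Hypothetical interface: add(doc_id, text), then similar(text)."""

    def __init__(self):
        self.docs = {}  # doc_id -> Counter of word frequencies

    @staticmethod
    def _bag(text):
        # Lowercase and keep runs of letters/digits; no stemming.
        return Counter(re.findall(r"[a-z0-9]+", text.lower()))

    @staticmethod
    def _cosine(a, b):
        # Cosine similarity between two word-count vectors.
        dot = sum(count * b[word] for word, count in a.items() if word in b)
        norm_a = math.sqrt(sum(c * c for c in a.values()))
        norm_b = math.sqrt(sum(c * c for c in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    def add(self, doc_id, text):
        self.docs[doc_id] = self._bag(text)

    def similar(self, text, limit=10):
        # Return (doc_id, weight) pairs, best match first,
        # skipping documents that share no words with the query.
        query = self._bag(text)
        scored = ((doc_id, self._cosine(query, bag))
                  for doc_id, bag in self.docs.items())
        ranked = sorted((s for s in scored if s[1] > 0),
                        key=lambda s: s[1], reverse=True)
        return ranked[:limit]


index = BagOfWordsIndex()
index.add("a", "the quick brown fox")
index.add("b", "lazy dogs sleep all day")
index.add("c", "a quick brown dog")
results = index.similar("quick brown fox")
for doc_id, weight in results:
    print(doc_id, round(weight, 3))
```

Something with this shape (IDs in, ranked IDs with weights out) is all I need; a linear scan like this is obviously fine only for a small prototype corpus.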