Given:
- 1 database per client (business customer)
- 5000 clients
- Clients have between 2 and 2000 users (average is ~100 users per client)
- 100k to 10 million records per database
- Users need to search those records often (it's the best way to navigate their data)
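To put the overall scale in one place, here is a rough back-of-envelope calculation using only the figures listed above. The constant and function names are just illustrative, and the user total is based on the stated average, so treat it as an estimate.

```python
# Figures taken directly from the list above.
CLIENTS = 5_000
AVG_USERS_PER_CLIENT = 100
MIN_RECORDS_PER_DB = 100_000
MAX_RECORDS_PER_DB = 10_000_000


def total_scale():
    """Return (total_users, min_total_records, max_total_records)."""
    total_users = CLIENTS * AVG_USERS_PER_CLIENT
    min_records = CLIENTS * MIN_RECORDS_PER_DB
    max_records = CLIENTS * MAX_RECORDS_PER_DB
    return total_users, min_records, max_records


users, low, high = total_scale()
print(f"{users:,} users, {low:,} to {high:,} records overall")
# prints: 500,000 users, 500,000,000 to 50,000,000,000 records overall
```

So any design that pools everything into one place has to cope with somewhere between half a billion and fifty billion records in total.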
Possibly relevant info:
- Several new clients each week (any time during business hours)
- Multiple web servers and database servers (users can log in via any web server)
- Let's stay agnostic of programming language and SQL vendor, since Lucene (and Solr) are supported from many languages
For Example:
Joel Spolsky said in Podcast #11 that his hosted web app product, FogBugz On-Demand, uses Lucene. He has thousands of on-demand clients. And each client gets their own database.
They use an index per client and store it in the client's database. I'm not sure of the details, or whether this requires a serious modification to Lucene.
The Question:
How would you set up Lucene search so that each client can only search within its own database?
How would you set up the index(es)?
Where do you store the index(es)?
Would you need to add a filter to all search queries?
If a client cancels, how would you delete their index (or their part of a shared index)? (This may be trivial; I'm not sure yet.)
Possible Solutions:
Make an index for each client (database)
- Pro: Search is faster (than the one-index-for-all method). Each index is proportional to the size of that client's data.
- Con: I'm not sure what this entails, nor do I know if this is beyond Lucene's scope.
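To make the index-per-client option concrete, here is a minimal sketch of the bookkeeping I imagine it would need, assuming each client's index lives in its own directory under a shared base path. It contains no Lucene calls at all; `ClientIndexRegistry`, its methods, and the client ID format are hypothetical names I made up for illustration.

```python
import re
import shutil
import tempfile
from pathlib import Path

# Hypothetical rule: client IDs are short lowercase slugs, so an ID can never
# point outside the base directory (no "../" tricks).
_CLIENT_ID = re.compile(r"[a-z0-9_-]{1,64}")


class ClientIndexRegistry:
    """Maps each client to its own index directory under one base path."""

    def __init__(self, base_dir):
        self.base_dir = Path(base_dir)

    def index_dir(self, client_id):
        if not _CLIENT_ID.fullmatch(client_id):
            raise ValueError(f"invalid client id: {client_id!r}")
        return self.base_dir / client_id

    def provision(self, client_id):
        """Create the (empty) index directory for a new client."""
        path = self.index_dir(client_id)
        path.mkdir(parents=True, exist_ok=True)
        return path

    def cancel(self, client_id):
        """Remove the client's whole index; return True if one existed."""
        path = self.index_dir(client_id)
        if not path.exists():
            return False
        shutil.rmtree(path)
        return True


# Usage: two clients sign up, then one cancels.
base = tempfile.mkdtemp()
registry = ClientIndexRegistry(base)
acme = registry.provision("acme")
registry.provision("globex")
registry.cancel("acme")
print(sorted(p.name for p in Path(base).iterdir()))  # ['globex']
```

If it really is this simple, onboarding a client is creating one directory and cancelling is deleting one, with no query filter involved. What I don't know is how this holds up with 5000 such directories and multiple web servers opening them.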
Have a single, gigantic index with a database_name field. Always include database_name as a filter.
- Pro: Not sure. Maybe good for tech support or billing dept to search all databases for info.
- Con: Search is slower (than the index-per-client method). Security is compromised if the query filter is ever left off.
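For the single-index option, the security concern could perhaps be reduced by never letting calling code build an unfiltered query. The sketch below is a toy in-memory stand-in, not Lucene: `SharedIndex` and `ClientSearcher` are hypothetical names, and the whitespace tokenizing is deliberately naive. The point is only that the `database_name` filter is applied inside the search path rather than by each caller.

```python
from collections import defaultdict


class SharedIndex:
    """Toy stand-in for one big index; every document carries database_name."""

    def __init__(self):
        self._docs = {}                    # doc_id -> (database_name, text)
        self._postings = defaultdict(set)  # term -> set of doc_ids

    def add(self, doc_id, database_name, text):
        self._docs[doc_id] = (database_name, text)
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def _search(self, term, database_name):
        # The tenant filter is applied here, not by the caller.
        hits = self._postings.get(term.lower(), set())
        return sorted(d for d in hits if self._docs[d][0] == database_name)

    def delete_database(self, database_name):
        """Drop every document belonging to one client; return the count."""
        doomed = [d for d, (db, _) in self._docs.items() if db == database_name]
        for doc_id in doomed:
            _, text = self._docs.pop(doc_id)
            for term in text.lower().split():
                self._postings[term].discard(doc_id)
        return len(doomed)


class ClientSearcher:
    """The only search entry point handed to request-handling code."""

    def __init__(self, index, database_name):
        self._index = index
        self._database_name = database_name

    def search(self, term):
        return self._index._search(term, self._database_name)


index = SharedIndex()
index.add(1, "client_a", "Invoice overdue")
index.add(2, "client_b", "Invoice paid")
index.add(3, "client_a", "Shipping label")

print(ClientSearcher(index, "client_a").search("invoice"))  # [1]
print(ClientSearcher(index, "client_b").search("invoice"))  # [2]
```

With this shape, a cancelled client would be handled by something like `index.delete_database("client_a")`, which I assume corresponds to a delete-by-term on the `database_name` field in a real index.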
One last thing:
I would also accept an answer that uses Solr (the search server built on Lucene). Perhaps it's better suited to this problem; I'm not sure.