I have a dataset of 200million+ records and am looking to build a dedicated backend to power a type ahead solution. Lucene is of interest given its popularity and license type, but I'm open to other open source suggestions as well. I am looking for advice, tales from the trenches, or even better direct instruction on what I will need as far as amount of hardware and structure of software. Requirements:
Must have:
- The ability to do starts with substring matching (I type in 'st' and it should match 'Stephen')
- The ability to return results very quickly, I'd say 500ms is an upper bound.
Nice to have:
- The ability to feed relevance information into the indexing process, so that, for example, more popular terms would be returned ahead of others and not just alphabetical, aka Google style.
- In-word substring matching, so for example ('st' would match 'bestseller')
Note:
- This index will purely be used for type ahead, and does not need to serve standard search queries.
- I am not worried about getting advice on how to set up the front end or AJAX, as long as the index can be queried as a service or directly via Java code.
Up votes for any useful information that allows me to get closer to an enterprise level type ahead solution