I'll describe what I'm planning to do; I need pointers on how to go about building it.
I have millions of business records, and new businesses are added every day. Every time a new business is added, we need to determine whether that business already exists. We query our database and search for businesses with keywords matching those entered by the user. The query runs on multiple columns, and we return the 10 best matches, ranked by the number of tokens that match.
Example:
Existing information:
Listing 1:
Business Name: Spacely Space Sprockets
Address: Ring 325, Satellite 63, Outer Space, Galaxy X271
Listing 2:
Business Name: Fred Freaking Flasks
Address: #456, Bedrock, Stone Cave, Earth
Suppose my database contains the two records above. Now a user comes to add a new listing and enters:
Business Name: Space Ventura Quentin Tarantino
Address: God Father Street, Kill Stone, Outer Mafia, Folsom Prison
Now, my search would find that the new record has matches in the existing Listing 1 and Listing 2.
In Listing 1, the Business Name column matches one of the keywords (space) in the newly entered business name. The Address columns of both Listing 1 and Listing 2 each have one match with the newly entered address (Listing 1 has outer, while Listing 2 has stone).
Since Listing 1 has 2 matches in the newly entered data, Listing 1 would be displayed above Listing 2 as a duplicate suggestion.
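To make the ranking rule concrete, here is a minimal in-memory sketch of the scoring I have in mind. It is only an illustration, not something that would hold 10 to 50 million records: the `DuplicateFinder` class, the `tokenize` helper, and the choice to lowercase and split on non-alphanumeric characters are all my own assumptions, and each column is compared only against the same column of existing listings.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens (assumed rule)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class DuplicateFinder:
    """Hypothetical helper: one inverted index per column, token -> listing ids."""

    COLUMNS = ("name", "address")

    def __init__(self):
        self.index = {column: defaultdict(set) for column in self.COLUMNS}

    def add(self, listing_id, name, address):
        # Record which listings contain each token, separately for each column.
        for column, text in zip(self.COLUMNS, (name, address)):
            for token in tokenize(text):
                self.index[column][token].add(listing_id)

    def suggest(self, name, address, limit=10):
        # Score every existing listing by how many tokens it shares with the
        # new record, then return the best `limit` as (listing_id, score) pairs.
        scores = Counter()
        for column, text in zip(self.COLUMNS, (name, address)):
            for token in tokenize(text):
                for listing_id in self.index[column].get(token, ()):
                    scores[listing_id] += 1
        return scores.most_common(limit)


finder = DuplicateFinder()
finder.add(1, "Spacely Space Sprockets",
           "Ring 325, Satellite 63, Outer Space, Galaxy X271")
finder.add(2, "Fred Freaking Flasks", "#456, Bedrock, Stone Cave, Earth")

suggestions = finder.suggest(
    "Space Ventura Quentin Tarantino",
    "God Father Street, Kill Stone, Outer Mafia, Folsom Prison",
)
print(suggestions)  # [(1, 2), (2, 1)]
```

Listing 1 scores 2 (space in the name, outer in the address) and Listing 2 scores 1 (stone in the address), which is the ordering I described above.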
This is what I want to do. Please keep in mind that the data would be in the range of 10 to 15 million records to start with, and I expect it to reach 50 million over time. Your help would be greatly appreciated. Sorry about the long post!