views:

184

answers:

3

Consider the following site:

http://maps.google.com

It has a main text input, where the user can type businesses, countries, provinces, cities, addresses and zip codes. I wonder what the best way to implement a search like this is. I suspect that Google Maps uses a full-text search with all kinds of data in the same table, and that it may also have a parser which classifies the input (e.g. as numeric, like zip codes and coordinates, or textual, like businesses and addresses).

With the data spread across many tables and systems, a parser is essential. The parser could be built from regular expressions, or with AI tools like artificial neural networks and genetic algorithms.
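
To make the regular-expression option concrete, here is a minimal sketch of such a classifier in Python. The formats it recognises (a decimal "lat, lon" pair and a 5-digit zip code) and the name `classify` are assumptions for illustration only; real input would need many more cases.

```python
import re

# Assumed formats: a "lat, lon" decimal pair and a 5-digit zip code.
COORD_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*$")
ZIP_RE = re.compile(r"^\s*\d{5}\s*$")

def classify(query):
    """Return 'coordinates', 'zip' or 'text' for a raw search string."""
    match = COORD_RE.match(query)
    if match:
        lat, lon = float(match.group(1)), float(match.group(2))
        # Only accept pairs that fall inside valid latitude/longitude ranges.
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            return "coordinates"
    if ZIP_RE.match(query):
        return "zip"
    # Everything else (businesses, addresses, place names) is treated as text.
    return "text"

print(classify("53.4808, -2.2426"))  # coordinates
print(classify("Argos Manchester"))  # text
```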

Which approach would you recommend?

+1  A: 

I'd have the data in one database. If the data got too big, or I knew it would be huge, I'd assign an id to each business, address, etc., then have other tables which reference this data.

Regular Expressions would only be necessary if the user could define what they want to search for:

business: Argos

But then what happens if they want an Argos in Manchester (sorry, I'm English)? You could get the location of the user based on their IP, but what happens if they say:

business: Argos Scotland

Now you don't know if the company name has two words, or if there is a location next to it. All of this has to be taken into consideration.

P.S. Sorry if that made no sense.

James Brooks
I do not intend to teach the user a syntax in order to use my form. But I will upvote your answer because I may use your solution in the future (in another application).
Jader Dias
+3  A: 

It might be best to aggregate the data from all of your tables into a search index. Lucene is a free search engine, similar to how Google's search engine works (inverted index), and it should allow you to search by any of those values or any combination of them with relative ease.

http://lucene.apache.org/java/docs/

Lucene comes with its own query language (again, very similar to Google's or any other Internet search site's syntax). The only drawback of using something like Lucene is that you would need to build its index. You wouldn't be querying your database directly (which could get very complicated; inverted indexes are pretty much designed for what you're trying to do), so you need to periodically gather up new information from your database and add it to your index. It might also be necessary to rebuild your index to remove unneeded data.

With Lucene, you get a pretty flexible query syntax that most people are familiar with (because pretty much everyone searches the internet), it performs very well, and it is not terribly complicated. By using Lucene, you avoid the hit of using regular expressions (which are not the most performant text searching mechanism), and you don't have to write your own parser. It should be a win-win, aside from a small learning curve to build a Lucene index generator and figure out how to query that index.
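
To illustrate the inverted-index idea only (this is not Lucene's API, just a toy in plain Python), here is a minimal sketch. The sample records and the names `build_index` and `search` are made up for the example; a real index would also handle stemming, ranking and incremental updates.

```python
from collections import defaultdict

# Hypothetical records standing in for rows gathered from several tables.
RECORDS = {
    1: "Argos Manchester Market Street",
    2: "Argos Glasgow Scotland",
    3: "Scotland Road Liverpool",
}

def build_index(records):
    """Map each lowercase token to the set of record ids containing it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for token in text.lower().split():
            index[token].add(record_id)
    return index

def search(index, query):
    """Return ids of records that contain every token in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

index = build_index(RECORDS)
print(search(index, "argos scotland"))  # {2}
```

Because every token points straight at the records that contain it, the query never has to know whether "Scotland" came from a business name, a street or a country column.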

jrista
I've been considering Lucene, but also other full-text search engines (e.g. SQL Server full-text index and Oracle Text). But I'll count your answer as a vote for the full-text-index-only approach.
Jader Dias
I've used SQL Server full-text index, and it leaves a lot to be desired. It has limited querying capability, bound by the FREETEXT and CONTAINS functions, and quite often provides very quirky and inconsistent results. If you have a tremendous amount of information (hundreds of thousands to millions of rows), it seems to work better; with anything less, its free-text engine has a lot of trouble. Even with large volumes, Lucene provides a much more accurate index. As for Oracle, I couldn't say anything there, as I've never used their text indexing.
jrista
A: 

You will need to preprocess the query before doing a full-text search on it. If you are using a GIS database, then you will already have columns like city, areacode, country, etc. Convert your query into tokens separated on spaces or commas, or both. Then check each token against the individual columns to see which ones match. This way you will know which part of the query is the city, the areacode, etc.
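
A rough sketch of that token-matching step in Python. The `COLUMNS` lookup sets are hypothetical stand-ins for the distinct values of the GIS columns; in practice each check would be a query against the real column.

```python
import re

# Hypothetical lookup sets standing in for the distinct values of GIS columns.
COLUMNS = {
    "city": {"manchester", "glasgow", "london"},
    "country": {"scotland", "england"},
    "areacode": {"560001", "110001"},
}

def tokenize(query):
    """Split the query on runs of spaces and/or commas."""
    return [t for t in re.split(r"[,\s]+", query.strip()) if t]

def label_tokens(query):
    """Pair each token with the names of the columns whose values it matches."""
    labelled = []
    for token in tokenize(query):
        matches = [name for name, values in COLUMNS.items()
                   if token.lower() in values]
        labelled.append((token, matches))
    return labelled

print(label_tokens("Argos, Glasgow Scotland"))
# [('Argos', []), ('Glasgow', ['city']), ('Scotland', ['country'])]
```

Tokens that match no column (here "Argos") are the ones left over for the full-text search.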

You could also try some naive approximation approaches. For example, 6 consecutive digits will probably be an area code. Look for common words like "road", "restaurant", "street", etc., which will be part of many queries, and then use some approximation to figure out what the user is looking for. Hope this helps.
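
As a small illustration of those heuristics, here is a sketch in Python. The keyword sets and the function name `guess_intent` are assumptions for the example; a real system would probably derive its word lists from actual queries.

```python
import re

# Hypothetical keyword lists for illustration only.
ADDRESS_WORDS = {"road", "street"}
BUSINESS_WORDS = {"restaurant"}

def guess_intent(query):
    """Return a rough guess at what the query is looking for."""
    text = query.lower()
    # Exactly six digits in a row, not part of a longer number.
    if re.search(r"(?<!\d)\d{6}(?!\d)", text):
        return "areacode"
    words = set(re.findall(r"[a-z]+", text))
    if words & ADDRESS_WORDS:
        return "address"
    if words & BUSINESS_WORDS:
        return "business"
    return "unknown"

print(guess_intent("12 Brigade Road"))     # address
print(guess_intent("Italian restaurant"))  # business
```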

Ritesh M Nayak