views:

158

answers:

2

I've seen a few sites that list related searches when you perform a search, namely they suggest other search queries you may be interested in.

I'm wondering the best way to model this in a medium-sized site (not enough traffic to rely on visitor stats to infer relationships). My initial thought is to store the top 10 results for each unique query, then when a new search is performed to find all the historical searches that match some amount of the top 10 results but ideally not matching all of them (matching all of them might suggest an equivalent search and hence not that useful as a suggestion).

I imagine that some people have done this functionality before and may be able to provide some ideas of different ways to do this. I'm not necessarily looking for one winning idea since the solution will no doubt vary substantially depending on the size and nature of the site.

+2  A: 

I've tried a number of different approaches to this, with various degrees of success. In the end, I think the best approach is highly dependent on the domain/topics being searched, and how the users form queries.

Your thought about storing previous searches seems reasonable to me. I'd be curious to see how it works in practice (I mean that in the most sincere way -- there are many nuances that can cause these techniques to fail in the "real world", particularly when data is sparse).

Here are some techniques I've used in the past, and seen in the literature:

  1. Thesaurus based approaches: Index into a thesaurus for each term that the user has used, and then use some heuristic to filter the synonyms to show the user as possible search terms.
  2. Stem and search on that: Stem the search terms (eg: with the Porter Stemming Algorithm and then use the stemmed terms instead of the initially provided queries, and given the user the option of searching for exactly the terms they specified (or do the opposite, search the exact terms first, and use stemming to find the terms that stem to the same root. This second approach obviously takes some pre-processing of a known dictionary, or you can collect terms as your indexing term finds them.)
  3. Chaining: Parse the results found by the user's query and extract key terms from the top N results (KEA is one library/algorithm that you can look at for keyword extraction techniques.)
rcreswick
gosh, I wish I'd said that...
MikeJ
+2  A: 

have you considered a matrix of with keywords on 1 axis vs. documents on another axis. once you find the set of vetors representing the keywords, find sets of keyword(s) found in your initial result set and then find a way to rank the other keywords by how many documents they reference or how many times they interset the intial result set.

MikeJ
This is very similar to the first couple steps of LSI (http://en.wikipedia.org/wiki/Latent_semantic_indexing), it seems like it would probably work quite well. (see next comment, out of space)
rcreswick
It would be worthwhile considering how the initial search app does retrieval though, since this may be too similar to the standard vector-space approach to document retrieval (http://en.wikipedia.org/wiki/Vector_space_model). If it duplicates the same logic then the terms won't be of as much value.
rcreswick