views:

217

answers:

2

I'm now in the final stages of upgrading the hierarchy design in a major transactional system, and I have been staring for a while at this 150-line query (which I'll spare you all the tedium of reading) and thinking that there has got to be a better way.

A quick summary of the question is as follows:

How would you implement a hierarchical search that matches several search terms at different levels in the hierarchy, optimized for fastest search time?


I found a somewhat related question, but it's really only about 20% of the answer I actually need. Here is the full scenario/specification:

  • The end goal is to find one or several arbitrary items at arbitrary positions in the hierarchy.
  • The complete hierarchy is about 80,000 nodes, projected to grow up to 1M within a few years.
  • The full text of an entire path down the hierarchy is unique and descriptive; however, the text of an individual node may not be. This is a business reality, and not a decision that was made lightly.
  • Example: a node might have a name like "Door", which is meaningless by itself, but the full context, "Aaron > House > Living Room > Liquor Cabinet > Door", has clear meaning, it describes a specific door in a specific location. (Note that this is just an example, the real design is far less trivial)
  • In order to find this specific door, a user might type "aaron liquor door", which would likely turn up only one result. The query is translated as a sequence: An item containing the text "door", under an item containing the text "liquor", under another item containing the text "aaron."
  • Or, a user might just type "house liquor" to list all the liquor cabinets in people's houses (wouldn't that be nice). I mention this example explicitly to indicate that the search need not match any particular root or leaf level. This user knows exactly which door he is looking for, but can't remember offhand who owns it, and would remember if the name popped up in front of him.
  • All terms must be matched in the specified sequence, but as the above examples suggest, levels in the hierarchy can be "skipped." The term "aaron booze cabinet" would not match this node.
  • The platform is SQL Server 2008, but I believe that this is a platform-independent problem and would prefer not to restrict answers to that platform.
  • The hierarchy itself is based on hierarchyid (materialized path), indexed both breadth-first and depth-first. Each hierarchy node/record has a Name column which is to be queried on. Hierarchy queries based on the node are extremely fast, so don't worry about those.
  • There is no strict hierarchy - a root may contain no nodes at all or may contain 30 subtrees fanning out to 10,000 leaf nodes.
  • The maximum nesting is arbitrary, but in practice it tends to be no more than 4-8 levels.
  • The hierarchy can and does change, although infrequently. Any node can be moved to any other node, with the obvious exceptions (parent can't be moved into its own child, etc.)
  • In case this wasn't already implied: I do have control over the design and can add indexes, fields, tables, whatever might be necessary to get the best results.

My "dream" is to provide instant feedback to the user, as in a progressive search/filter, but I understand that this may be impossible or extremely difficult. I'd be happy with any significant improvement over the current method, which usually takes between 0.5s to 1s depending on the number of results.

For the sake of completeness, the existing query (stored procedure) starts by gathering all leaf nodes containing the final term, then joins upward and excludes any whose paths don't match with the earlier terms. If this seems backward to anyone, rest assured, it is a great deal more efficient than starting with the roots and fanning out. That was the "old" way and could easily take several seconds per search.

So my question again: Is there a better (more efficient) way to perform this search?

I'm not necessarily looking for code, just approaches. I have considered a few possibilities but they all seem to have some problems:

  • Create a delimited "path text" column and index it with Full-Text Search. The trouble is that a search on this column would return all child nodes as well; "aaron house" also matches "aaron house kitchen" and "aaron house basement".
  • Created a NamePath column that is actually a nested sequence of strings, using a CLR type, similar to hierarchyid itself. Problem is, I have no idea how Microsoft is able to "translate" queries on this type to index operations, and I'm not even sure if it's possible to do on a UDT. If the net result is just a full index scan, I've gained nothing by this approach.

It's not really the end of the world if I can't do better than what I already have; the search is "pretty fast", nobody has complained about it. But I'm willing to bet that somebody has tackled this problem before and has some ideas. Please share them!

+2  A: 

take a look at Apache Lucene. You can implement very flexible yet efficient searches using Lucene. It may be useful.

Also take a look at the Search Patterns - What you are describing may fit into the Faceted Search pattern.

It is quite easy implement a trivial "Aaron House Living Door" algorithm, but not sure the regular SVM/classification/entropy based algorithms would scale to a large data set. You may also want to look at the "approximation search" concepts by Motwani and Raghavan.

Please post back what you find, if possible :-)

srini.venigalla
I looked up the "approximation search" and the first result contained the text "NP-hard" - scary! Lucene looks promising; will take me a while to get through, and I'd still prefer a self-contained solution, but this opens up some possibilities, so +1 for your answer as well.
Aaronaught
+3  A: 
Dave
I'll definitely look into this, and +1 for the well-written and well-thought-out answer. I'm not sure how effective it'll be in practice because the leaf nodes are the least-selective (there may be 50,000 "doors"), and this would have to work with subtrees one-by-one as opposed progressively filtering a set. Still a good idea though!
Aaronaught
Thanks for the feedback Aaron. I've edited the post and think You should look into it.
Dave
@Dave - what's the difference between the improved version and the existing approach described in the question? Although the Big-O analysis is a valuable insight either way.
Aaronaught
In the second version you only do 1 binary search for the complete set of nodes, rather than one search per level in the tree. From my point of view, this makes it much easier to code the algorithm and it improves the run time.
Dave
+ As I read Your question both approaches seem pretty similar. That didn't occur to me in the first place, as You didn't give too many details on Your solution. Well, if they are really equal (You can judge better), then I think there is no better solution, than You have right now :-P
Dave
@Dave - Well described. However, I see some rigidity in the design. Information is not always viewed in A particular way. So, the data structure need to be made of loosely related attributes (may or may not be available) supporting searching with loosely specified inputs (may or may not be supplied).
srini.venigalla