views:

154

answers:

2

I am working on a tree library, and part of the required functionality, is to be able to search a node for child nodes that match a pattern.

A 'pattern' is a specification (or criteria) that lays out the structure, as well as attributes of nodes in the subtree(s) to be matched.

For example, suppose a tree represents data regarding a particular species of bird. Further assume that the nodes of such a tree have the following attributes:

  • location
  • sex
  • wingspan
  • weight
  • brood_size

Given a parent node, I would like to issue a search in plain English thus:

"Fetch me all male birds that are descendants of this bird, and live in XXX city and have a weight > 100g. Any such bird found should also have at least 2 brothers and one sister, and must itself have at least one child"

< note >

Just to clarify, I do not expect to be able to query using plain English as I have done above. I only used the "plain English query" to illustrate the type of matching I would like to be performing on the tree. I fully expect to use symbols for the matching (as opposed to plain text) in practice.

< /note >

I am thinking of possibly using a regex type pattern matching to match trees. One way would be to have a string representation of each node, so I could use a normal regex - but this is likely of be quite inefficient, as there will be a lot of repeated data - i.e. string representation of child nodes will be supersets of their parent representation, which will be supersets of their parents representational string, and so on, recursively, up the tree - this could very easily become unwieldy for event modestly sized trees - there has to be a better way.

Is anyone aware of an algorithm that will allow me to select nodes (subtrees) in a node, based on a pattern?

Although I asked for a general algorithm, I am implementing this in Python. any snippets that further illustrate such an algorithm (if one can indeed be written), would be immensely useful.

+3  A: 

What's wrong with writing a Lisp Sexpression with wildcards to describe the tree match? Parentheses group a node. Elements from left to right match the root followed by the children. Subtree matches use nested Sexpressions to describe the subtree.

The following would match a tree with arbitrary root node, first child being a leaf A, third child being a subtree rooted with X, first child 1 and third child A:

(?root A ? (X 1 A))

This idea isn't unique to me; the Lisp guys have been writing such patterns since the early sixties.

Here's a LISP pattern matcher (as an example you wanted) that only goes back 20 years: http://norvig.com/paip/patmatch.lisp

However, coding this yourself is pretty easy. This is typically assigned as a homework exercise for people learning LISP.

Ira Baxter
@Ira: thanks!. Its reassuring to know it can be done, and indeed has been done for 20 years... Now to try to implement that in Python :)
morpheous
@morpheous: I'd say "has been done for nearly 50 years".
Ira Baxter
+1 When I read Your requirements (@ morpheus) I fought that sounds like a well suited problem for sql and the lisp approach is what comes closest to that "in software" (without using a db).
Dave
+1  A: 

This depends on your tree. If your tree is rooted and ordered, you should be able to check for an exact match in sublinear time, and if not, you should be able to check for a match in linear time. Several faster algorithms also exist for approximate matching.

For finding material and algorithms for topics like this, Google Scholar is your friend. A search for subtree matching or similar should get you there.

EDIT: Judging by your updated entry, I suggest you take a look at how XPath and similar query languages are implemented. XML is a rooted tree, and XPath can search for sub trees in that tree with complex matching operators like the ones in your example.

I also advice you not to implement this on your own, but rather use an existing library (like PyLucene or some other search engine, which seems appropriate given the example you put out).

Håvard S
@havard: +1 for the link. never heard of google scholar before this...
morpheous
@havard: yeah, I don't believe in reinventing the wheel. Infact I have been playing with the idea of looking at how XPath works ... The node passed to the function fetch_matching_subtrees() IS the root node as far as the function is concerned. I hadn't thought about the Lucene engine though. Food for thought ...
morpheous