I have implemented the following data structure:
class Node(object):
    """Rules:
    A node's children are ONLY an iterable of nodes.
    A leaf node must NOT have children and MUST have a word.
    """
    def __init__(self, tag, children=None, word=u""):
        assert isinstance(tag, unicode) and isinstance(word, unicode)
        self.tag = tag
        self.word = word
        self.parent = None  # Set by recursive function
        # Avoid a shared mutable default: every node gets its own list.
        self.children = list(children) if children is not None else []
        for child in self.children:
            child.parent = self

    def matches(self, node):
        """Match RECURSIVELY down!"""
        # WILDCARD is assumed to be a module-level unicode constant defined elsewhere.
        if self.tag == node.tag:
            # zip stops at the shorter child list, so extra children are ignored.
            if all(a.matches(b) for a, b in zip(self.children, node.children)):
                if self.word != WILDCARD and node.word != WILDCARD:
                    return self.word == node.word
                else:
                    return True
        return False

    def __unicode__(self):
        childrenU = u", ".join(map(unicode, self.children))
        return u"(%s, %s, %s)" % (self.tag, childrenU, self.word)

    def __str__(self):
        return unicode(self).encode('utf-8')

    def __repr__(self):
        # In Python 2, __repr__ should return a byte string, not unicode.
        return str(self)
So a tree is basically a bunch of these nodes connected together.
I am parsing S-expressions like this one: (VP (VP (VC w1) (NP (CP (IP (NP (NN w2)) (VP (ADVP (AD w3)) (VP (VA w4)))) (DEC w5)) (NP (NN w6)))) (ADVP (AD w7)))
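A minimal sketch of reading such an expression into nested `(tag, children, word)` tuples, written for Python 3 and independent of the `Node` class above. The function name `parse_sexpr` and the tuple layout are assumptions made for illustration only.

```python
import re


def parse_sexpr(text):
    """Parse '(TAG (CHILD ...) ...)' or '(TAG word)' into (tag, children, word)."""
    # Tokens are '(' , ')' or any run of characters without spaces or parentheses.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def peek():
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        return tokens[pos]

    def parse_node():
        nonlocal pos
        if peek() != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        tag = peek()
        if tag in ("(", ")"):
            raise ValueError("expected a tag at token %d" % pos)
        pos += 1
        children, word = [], ""
        while peek() != ")":
            if peek() == "(":
                children.append(parse_node())  # nested subtree
            else:
                word = peek()                  # leaf word
                pos += 1
        pos += 1  # consume ')'
        return (tag, children, word)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the first expression")
    return tree


example = parse_sexpr("(VP (ADVP (AD w3)) (VP (VA w4)))")
```

Internal nodes get an empty word and leaves get an empty child list, which mirrors the rules in the `Node` docstring.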
I am interested in matching a subtree pattern against a bigger tree. The catch is that the subtree can contain wildcards in place of words, and I want to capture the word each wildcard matches.
For example, given the subtree
(VP
    (ADVP (AD X))
    (VP (VA Y)))
the operation that matches it against the tree above should return { X: w3, Y: w4 }
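To make the expected behaviour concrete, here is a naive sketch of that operation in Python 3. It works on plain `(tag, children, word)` tuples rather than `Node` objects so that it is self-contained. The names `match_at` and `find_matches`, and the convention of passing the wildcard names as a set, are assumptions for illustration.

```python
def match_at(pattern, node, wildcards):
    """Return wildcard bindings if pattern matches exactly at node, else None."""
    p_tag, p_children, p_word = pattern
    n_tag, n_children, n_word = node
    if p_tag != n_tag or len(p_children) != len(n_children):
        return None
    bindings = {}
    if p_word in wildcards:
        bindings[p_word] = n_word          # wildcard captures the node's word
    elif p_word != n_word:
        return None
    for p_child, n_child in zip(p_children, n_children):
        sub = match_at(p_child, n_child, wildcards)
        if sub is None:
            return None
        for name, value in sub.items():
            # A wildcard used twice must capture the same word both times.
            if bindings.setdefault(name, value) != value:
                return None
    return bindings


def find_matches(pattern, tree, wildcards):
    """Yield bindings for every node of tree at which pattern matches."""
    result = match_at(pattern, tree, wildcards)
    if result is not None:
        yield result
    for child in tree[1]:
        yield from find_matches(pattern, child, wildcards)


# Part of the example tree: (IP (NP (NN w2)) (VP (ADVP (AD w3)) (VP (VA w4))))
tree = ("IP", [
    ("NP", [("NN", [], "w2")], ""),
    ("VP", [("ADVP", [("AD", [], "w3")], ""),
            ("VP", [("VA", [], "w4")], "")], ""),
], "")
pattern = ("VP", [("ADVP", [("AD", [], "X")], ""),
                  ("VP", [("VA", [], "Y")], "")], "")

matches = list(find_matches(pattern, tree, {"X", "Y"}))
print(matches)
```

This tries the pattern at every node, so the cost is roughly the tree size times the pattern size. Unlike the `zip`-based `matches` method above, it also requires both nodes to have the same number of children.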
Can anyone recommend an efficient, simple solution?