I'm attempting to implement a Patricia trie with the methods addWord(), isWord(), and isPrefix() as a means to store a large dictionary of words for quick retrieval (including prefix search). I've read up on the concepts, but they just aren't translating into an implementation. I'd like to know (in Java or Python code) how to implement the trie, particularly the nodes (or should I implement it recursively?). I saw one person who implemented it with an array of 26 child nodes set to null/None. Is there a better strategy (such as treating the letters as bits), and how would you implement it?

+3  A: 

Someone else asked a question about Patricia tries a while ago and I thought about making a Python implementation then, but this time I decided to actually give it a shot (yes, this is way overboard, but it seemed like a nice little project). What I have made is perhaps not a pure Patricia trie implementation, but I like my way better. Other Patricia tries (in other languages) use just a list for the children and check each child to see if there is a match, but I thought this was rather inefficient, so I use dictionaries. Here is basically how I've set it up:

I'll start at the root node. The root is just a dictionary whose keys are single characters (the first letters of words) leading to branches. The value for each key is a two-item list: the first item is a string holding the rest of the characters along this branch of the trie, and the second item is a dictionary leading to further branches from this node. That dictionary also has single-character keys, each the first letter of the next part of the word, and the process continues down the trie.

Another thing I should mention is that if a given node has branches, but also is a word in the trie itself, then that is denoted by having a '' key in the dictionary that leads to a node with the list ['',{}].

Here's a small example that shows how words are stored (the root node is the variable _d):

>>> x = patricia()
>>> x.addWord('abcabc')
>>> x._d
{'a': ['bcabc', {}]}
>>> x.addWord('abcdef')
>>> x._d
{'a': ['bc', {'a': ['bc', {}], 'd': ['ef', {}]}]}
>>> x.addWord('abc')
>>> x._d
{'a': ['bc', {'a': ['bc', {}], '': ['', {}], 'd': ['ef', {}]}]}

Notice that in the last case, a '' key was added to the dictionary to denote that 'abc' is a word in addition to 'abcdef' and 'abcabc'.
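To make the storage layout concrete, here is a minimal sketch that walks the nested dictionaries and lists every stored word. The helper name iter_words is hypothetical (it is not part of the class below), and it relies only on the two conventions described above: a node with an empty child dictionary ends a word, and a '' key marks a word that ends at an internal node.

```python
def iter_words(d, prefix=''):
    """Yield every word stored under dictionary d (hypothetical helper)."""
    for key, (rest, children) in d.items():
        word = prefix + key + rest
        if not children:
            # A leaf (empty child dictionary) always ends a word.
            # This also covers the '' marker, whose entry is ['', {}].
            yield word
        else:
            # Internal node: a word ending here shows up through its '' child.
            yield from iter_words(children, word)


# The final state from the example session above.
trie = {'a': ['bc', {'a': ['bc', {}], '': ['', {}], 'd': ['ef', {}]}]}
print(sorted(iter_words(trie)))  # ['abc', 'abcabc', 'abcdef']
```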

Source Code

class patricia():
    """Compressed trie stored as nested dicts: {char: [rest_of_edge, children]}."""
    def __init__(self):
        self._d = {}  # root dictionary

    def addWord(self,w):
        """Insert w, splitting an existing edge where w diverges from it."""
        d = self._d
        i = 0
        while 1:
            try:
                node = d[w[i:i+1]]
            except KeyError:
                if d:
                    d[w[i:i+1]] = [w[i+1:],{}]
                else:
                    if w[i:i+1] == '':
                        return
                    else:
                        if i != 0:
                            d[''] = ['',{}]
                        d[w[i:i+1]] = [w[i+1:],{}]
                return

            i += 1
            if w.startswith(node[0],i):
                if len(w[i:]) == len(node[0]):
                    if node[1]:
                        try:
                            node[1]['']
                        except KeyError:
                            d = node[1]
                            d[''] = ['',{}]
                    return
                else:
                    i += len(node[0])
                    d = node[1]
            else:
                # Partial match: split this edge at the first differing character.
                ii = i
                j = 0
                while ii != len(w) and j != len(node[0]) and \
                      w[ii:ii+1] == node[0][j:j+1]:
                    ii += 1
                    j += 1
                tmpd = {}
                tmpd[node[0][j:j+1]] = [node[0][j+1:],node[1]]
                tmpd[w[ii:ii+1]] = [w[ii+1:],{}]
                d[w[i-1:i]] = [node[0][:j],tmpd]
                return

    def isWord(self,w):
        """Return True if w was added as a complete word."""
        d = self._d
        i = 0
        while 1:
            try:
                node = d[w[i:i+1]]
            except KeyError:
                return False
            i += 1
            if w.startswith(node[0],i):
                if len(w[i:]) == len(node[0]):
                    if node[1]:
                        try:
                            node[1]['']
                        except KeyError:
                            return False
                    return True
                else:
                    i += len(node[0])
                    d = node[1]
            else:
                return False

    def isPrefix(self,w):
        """Return True if some stored word starts with w."""
        d = self._d
        i = 0
        wlen = len(w)
        while 1:
            try:
                node = d[w[i:i+1]]
            except KeyError:
                return False
            i += 1
            if w.startswith(node[0][:wlen-i],i):
                if wlen - i > len(node[0]):
                    i += len(node[0])
                    d = node[1]
                else:
                    return True
            else:
                return False

    __getitem__ = isWord

You may have noticed that at the end I set __getitem__ to the isWord method. This means that

x['abc']

will return whether 'abc' is in the trie or not.
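The same aliasing trick would also work for __contains__, which is the hook Python uses for the in operator. The sketch below uses a hypothetical stand-in class (WordSet, backed by a plain set) purely to keep the example self-contained; the alias lines are the point, not the storage.

```python
class WordSet:
    """Hypothetical stand-in for the trie class; only isWord matters here."""

    def __init__(self, words):
        self._words = set(words)

    def isWord(self, w):
        return w in self._words

    __getitem__ = isWord   # enables x['abc']
    __contains__ = isWord  # enables 'abc' in x


x = WordSet(['abc'])
print(x['abc'], 'abc' in x, 'abd' in x)  # True True False
```

Membership tests tend to read more naturally with in, while indexing usually suggests fetching a stored value, so which alias to expose is a matter of taste.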

I think that maybe I should make a module out of this and submit it to PyPI, but it needs more testing and at least a removeWord method. If you find any bugs let me know, but it seems to be working pretty well. Also, if you see any big improvements in efficiency, I would like to hear about them. I've considered doing something about having empty dictionaries at the bottom of each branch, but I'm leaving it for now. These empty dictionaries could, for instance, be replaced with data linked to the word to expand the uses of the implementation.
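As one possible starting point for the missing removal method, here is a rough sketch written as a standalone function over the same nested-dictionary layout. The name remove_word and its signature are assumptions, not part of the code above, and it assumes the word is non-empty. After deleting, it merges a node that is left with a single child back into its parent edge, so that an emptied node is not later mistaken for a word.

```python
def remove_word(d, w, i=0):
    """Remove w from the nested-dict trie rooted at d; return True if it was stored.

    Hypothetical helper, not part of the class above. Assumes w is non-empty.
    """
    key = w[i:i+1]
    node = d.get(key)
    if node is None:
        return False
    rest, children = node
    i += 1
    if not w.startswith(rest, i):
        return False
    i += len(rest)
    if i == len(w):
        if not children:
            del d[key]        # leaf: drop the whole edge
            return True
        if '' not in children:
            return False      # w is only a prefix, not a stored word
        del children['']      # unmark the internal node as a word
    elif not remove_word(children, w, i):
        return False
    # Re-compress: an internal node must not be left with fewer than two children.
    if not children:
        del d[key]
    elif len(children) == 1:
        k, (r, c) = next(iter(children.items()))
        d[key] = [rest + k + r, c]
    return True


trie = {'a': ['bc', {'a': ['bc', {}], '': ['', {}], 'd': ['ef', {}]}]}
remove_word(trie, 'abcabc')
remove_word(trie, 'abc')
print(trie)  # {'a': ['bcdef', {}]}
```

Inside the class this would presumably be called as remove_word(self._d, w); that wiring is left out here to keep the sketch independent.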

Anyway, if you don't like the way I implemented it, at least maybe this will give you some ideas about how you would like to implement your own version.

Justin Peel
Some variable names are too long. Use t, n, and k instead of tmpd, node and ii.
John Machin
@John Yes, but I labeled those that way to make the code easier for others to figure out. I might make those sorts of changes for the final version. Thanks for the input.
Justin Peel
@Justin: Sorry, in my haste I left one out: wlen -> l
John Machin
@John Yes, these are all good for making the size of the .py file smaller (which wasn't really my goal at this point), but how does it really help efficiency?
Justin Peel
@Justin: Please have your irony detector checked out; it appears to be malfunctioning.
John Machin
@John Irony doesn't transmit well through text. You could have been someone truly trying to be helpful like most people on here and I try to respond nicely even if it isn't a great comment/idea. Please don't waste my time with something as stupid as this.
Justin Peel
@Justin, please see http://stackoverflow.com/questions/3121916/python-implementation-of-patricia-tries/3122502#3122502 -- there's a bug in this code of yours, which the OP encountered; I believe I found the bug and posted it in the answer there -- you may want to check my fix and either supply the correct one or confirm that mine is right (and in either case fix the code in this answer, too!-) -- thanks.
Alex Martelli
@Alex, well spotted. I've put in the bug fix. I'm not in a place where I can actually test this very easily at the moment since my motherboard fried this week (using my wife's at the moment and trying not to clutter her computer up while I find a replacement), but it is clearly the problem. I don't even know if this is really a decent implementation of a patricia trie, but it is simple and at least based on a patricia trie.
Justin Peel