tags:

views:

414

answers:

1

Hi I am trying to create a trie structure for an english to spanish word dictionary.

Here's what I have so far:

struct s_trie_node
{
    char * translation; /* NULL if node not a word */
    char * word;

    /* pointer array to child nodes */
    struct s_trie_node * children[UCHAR_MAX+1];
};

int add_word(const char * word, char * translation) { 
    /* TODO: add word to trie structure
    If word exists, append translation to existing string
    Be sure to store a copy of translation, since
    the string is reused by load_dictionary()
    */
    struct s_trie_node * curr = proot;
    char * currLetter = word;
    while (currLetter != '\0') {
        while ((curr -> children) != NULL) {
            char * currChildLetter = ((curr -> children) -> word);
            char * copyWord = word;
            while (copyWord == currChildLetter) {
                copyWord++;
                currChildLetter++;
            }
            if (currChildLetter == '\0') {
                curr = (curr -> children);
                break;
            }
            (curr -> children)++;
        }
        currLetter++
    }
}

I don't know where to go from here. Any help would be appreciated. Thanks!

+2  A: 

Well, I think you are taking too large a bite out with your add_word function. Try and break it down into smaller problems first. Once you solve the smaller problems, the larger problem will likely become easier.

First, we have to actually create Trie nodes (trying to do this in add_word would be ugly). So, let's now make a function that does this:

/* Allocates, initializes, and returns a new Trie node. The node will contain 
 * a copy of word and trans, rather than use them directly. The children array
 * will be initialized to all NULL's.
 */
struct s_trie_node * trie_node_create(const char * prefix, const char * trans)
{
    struct s_trie_node * n = malloc(sizeof(struct s_trie_node));
    int i;

    n->word = prefix ? strdup(prefix) : strdup("");
    n->translation = trans ? strdup(trans) : NULL;
    for (i = 0; i < UCHAR_MAX + 1; i++)
        n->children[i] = NULL;
    return n;
}

You should note that we are creating copies of the strings, rather than using them directly. This makes life easy on users of this Trie library, and also allows us to free them when they are no longer needed, without worry if the user is using them elsewhere. However, it is an important decision, as it means we are responsible for ensuring these strings are freed later on. Also, we are using strdup, which means we are making the assumption that the strings passed to us are "clean" (ie. terminated with a NULL character).

Anyways, so now we can create nodes. Let's move onto more Trie-related problems. Clearly, you are going to need to be able to find out the length of the common prefix of 2 strings. If you can't do this, you can't do anything else. So, we can use the following function:

/* Returns length of common prefix of v & w. */
int match(char * v, char * w)
{
    char * start = v;
    for (; *v && *v == *w; v++, w++);
    return v - start;
}

This is pretty basic, but is a necessity. When we compare a word against a prefix node, knowing the length of the common prefix will tell us whether it is an exact match or a partial match. Exact matches means we just need to update the node. Partial matches may result in the child node having to be "split" into 2 and will most likely mean we have to go further down the Trie. This idea of splitting nodes is crucial. If there is only one word in the list, eg. "hello", then there will only be 2 nodes: the root node (empty string) and the root node's only child "hello". If we now wish to add another word that shares a common prefix with "hello", eg. "hey", we will need to split "hello" into 2 nodes: "he", a child of the root node, and "llo", a child of "he". So, let's create a function that will handle splitting nodes for us:

/* Creates a new node that is a child of n. The word stored at n will be 
 * truncated after location (index into n->word), with the remaining suffix 
 * of the word belonging to the new child of n.
 */
struct s_trie_node * trie_node_split(struct s_trie_node * n, int location)
{
    struct s_trie_node * child;
    char * prefix;
    char * suffix;
    int len = strlen(n->word);

    if (location <= 0)
        return NULL;
    if (location >= len)
        return n;

    prefix = strndup(n->word, location);
    suffix = strndup(n->word + location, len - location);

    child = trie_node_create(suffix, n->translation);
    memcpy(child->children, n->children,
        sizeof(struct s_trie_node *) * UCHAR_MAX);
    free(n->word);
    n->word = prefix;
    n->translation = NULL;
    n->children[0] = child;
    n->children[1] = NULL;
    return n;
}

With the ability to find the length of the common prefix between 2 strings, create nodes, and split nodes, we have all the basic operations needed to manipulate and traverse our Trie.

Now, recursion often flows really well with Trie structures. So, pretend you are given a trie (the root node) and a word to match in the Trie. This word will either share a common prefix with one of our children or it won't. If it doesn't, then we can simply create a new node whose value is this word and add it to our list of children. However, if it does, then we run into several different cases, depending on how long the common prefix was.

Case 1: The word matches with our child exactly (that is, the words are the same). In this case, our child is an exact match to this word, and we can simply update the translation and stop (no need to create any new nodes).

Case 2: The word, in its entirety, is a prefix of our child. In this case, we need to split the child into 2 parts; the first being our word, the second being the rest of the word previously stored at our child. The first part becomes the new child, and we store the translation in it, the second part becomes a child of our child.

Case 3: Our child, in its entirety, is a prefix of the word. In this case, we remove the common prefix from word (shorten word to only the suffix). We then add the word's suffix to the sub-tree rooted at our child (ie. recurse).

Case 4: The common prefix is shorter than both words. In this case, we need to first split the child. The prefix becomes the new child, with the suffix as the child's child. We then remove the prefix from the word, and then add the rest of the word to the sub-tree rooted at our child (ie. recurse).

And that's all 4 cases. Armed with this, we can now easily write a function to handle each of these case, using recursion to traverse down the trie.

/* Add a translation to the Trie rooted at root. */
int trie_add_word(struct s_trie_node * root, char * word, char * trans)
{
    struct s_trie_node ** n;
    int loc;

    for (n = root->children; *n; n++) {
        /* Find length of longest common prefix. */
        loc = match(word, (*n)->word);

        if (!loc) {
            continue;

        } else {
            if (loc != strlen((*n)->word))
                trie_node_split(*n, loc);

            word += loc;
            if (!*word) {
                if ((*n)->translation)
                    free((*n)->translation);
                (*n)->translation = strdup(trans);
                return 0;
            }

            return trie_add_word(*n, word, trans);
        }
    }

    /* Failed to find any children that matched. */
    if (n - root->children >= UCHAR_MAX) {
        fprintf(stderr, "Ran out of room to store children in.");
        return -1;
    }
    *n = trie_node_create(word, trans);
    return 0;
}

And that's it! Long answer, I suppose, but it was fun.

tixxit
I should also mention, you should really store your children in a linked list; each node should have a pointer to its sibling and a pointer to its FIRST child only. This will make traversal easier and will also remove the UCHAR_MAX thing you have going on.
tixxit