views:

1327

answers:

7

Wondering the most efficent way to make a binary search tree into a spell checker by reading in say 1000 word dictionary file and then having it check another document that say has a couple paragraphs.

+5  A: 

a ternary tree trie would be more efficient

Steven A. Lowe
A: 

If you need to do an auto suggest/prefix search as well, then a patricia tree or radix tree is worth looking at.

seanb
A: 

With the example you gave, performance is likely to be irrelevant, since on a PC the whole operation will take approximately 1% of the time it takes the user to read the first result you show, provided you don't use a completely stupid algorithm. But still, I'll assume the problem is big enough that performance is an issue.

If the dictionary file is presorted (as most are), and if the text is small relative to the dictionary as you describe, then I would be sorely tempted to sort the text, perhaps removing duplicates, and then iterate through both lists side-by-side using the same procedure as a merge sort, except you report whether each text word is in the dictionary instead of outputting a merged list.

This does the job in about M log M comparisons for the sort, plus at most N + M comparisons for the iteration, (possibly less, but not complexity-less). That's fairly close to optimal complexity for a one-off operation: to get rid of the linear term in N you need to find ways to not read the whole dictionary from disk at all. I'm pretty sure it's possible to bsearch into the file, especially given that words are quite short, but for small N it's anyone's guess whether seeking about the place will actually be faster than serially accessing the data.

It has the following characteristics:

  • You don't have to hold the dictionary in memory, only the text.
  • Nevertheless, you only make one pass over the dictionary file.
  • You don't do any expensive processing of the dictionary.

Of course if the dictionary file isn't pre-sorted then this doesn't work, and if you can keep the dictionary hanging around in memory for the next spellcheck operation then you can amortise the cost of I/O and of processing it into a tree across several different texts, which will be a win in the long run.

If the dictionary is really huge, then you might benefit from storing it on disk in a pre-processed form equivalent to an unbalanced tree weighted according to the relative frequencies of the various words in your language. Then you can do less than O(N) disk access for small texts, and on most OSs not bother loading it into memory at all, just mmap the file and let the OS worry about it. For a large dictionary, the entire clusters containing words beginning with "dimethyl" need never be touched.

Another consideration is a splay tree for the dictionary. A splay tree unbalances itself as you look things up in it, in order to make frequently-used values quicker to find. Most text uses a small number of words repeatedly, so if the text is long enough to justify the overhead this will win eventually.

Both of the above are subject to Steven A Lowe's point that for strings, a trie beats a normal tree. Don't know whether you'll find an off-the-shelf splay trie, though.

Steve Jessop
I think you've completely overanswered this poor student.
Barry Kelly
Yes, I now see that the assignment is to use a binary search tree, irrespective of what other structures are available. Still, it might be worth knowing some of the real concerns even though the situation is artificial: I certainly don't object to the "trie" answer, even though that's no use here.
Steve Jessop
+1  A: 

If you're just trying to see if a particular word exists in your dictionary (that is, it's spelt correctly), then I don't think a binary search tree is what you're after. A better way to store that information would be in a tree style where each successive node on your tree is one character, and reading the path to the end node gives you the spelling of that word. You'd also need to add a marker to indicate a word-ending.

Eg: say your dictionary has these words: car, cart, cat, cup, cut

- C
  - A
    - R
      - end
      - T
    - T
      - end
  - U
    - P
      - end
    - T
      - end

Checking if a word exists is a matter of looking at each letter individually, and that it exists in the children of the current node.

Check for "cat"
Does "C" exist at the root level? Yes, move to the next letter.
Does "A" exist underneath C? Yes, move on.
Does "T" exist underneath A? Yes, move on.
Is there a word ending after the T? Yes. Word exists.

Check for "cu"
Does "C" exist at the root level? Yes, move to the next letter.
Does "U" exist at the root level? Yes, move to the next letter.
Is there a word ending after the U? No. Word does not exist.

How you store this information is up to you. As Steven pointed out, a Ternary Search Trie might be the way to go: each node would have 27 possible child nodes.

nickf
nice recommendation
MikeJ
+1  A: 

Are you dead-set on using a binary search tree? A Bloom filter would probably be a more efficient data structure.

mipadi
A: 

ya i have to use a binary search tree this is what i have so far but im having problems lol least to say our teacher one of those that says go read you book and im more of a visual learner. But any ways here is what i have so far in my


.h

#pragma once

class IntBinaryTree

{

private:

struct TreeNode

{

char* value; // The value in the node

TreeNode *left; // Pointer to left child node

TreeNode *right; // Pointer to right child node

};

int Size;

TreeNode *root; // Pointer to the root node

// Private member functions

void insert(TreeNode *&, TreeNode *&);

void destroySubTree(TreeNode *);

void deleteNode(char*, TreeNode *&);

void makeDeletion(TreeNode *&);

int NumberNodes(TreeNode *);

int Height(TreeNode *);

public:

// Constructor

IntBinaryTree(){ root = NULL; }

// Destructor

~IntBinaryTree(){ destroySubTree(root); }

// Binary tree operations

void Insert(char*);

void Delete(char*);

bool Search(char*);

int NumberNodes(){NumberNodes(root); return Size;};

int Height(){Height(root); return Size;};

bool TreeBalanced();

};

.cpp


// Implementation file for the IntBinaryTree class

#include

#include "BinarySearchTree.h"

using namespace std;

//********************************************************* // insert accepts a TreeNode pointer and a pointer to a node. * // The function inserts the node into the tree pointed to by * // the TreeNode pointer. This function is called recursively. * //*********************************************************

void IntBinaryTree::insert(TreeNode *&nodePtr, TreeNode *&newNode) { if (nodePtr == NULL) nodePtr = newNode; // Insert the node. else if (newNode->value < nodePtr->value) insert(nodePtr->left, newNode); // Search the left branch else insert(nodePtr->right, newNode); // Search the right branch }

//****************************************************** // Insert creates a new node to hold num as its value, * // and passes it to the insert function. * //******************************************************

void IntBinaryTree::Insert(char* num) { TreeNode *newNode; // Pointer to a new node.

// Create a new node and store num in it. newNode = new TreeNode; newNode->value = num; newNode->left = newNode->right = NULL;

// Insert the node. insert(root, newNode); }

//*********************************************** // destroySubTree is called by the destructor. It * // deletes all nodes in the tree. * //***********************************************

void IntBinaryTree::destroySubTree(TreeNode *nodePtr) { if (nodePtr) { if (nodePtr->left) destroySubTree(nodePtr->left); if (nodePtr->right) destroySubTree(nodePtr->right); delete nodePtr; } }

//*********************************************** // Search determines if a value is present in * // the tree. If so, the function returns true. * // Otherwise, it returns false. * //***********************************************

bool IntBinaryTree::Search(char* num) { TreeNode *nodePtr = root;

while (nodePtr) { if (nodePtr->value == num) return true; else if (num < nodePtr->value) nodePtr = nodePtr->left; else nodePtr = nodePtr->right; } return false; }

//****************************************** // Delete calls deleteNode to delete the * // node whose value member is the same as num. * //******************************************

void IntBinaryTree::Delete(char* num) { deleteNode(num, root); }

//**************************************** // deleteNode deletes the node whose value * // member is the same as num. * //****************************************

void IntBinaryTree::deleteNode(char* num, TreeNode *&nodePtr) { if (num < nodePtr->value) deleteNode(num, nodePtr->left); else if (num > nodePtr->value) deleteNode(num, nodePtr->right); else makeDeletion(nodePtr); }

//******************************************************* // makeDeletion takes a reference to a pointer to the node * // that is to be deleted. The node is removed and the * // branches of the tree below the node are reattached. * //*******************************************************

void IntBinaryTree::makeDeletion(TreeNode *&nodePtr) { // Define a temporary pointer to use in reattaching // the left subtree. TreeNode *tempNodePtr;

if (nodePtr == NULL) cout << "Cannot delete empty node.\n"; else if (nodePtr->right == NULL) { tempNodePtr = nodePtr; nodePtr = nodePtr->left; // Reattach the left child delete tempNodePtr; } else if (nodePtr->left == NULL) { tempNodePtr = nodePtr; nodePtr = nodePtr->right; // Reattach the right child delete tempNodePtr; } // If the node has two children. else { // Move one node the right. tempNodePtr = nodePtr->right; // Go to the end left node. while (tempNodePtr->left) tempNodePtr = tempNodePtr->left; // Reattach the left subtree. tempNodePtr->left = nodePtr->left; tempNodePtr = nodePtr; // Reattach the right subtree. nodePtr = nodePtr->right; delete tempNodePtr; } }

int IntBinaryTree::NumberNodes(TreeNode *nodePtr) { if (nodePtr==NULL) { return(0); } else { Size = NumberNodes(nodePtr->left) + 1 + NumberNodes(nodePtr->right); return Size; } }

int IntBinaryTree::Height(TreeNode *nodePtr) {

if(nodePtr == NULL){
 return 0;}
else{

int ldepth = Height(nodePtr->left) + 1; int rdepth = Height(nodePtr->right) + 1;

if(ldepth > rdepth) return ldepth; else return rdepth; }}

MaincCode.CPP

#include

#include

#include

#include

#include "BinarySearchTree.h"

using namespace std;

#define wrdlen 48

#define linelen 1024

int main() { FILE *ifp; //dictionary file FILE scfp; //file to be spellchecked char infile[wrdlen]; //"words.txt"; char Tinfile[wrdlen]; //"test.txt"; //test file int valid = 0; char newword[wrdlen]; int i = 0; char str[linelen]; char delims[] = {"~!@#$%^&()-_=+[]{}\|;:\'\",.<>/?\n\r\t "}; char *token;

//Creating Node

IntBinaryTree tree;

//ask the user for the dictionary file while(!valid) { printf("Please enter the name of the dictionary file you wish to access\n"); scanf("%s", infile); ifp = fopen(infile, "r");

     if(ifp == NULL)
       printf("sorry, could not find that file!\n");
     else
     {  
         valid = 1;
         printf("Reading file now.........\n");
     }
}   
valid = 0;
//read in all the words and place them into a BST
while(!feof(ifp))
{
    fscanf(ifp, "%s ", newword);
 tree.Insert(newword);
}  

//ask the user for the file to be spellchecked
while(!valid)
{

printf("what is the name of the file you would like to spellcheck?\n"); scanf("%s", Tinfile); scfp = fopen(Tinfile, "r"); if(scfp == NULL) printf("sorry, could not find that file!\n"); else {
valid = 1; printf("Reading file now.........\n"); } }

printf("The following words were not recognized: \n");   

for(i = 1; !feof(scfp); i++)
{
 fgets(str, linelen, scfp);

token = strtok(str, delims);

while( token != NULL ) { tree.Search(token); if (tree.Search(token)) cout << token <<"is found in the tree.\n"; else cout << token << "was not found in the tree.\n";
token = strtok( NULL, delims ); } }

tree.NumberNodes(); tree.Height(); return 0; }


It complies but i cant seem to get my NumberNoded or Height to return a value

and For some reason after it loads the dictionary file it just ends.

A: 

I realize that nickf's idea is very similar to what I wish to implement, anyone experienced able to provide me with some information on how to implement it? Thanks

edude05