views:

154

answers:

2

I am currently implementing a BK-Tree to make a spell checker. The dictionary I am working with is very large (millions of words), which is why I cannot afford any inefficiencies at all. However, I know that the lookup function that I wrote (arguably the most important part of the entire program) can be made better. I was hoping to find some help regarding the same. Here's the lookup that I wrote:

public int get(String query, int maxDistance)
{
    calculateLevenshteinDistance cld = new calculateLevenshteinDistance();
    int d = cld.calculate(root, query);
    int tempDistance=0;

    if(d==0)
        return 0;

    if(maxDistance==Integer.MAX_VALUE)
        maxDistance=d;

    int i = Math.max(d-maxDistance, 1);
    BKTree temp=null;

    for(;i<=maxDistance+d;i++)
    {
        temp=children.get(i);
        if(temp!=null)
        {
            tempDistance=temp.get(query, maxDistance);
        }
        if(maxDistance<tempDistance)
            maxDistance=tempDistance;
    }

    return maxDistance;
}

I know that I am running the loop an unnecessarily large number of times and that we can trim the search space to make the lookup faster. I'm just not sure how to best do that.

A: 

I am not too sure about the efficiency of your lookup. As some of the features like root selection is not clear, whether you are going to read all the million words every time you startup etc..

This is an alternative way to implement spelling checker

As per the blog link you shared... I have some suggestions ...

In order to index and search our dictionary, we need a way to compare strings...

Use Apache Lucene for that. It provides a way to index entries and stores it on the disk.

To build the tree from a dictionary, take an arbitrary word and make it the root of your tree

No need of doing this. Lucene provides various types of queries (TermQuery, BooleanQuery, FuzzyQuery etc) to perform that.

Facility for fuzzy search is required

Yup it is there too (Lucene Fuzzy Search)

Index creation is a one time activity in lucene (assuming that your dictionary is not dynamic) so you don't have to read the dictionary each time. Lucene will just read its optimized index.

Other than Lucene you can refer to this article for writing a good spell checker. The original article is in python but there are some java version available or you can write one. It talks about all the problems and some possible solutions for a good spell checker.

Favonius
The OP isn't asking about full text indexing, they're asking about spell checking.
Nick Johnson
I cannot use Lucene or anything similar, everything has to be done by my code. Also, I need to read the dictionary every time I start up.
efficiencyIsBliss
A: 

Your loop looks generally correct, if a little byzantine. Your attempt to refine the stopping condition (with tempdistance/maxdistance) is incorrect, however: the structure of the BK-tree requires that you explore all nodes within levenshtein distance d-k to d+k of the current node if you want to find all the results, so you can't prune it like that.

What makes you think you're exploring too much of the tree?

You may find my followup post on Levenshtein Automata instructive, as they're more efficient than BK-trees. If you're building a spelling checker, though, I'd recommend following Favonius' suggestion and checking out this article on how to write one. It's much better suited to spelling correction than a naive string-distance check.

Nick Johnson
@Nick I was aware of the d-k to d+k part and I implemented it, but it gave me incorrect results, which was why I got rid of it completely. That's why I was so sure that I wasn't trimming the search space efficiently. Could you explain that part a bit more here? Do the d and k remain constant or do they change with every iteration down the tree?
efficiencyIsBliss
'k' is the threshold, and remains constant. 'd' is the distance between the search term and the current node, and depends on the node you're evaluating.
Nick Johnson
To reduce the search space, can we change k to mirror the minimum distance found so far? If we know that the first word we looked at was at a distance of 5 from our word, then there is no point in looking at words that may be at a distance of 6 or higher, right?
efficiencyIsBliss