views:

689

answers:

6

Can anyone suggest me on what data structure to use for a soundex algorithm program? The language to be used is Java. If anybody has worked on this before in Java. The program should have these features: be able to read about 50,000 words should be able to read a word and return the related words having the same soundex

I don't want the program implementation just few advices on what data structure to use.

+1  A: 

I believe you just need to convert the original strings into soundex keys into a hashtable; the value for each entry in the table would be a collection of original strings mapping to that soundex.

The MultiMap collection interface (and its implementations) in Google Collections would be useful to you.

Jon Skeet
+2  A: 

TIP: If you use SQL as a databackend then you can let SQL handle it with the two sql-functions SOUNDEX and DIFFERENCE.

Maybe not what you wanted, but many people do not know that MSsql has those two functions.

Stefan
+2  A: 

Well soundex can be implemented in a straightforward pass over a string, so that doesn't require anything special.

After that the 4 character code can be treated as an integer key.

Then just build a dictionary that stores word sets indexed by that integer key. 50,000 words should easily fit into memory so nothing fancy is required.

Then walk the dictionary and each bucket is a group of similar sounding words.

Actually, here is the whole program in perl:

#!/usr/bin/perl
use Text::Soundex;
use Data::Dumper;
open(DICT,"</usr/share/dict/linux.words");
my %dictionary = ();
while (<DICT>) {
        chomp();
        chomp();
        push @{$dictionary{soundex($_)}},$_;
}
close(DICT);
while (<>) {
        my @words = split / +/;
        foreach (@words) {
            print Dumper $dictionary{soundex($_)};
        }
}
Edward Kmett
Why the double chomp? (Is it to remove both CR and LF in DOS files?)
Jonathan Leffler
Yeah, without it in DOS it would break otherwise.
Edward Kmett
Er, after writing this I realized that the question was tagged Java. But I'm leaving this here because I think its informative
Edward Kmett
A: 

Since soundex is a hash, I'd use a hash table, with the soundex as the key.

warren
Thanks warren for a precise answer but what abaout the value? a linked list? a binary search tree? an array? or something else? and what about hashmap instead of hash table?
javac
A hashtable *is* a data structure... or at least I was pretty sure it was :) .. with 50000 words, I think *which* structure you pick won't be too horribly important; if you were talking 5000000, then it certainly could be
warren
+1  A: 
class SpellChecker
{

  interface Hash {
    String hash(String);
  }

  private final Hash hash;

  private final Map<String, Set<String>> collisions;

  SpellChecker(Hash hash) {
    this.hash = hash;
    collisions = new TreeSet<String, Set<String>>();
  }

  boolean addWord(String word) {
    String key = hash.hash(word);
    Set<String> similar = collisions.get(key);
    if (similar == null)
      collisions.put(key, similar = new TreeSet<String>());
    return similar.add(word);
  }

  Set<String> similar(String word) {
    Set<String> similar = collisions.get(hash.hash(word));
    if (similar == null)
      return Collections.emptySet();
    else
      return Collections.unmodifiableSet(similar);
  }

}

The hash strategy could be Soundex, Metaphone, or what have you. Some strategies might be tunable (how many characters does it output, etc.)

erickson
A: 

you want a 4-byte integer.

The soundex algorithm always returns a 4-character code, if you use ANSI inputs, you'll get 4-bytes back (represented as 4 letters).

So store the codes returned in a hashtable, convert your word to the code and look it up in the hashtable. Its really that easy.

gbjbaanb