views:

111

answers:

6

Hi,

Is it a good idea to store words of a dictionary with 100.000 words in a static array of string. I'm working on spellchecker and I thought that way would be faster.

+5  A: 

Definitely its not a good idea to store so many strings as an array especially if you are using it for spell check which means you will have to search for and compare strings. It would make it inefficient to search or compare a string through the array as it would always be a linear search

Gopi
Almost anyone would alphabetize and binary search on an array.
sje397
+1: Big O is no always relevant, but when n = 100k, O(n) equals "slow as hell".
delnan
rightly said @delnan !
Gopi
100k isn't all that large an N; the problem isn't a linear search through 100k entries, but that each entry is going to require a string comparison.+1 to using a HashSet, at this scale. Consider using either a Trie, a B+Tree, a ISAM-Tree, or a String-BTree once you start looking at an order or two of magnitude larger than this.
Recurse
A: 

I think 100 000 is not so large amount that search wolud be inefficent. Of course it depends ... It would work nice if you are checking if a word exists in array - it's a linear complexity algorithm. You can keep table ordered so you can use quicksort search algoritm and make it more efficent.

On the other hand - if you wold like to find, 5 most likely words (using N-gram method or something) you should consider using Lucene or other text database.

peperg
+5  A: 

You should generally prefer a Java Collections Framework class to a native Java array for anything non-trivial. In this particular case, what you have is a Set<String> (since no words should appear more than once in the dictionary).

A HashSet<String> offers constant time performance for the basic operations add, remove, and contains, and should work very well with String hashcode formula.

For larger dictionaries, you'd want to use more sophisticated data structures specialized for storing a set of strings (e.g. a trie), but for 100K words, a HashSet should suffice.

See also

polygenelubricants
+1 for mentioning Trie, since it's an efficient data structure to use within the mentioned problem field, i.e. spell checking.
Giu
For trie implementation in Java: http://stackoverflow.com/questions/623892/where-do-i-find-a-standard-trie-based-map-implementation-in-java
polygenelubricants
A: 

Perhaps using an SQLite database would be more efficient ? I think that's what firefox/thunderbird does for spell checking but I'm not entirely sure.

Xeross
A: 

You won't be able to store that amount of strings in a static variable. Java has a size limit for static code and even method bodies. Simply use a flatfile and read it upon class instanciation - Java is faster than most people think with those things.

See http://stackoverflow.com/questions/2546470/enum-exeeding-the-65535-bytes-limit-of-static-initializer-whats-best-to-do.

Daniel Bleisteiner
dear downvoter: although I agree that this is not a very elegant answer, I do think that a downvote to a serious (non-troll) answer deserves a comment.
seanizer
+1  A: 

How about an approach with in memory database technology like for example sqlite inmemory This allows you to use efficient querying without disk overhead

dwergkees