views:

86

answers:

4

Hi everyone,

I have an ordered list (a dictionary - 100K words) and many words to seach on this list frequently. So performance is an issue. I know that a HashSet.contains(theWord) or Collections.binarySearch(sortedList, theWord) are very fast. But I am actually not looking for the whole word.

What I want is let's say searching for "se" and getting all the words starts with "se". So is there a ready to use solution in Java or any libraries?

A better example: On a sorted list a quick solution for the following operation

List.subList (String beginIndex, String endIndex) // returns the interval

myWordList.subList(“ab”, “bc”);

Note: Here is a very similar question but accepted answer is not the satisfying. http://stackoverflow.com/questions/2015374/overriding-hashsets-contains-method

+2  A: 

The Trie structure is very well suited for dictionaries and finding words with common prefixes. There is a contribution of a Trie implementation in Google Collections/Guava.

mdma
I checked that out. It looks fine. However I couldn't compile the code. It is depented to some other packages which makes things more complicated. I'll just modify a binary search implementation on strings.
hrzafer
+2  A: 

What you're looking for here is a data structure commanly called a 'trie':

http://en.wikipedia.org/wiki/Trie

It stores strings in a tree indexed by prefix, where the first level of the tree contains the first character of the string, the second level the second character, etc. The result is that it allows you to extract subsets of very large sets of strings by prefix extremely quickly.

David Given
+2  A: 

There's really no big need for new structures: problem can be solved by binary search on your list. In particular, you can modify binary search to return first matching element (first element with specified prefix).

List.subList (String beginIndex, String endIndex) // returns the interval
I may be stupid, but what kind of index has string type? Can you clarify this part?

Nikita Rybak
I just wanted to explain the problem in terms of known methods such as List.subList (int beginIndex, int endIndex)
hrzafer
@hrzafer So, what meaning those parameters have? Is it string prefix and suffix?
Nikita Rybak
Yes, string prefix.
hrzafer
+1  A: 

Your search result will be a range from your ordered word list. To get that, you need the index of the first and the last element of the range.

To get the first, run a binary search with the original search string ("se"), comparing it to the current position in each iteration. Stop when the word at the current position is greater than the search string, but the current-1 th word is lower.

To get the last index, run another binary search on the search term+"z" ("sez"), but now stop only when the word at the current index is smaller than "sez" but current+1 is greater.

Finally return the range marked by the first and last index by whatever means that are available in your programming language.

This method is built on two assumptions:

  • String comparison sees "b" greater than "az"
  • "z" is the highest char value among the list of words

I have this algorithm implemented in a JavaScript data manipulation library (jOrder.net).

Dan Stocker
You really should use Character.MAX_VALUE rather than "z", but besides that, this post pretty much sums it up. Depending on exactly what you're doing, when I've had a problem like this I usually binary search on the prefix, then process using "while (value.get(x).startsWith(prefix))", rather than trying to return a range.
Jay
I totally agree with you on the Character.MAX_VALUE part, but given the 100k magnitude, isn't it better to consider performing log(N) (N=dictionary length) additional string comparisons rather than K (K=result set length)?
Dan Stocker