Hi,

I've got about 2500 short phrases in a file. I want to be able to find phrases as I type possible substrings of them. My app has a text box and a list of phrases. The text box is initially empty and the list contains all 2500 phrases, since the empty string is a substring of all of them. As I type in the text box, the list updates so that it only contains phrases that contain the text box's value as a substring.

At the moment I have one of Google's Multimaps, specifically:

LinkedHashMultimap<String, String>

with every single possible substring mapped to its possible matches. This takes about a second to load, and I think it must be taking up quite a bit of space (which may be a concern in the future). Lookups are very fast, though.
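
A simplified sketch of the kind of thing I mean (not my exact code; the class and method names are made up, and it assumes Guava as above):

    import java.util.Collection;
    import java.util.List;

    import com.google.common.collect.LinkedHashMultimap;
    import com.google.common.collect.SetMultimap;

    class SubstringIndex {
        private final SetMultimap<String, String> index = LinkedHashMultimap.create();

        SubstringIndex(List<String> phrases) {
            // Map every non-empty substring of every phrase to that phrase.
            for (String phrase : phrases) {
                for (int i = 0; i < phrase.length(); i++) {
                    for (int j = i + 1; j <= phrase.length(); j++) {
                        index.put(phrase.substring(i, j), phrase);
                    }
                }
            }
        }

        // The empty string is handled separately (it matches everything).
        Collection<String> lookup(String typed) {
            return index.get(typed);
        }
    }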

Is there a way I could do this with some other data structure or strategy that would be quicker to load and take less space (possibly at the expense of the speed of the lookups)?

+4  A: 

You'll want to look into using the Trie data structure.

matt b
I think a trie won't help if the substrings can start at any offset within a phrase.
Michael Borgwardt
@Michael you just need to put all of the substrings (in practice, every suffix) of each phrase into the trie, including ones that don't start at the beginning of the phrase.
Justin Peel
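
For illustration, a rough sketch of the suffix-into-trie idea from the comments (untested; the class and method names are made up):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    class SuffixTrie {
        private static class Node {
            final Map<Character, Node> children = new HashMap<Character, Node>();
            final Set<String> phrases = new HashSet<String>();
        }

        private final Node root = new Node();

        SuffixTrie(List<String> phrases) {
            for (String phrase : phrases) {
                // Inserting every suffix makes substrings starting at any offset findable.
                for (int start = 0; start < phrase.length(); start++) {
                    Node node = root;
                    for (int i = start; i < phrase.length(); i++) {
                        char c = phrase.charAt(i);
                        Node child = node.children.get(c);
                        if (child == null) {
                            child = new Node();
                            node.children.put(c, child);
                        }
                        node = child;
                        node.phrases.add(phrase);
                    }
                }
            }
        }

        // Phrases containing 'typed' as a substring.
        // (The empty string needs to be special-cased by the caller to return all phrases.)
        Set<String> matches(String typed) {
            Node node = root;
            for (int i = 0; i < typed.length(); i++) {
                node = node.children.get(typed.charAt(i));
                if (node == null) {
                    return new HashSet<String>();
                }
            }
            return node.phrases;
        }
    }

Like the multimap, this trades a lot of memory for fast lookups; a real suffix tree compresses the single-child chains.
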
+2  A: 

You definitely need a suffix tree (wiki).

(I think this implementation could be OK: link)

EDIT:

I've read your comment: you shouldn't blindly check whether the string is a substring anywhere in your phrase, because a user usually starts typing at a word, not at a space. So maybe it's better to tokenize each phrase into words?

Are you allowed to do that? Otherwise the best way is to build an automaton for every phrase, or to use similar algorithms (for example the Karp-Rabin string search algorithm).

Jack
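
For illustration, a sketch of the tokenized version; it matches word prefixes rather than arbitrary substrings (untested; names are made up):

    import java.util.List;
    import java.util.NavigableMap;
    import java.util.Set;
    import java.util.SortedMap;
    import java.util.TreeMap;
    import java.util.TreeSet;

    class WordIndex {
        // word -> phrases containing that word
        private final NavigableMap<String, Set<String>> byWord =
                new TreeMap<String, Set<String>>();

        WordIndex(List<String> phrases) {
            for (String phrase : phrases) {
                for (String word : phrase.toLowerCase().split("\\s+")) {
                    if (word.isEmpty()) {
                        continue;
                    }
                    Set<String> bucket = byWord.get(word);
                    if (bucket == null) {
                        bucket = new TreeSet<String>();
                        byWord.put(word, bucket);
                    }
                    bucket.add(phrase);
                }
            }
        }

        // Phrases containing a word that starts with the typed prefix.
        Set<String> matches(String prefix) {
            prefix = prefix.toLowerCase();
            // All keys in [prefix, prefix + '\uffff') start with the prefix.
            SortedMap<String, Set<String>> range =
                    byWord.subMap(prefix, prefix + Character.MAX_VALUE);
            Set<String> result = new TreeSet<String>();
            for (Set<String> bucket : range.values()) {
                result.addAll(bucket);
            }
            return result;
        }
    }
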
+3  A: 

Try simply looping over the entire list and calling contains() - doing that 2500 times is probably completely unnoticeable.
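
Something along these lines (untested sketch; the names are made up):

    import java.util.ArrayList;
    import java.util.List;

    class PhraseFilter {
        static List<String> matching(List<String> phrases, String typed) {
            List<String> matches = new ArrayList<String>();
            for (String phrase : phrases) {
                if (phrase.contains(typed)) {  // every phrase contains the empty string
                    matches.add(phrase);
                }
            }
            return matches;
        }
    }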

Michael Borgwardt
he puts "every single possible substring mapped to its possible matches" in the map
matt b
You're quite right, of course. No need for any heavy machinery at all. Thanks.
NellerLess
+3  A: 

If your list only contains 2500 elements, a simple loop and checking contains() on all of them should be fast enough.

If it grows bigger and/or is too slow, you can apply some easy optimizations:

  • Don't search immediately as the user types each character, but introduce some delay. So if he types "foobar" really fast, you only search for "foobar", not first "f", then "fo", then "foo", and so on.
  • Reuse your previous results: if the user first types "foo" and then extends that to "foobar", don't search in the whole original list again, but search inside the results for "foo" (because everything that contains "foobar" must contain "foo").

In my experience, these basic optimizations already get you quite far.
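
A sketch of the second optimization (untested; the class name is made up):

    import java.util.ArrayList;
    import java.util.List;

    class IncrementalFilter {
        private final List<String> allPhrases;
        private String lastQuery = "";
        private List<String> lastResults;

        IncrementalFilter(List<String> allPhrases) {
            this.allPhrases = allPhrases;
            this.lastResults = allPhrases;
        }

        List<String> filter(String typed) {
            // If the old query is still contained in the new one, every match for the
            // new query is among the old results, so only those need to be scanned.
            List<String> candidates =
                    typed.contains(lastQuery) ? lastResults : allPhrases;
            List<String> results = new ArrayList<String>();
            for (String phrase : candidates) {
                if (phrase.contains(typed)) {
                    results.add(phrase);
                }
            }
            lastQuery = typed;
            lastResults = results;
            return results;
        }
    }

For the delay, if this is a Swing app, a javax.swing.Timer with setRepeats(false) that you restart() on every keystroke is one common way to search only once the typing pauses.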

Now, if the list grows so big that even that is too slow, some "smarter" optimizations as proposed in other answers here (tries, suffix trees,...) would be needed.

Wouter Coekaerts
A: 

Wouter Coekaerts has a good approach, but I would go a bit further.

  1. Don't bring up anything when the textbox contains a single character. The results won't be useful. You may find that this is true for two characters as well.
  2. Precompute the results for two characters. When there are two characters bring up the precomputed list.
  3. When a third character is added do the 'contains' search on the list you have currently displayed (anything that doesn't contain c1c2 can't contain c1c2c3). By now the list should be small enough that 'contains' has perfectly adequate performance.
  4. Similarly for four characters etc.
  5. As said above, put in a little delay before starting the search. Or better still arrange for a search to be killed if another character is typed before it finishes.
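
Roughly, combining points 2-4 (untested sketch; the names are made up):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    class TwoCharIndex {
        // two-character substring -> phrases containing it
        private final Map<String, Set<String>> byPair =
                new HashMap<String, Set<String>>();

        TwoCharIndex(List<String> phrases) {
            for (String phrase : phrases) {
                for (int i = 0; i + 2 <= phrase.length(); i++) {
                    String pair = phrase.substring(i, i + 2);
                    Set<String> bucket = byPair.get(pair);
                    if (bucket == null) {
                        bucket = new LinkedHashSet<String>();
                        byPair.put(pair, bucket);
                    }
                    bucket.add(phrase);
                }
            }
        }

        List<String> matches(String typed) {
            List<String> results = new ArrayList<String>();
            if (typed.length() < 2) {
                return results;  // point 1: show nothing for fewer than two characters
            }
            // Point 2: start from the precomputed list for the first two characters.
            Set<String> candidates = byPair.get(typed.substring(0, 2));
            if (candidates == null) {
                return results;
            }
            // Points 3-4: narrow with contains() as further characters are added.
            // (Narrowing the currently displayed list instead would be even cheaper.)
            for (String phrase : candidates) {
                if (phrase.contains(typed)) {
                    results.add(phrase);
                }
            }
            return results;
        }
    }
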
DJClayworth