views:

820

answers:

6

Input:

1) A huge sorted array of string SA;

2) A prefix string P;

Output:

The index of the first string matching the input prefix if any. If there is no such match, then output will be -1.

Example: SA = {"ab", "abd", "abdf", "abz"} P = "abd" The output should be 1 (index starting from 0).

What's the most algorithm way to do this kind of job?

A: 

The FreeBSD kernel use a Radix tree for its routing table, you should check that.

Keltia
+2  A: 

This is just a modified bisection search:

  • Only check as many characters in each element as are in the search string; and
  • If you find a match, keep searching backwards (either linearly or by further bisection searches) until you find a non-matching result and then return the index of the last matching result.
cletus
+1  A: 

It can be done in linear time using a Suffix Tree. Building the suffix tree takes linear time.

Brian
He could compare the prefix to each element in linear time too.
Ryan Fox
True, but if he's going check more than one prefix, that's not a good idea.
Brian
A: 

My current solution in mind is, instead of to find the "prefix", try to find a "virtual prefix".

For example, prefix is “abd", try to find a virtual-prefix “abc(255)". (255) just represents the max char number. After locating the "abc(255)". The next word should be the first word matching "abd" if any.

Morgan Cheng
+9  A: 

If you only want to do this once, use binary search, if on the other hand you need to do it for many different prefixes but on the same string array, building a radix tree can be a good idea, after you've built the tree each look up will be very fast.

Best case lookup time for a radix tree is O(n) where n is the number of characters, while binary search takes O(log m) where m is the size of the list. It's not necessarially true that radix search will be faster.
Nick Johnson
Arachnid: This is not true. Even in the best case (where there is a match), binary search is O(n + log m), since it must read the entire prefix to check if it matches. In the worst case it is O(nlogm).
Brian
@Arachnid: +1 for Brian's comment, plus O(n) is WORST case lookup for radix tree, where there is a match and you need to read the whole prefix.
Alabaster Codify
Well, a binary search is almost trivial to implement, and takes no additional memory. Radix trees are complicated to implement, take extra memory, and the benefit isn't huge
FryGuy
A: 

Are you in the position to precalculate all possible prefixes?

If so, you can do that, then use a binary search to find the prefix in the precalculated table. Store the subscript to the desired value with the prefix.

EvilTeach