views:

265

answers:

4

Given a bunch of strings I need to find those which match 3 kinds of patterns:

  • Prefix search - abc*
  • Glob-like pattern - abc:*:xyz
  • Suffix search - *xyz

where * is a wildcard (and can match any number of chars).

Now the straight-forward solution is just to scan every string and see if it matches the target pattern. But this is O(n). If I stored the strings in a balanced search tree, I can do the prefix queries in O(log n). If I created one more tree with all the strings essentially reversed, I can do the the suffix queries in O(log n). Is there a clever way to search for the "abc:*:xyz" patterns efficiently?

+2  A: 

Wouldn't an intersection of the results from the other two queries give you exactly that? And since each of the results is O(log N) and an intersect over that result set is O(N) in the result-set size, wouldn't the total also be an O(log N) over the original problem?

jerryjvl
Intersect being O(N) is based on the assumption that the two result sets you retrieve from the original trees are also sorted ;)
jerryjvl
This is still O(n) in the worst case.
laalto
No, that's still O(N) since the result set size is still bounded bei N.
balpha
It may be more correct to say that the combined operation is of the same order as the 'prefix' and 'suffix' operations are. I assumed from the statement of O(log N) in his question that the typical result-set only has on the order of log N elements in it, otherwise he cannot do the prefix or suffix operations in that amount of time either.
jerryjvl
A: 

If you take the possibility of storing the strings in a search tree into account, why not also store the properties "starts with abc" and "ends with xyz", using these properties as keys?

Edit: You also might want to leave Big-O-Notation behind and rather concentrate on the actual expected search duration in your particular use case. That's probably a more realistic approach; O(f(n)) style grades for your algorithm / implementation probably don't give you much usable information when it comes to your real-life search efficiency.

balpha
Yes you're right. I shouldn't have mentioned big-O. I was concerned with only the average-case efficiency not the worst-case.
Harish
A: 

If "abc" and "xyz" are fixed values, you can maintain three counters with your collection indicating the number of strings:

  • starting with "abc" but not ending with "xyz".
  • not starting with "abc" but ending with "xyz".
  • starting with "abc" and ending with "xyz".

That gives you an O(1) time complexity for searching at the cost of extra calculation when inserting into, or deleting from, the collection.

If the "abc" and "xyz" are arbitrary strings, it's O(n) for all operations, including the "abc..." one. You only have to consider what happens when your collections consists of items that all start with "abc" to see this. That's not bounded by O(logN) at all since you have to process all items in the tree (both branches of every non-leaf node).

I think your ideal solution is to maintain the two ordered trees, one for the normal strings and one for the reversed strings. But don't worry about trying to do an intersection between the two. All you need to do is minimize the search space as much as practicable.

  • To find "abc...", use the normal tree to find the strings starting with that value.
  • To find "...xyz", use the reverse tree to find the strings ending with the reverse of that that value (zyx...).
  • To find "abc...xyz", use the normal tree to find the strings starting with that value and then filter out those that don't end in "xyz".

That way you don't have to worry about intersecting values between the two trees and you still get a performance improvement over the simplistic linear search.

paxdiablo
"abc" and "xyz" are arbitrary of course. Ok you're right about it not being O(log N). I should have stayed away from mentioning anything in Big-O. But in most cases you can definitely prune away certain branches of the tree.
Harish
+1  A: 

Generate rotations of each word and put each rotation in a suffix tree with indication of "rotation index'.

For example, to put the string "hello", put

hello, 0
elloh, 1
llohe, 2
lohel, 3
ohell, 4

Also, you put "hero" as

hero, 0
eroh, 1
rohe, 2
oher, 3

Also, you put "ohe" (don't ask me what's ohe)

ohe, 0
heo, 1
eoh, 2

Then, if you need to search for pattern "he*o", you need to rotate it until you get a prefixed string: "ohe*"

In the suffix tree you find the candidates: (ohell, 4), (oher, 3), (ohe, 0). Then you restore their original versions (by un-rotating them) and pick the right ones - "hello" and "hero".

Igor Krivokon
Harish