views:

311

answers:

4

Given a string s, what is the fastest method to generate a set of all its unique substrings?

Example: for str = "aba" we would get substrs={"a", "b", "ab", "ba", "aba"}.

The naive algorithm would be to traverse the entire string generating substrings in length 1..n in each iteration, yielding an O(n^2) upper bound.

Is a better bound possible?

(this is technically homework, so pointers-only are welcome as well)

+1  A: 

For big oh ... Best you could do would be O(n^2)

No need to reinvent the wheel, its not based on a strings, but on a sets, so you will have to take the concepts and apply them to your own situation.

Algorithms

Really Good White Paper from MS

In depth PowerPoint

Blog on string perms

Nix
Err, what? O(a^n)? What is a and where did you pull that result from?
IVlad
sorry that notation defines growth over time... i switched back
Nix
+2  A: 

There is no way to do this faster than O(n2) because there are a total of O(n2) substrings in a string, so if you have to generate them all, their number will be n(n + 1) / 2 in the worst case, hence the upper lower bound of O(n2) Ω(n2).

IVlad
We could hope for an output-sensitive algorithm that ran in O(# of unique strings)...
You mean a lower bound of Ω(n^2). :)
KennyTM
Suffix trees do it in O(n + L) time, where L is the number of unique substrings. For strings like 'aaaaa....aaaaa', L = O(n). So statement about Omega(n^2) is incorrect.
Moron
@Moron - I'm curious how suffix trees can solve this in O(n + L). Mind posting an algorithm?
IVlad
@IVlad: Just walk the whole suffix tree and print the paths/sub-paths as you go along. Wouldn't that be O(L)? Of course this assumes that printing a string is O(1) (for instance, we print only the begin and end indices). If we consider printing string x to take O(|x|) time, then yes, it is Omega(n^2). That you have considered that is not clear from your post, and I would guess your post actually implies an Omega(n^3) lower bound for that case.
Moron
@IVlad: Of course, I haven't tried proving the claim about suffix trees, but it seems right to me. Also we can consider print=generate in above comment, I guess.
Moron
+1  A: 

well, since there is potentially n*(n+1)/2 different substrings (+1 for the empty substring), I doubt you can be better than O(n*2) (worst case). the easiest thing is to generate them and use some nice O(1) lookup table (such as a hashmap) for excluding duplicates right when you find them.

greetz
back2dos

back2dos
It's `n(n+1)/2`. "abc" has 3*4/2 = 6 substrings ("a", "b", "c", "ab", "bc", "abc") not 3*2/2 = 3 substrings.
IVlad
Note that hash tables use hashcodes and equals which are O(n) for the length of the string.
fgb
oh, yes, sorry ... will fix that ... :)
back2dos
@fgb using a rolling hash for example you can get O(1) lookup.
IVlad
+3  A: 

As other posters have said, there are potentially O(n^2) substrings for a given string, so printing them out cannot be done faster than that. However there exists an efficient representation of the set that can be constructed in linear time: the suffix tree.

Rafał Dowgird
This is O(n+L) where L is number of unique substrings, I believe. So it is optimal.
Moron