views:

420

answers:

4

I wonder if anyone knows the (optimal?) algorithm for longest recurring non-overlapping sub string.

For example, in the string

ABADZEDGBADEZ

the longest recurring would be "BAD". Incidentally if there is no such result, the algorithm should alert that such a thing has occurred. My guess is that this involves suffix trees.

I'm sure this must exist somewhere already. Thanks for the help!

+3  A: 

Since you're applying this to music, you're probably not looking for 100% matches; there will be expected discrepencies from one instance of the theme to another. You might try looking up gene sequence analysis algorithms - there's lots of this variety of stuff going on there. Try BLAST or other alignment algorithms.

You could also try the WinEPI family of algorithms, as well as various prefix tree implementations of this specific result set (these results allow gaps in the substring; for instance, ABCID and AFBCD both contain ABCD). I actually have a paper on an algorithm that might be worth looking at if you would be interested; I would need to attain distribution authorization, so let me know.

Note that this is actually a VERY complicated subject (to do efficiently) for large datasets.

Source: Research for a year or two on a comparable (theme-detection) topic.

Walt W
If you could grant me access, it would be appreciated!
Brandon Pelfrey
I think he's applying this to the lyrics, so I think he is looking for 100% matches.
las3rjock
@Brandon - I've requested permission, I'll see what I can do.@las3rjock - Not really. For instance:I was a silly manI was silly, manExample theme: Past-tense silliness. Strings are not an exact match. Plus, a lot of lyrics have different punctuation and whatnot. So I'm not sure he is looking for exact match.
Walt W
Bah formatting. Example was "I was a silly man" versus "I was silly, man"
Walt W
@Brandon - Turns out there are no distribution clauses, so I'll post a link tonight :)
Walt W
Yeah, about the lyrics thing. I'm talking about literally classifying music by notes. This has nothing to do with lyrics. You can transcode musical notes into a string of symbols. Thanks for checking out the distribution rights! =]
Brandon Pelfrey
Here you go - http://lamegame.no-ip.org/public/PrefixTrees_WWWKWDistributable.pdf Note that the algorithms section is section 4, and the pseudo code there's pretty explicit. Feel absolutely free to ask additional questions about the content or how you might tweak it to work best. Just as a heads up, you'd be operating with an occurrence of a word as a "feature" probably, and for your "time" component just use integers that denote the position in the sentence. Hopefully that makes sense after you look at the paper.
Walt W
+3  A: 

Suffix array is a good data structure for solving that problem.

There is a solution to this problem in Programming Pearls by Jon Bentley.

Nick D
Where in programming pearls?
Emil H
@Emil H, **15.2 Phrases**
Nick D
+1  A: 

A simple algorithm is to do this:

  • Create a data structure representing slices of a string; implement comparison / sorting as appropriate for the language
  • Create a list of every slice of the string starting with [first-character, last-character], [second-character, last-character], up to [last-character, last-character]
  • Sort this list - O(n log n)
  • This brings all string slices with common prefixes together. You can then iterate through the list, comparing each pair to see how many characters they share at the start, and if it's larger than your maximum then take a note of it and set it as the new maximum

(As the other reply just posted indicates, I'm describing a suffix array here.)

Barry Kelly
This is still rather brute force. I'm wondering if there's a slightly more elegant approach? I believe a tree-based approach would preserve the structural information as well as provide some sort of depth information that can be quickly traversed to provide longest length information.
Brandon Pelfrey
With an appropriate sort - see the suffix array wikipedia article - the running time is O(n log n) in the worst case and usually better. The iteration is O(n), so is dominated by sorting cost. The obvious brute force is going to be O(n^2) at least (comparing every possible pair). Building trees is likely going to cost a lot more memory, which will have adverse real-world effects on performance (consider cache etc.).
Barry Kelly
A: 

Okay, here's one crazy idea.

Create a function that determines if there is a recurring substring of length k in O(n) (where n is the length of the text). This can be done by using a rolling hash (see wikipedia for Rabin-Karp String Matching Algorithm) to generate all n hashes in linear time and using a hashtable/BST (or a map or dict or whatever your language calls it) to store the corresponding substring positions. Before inserting the current hash to the data structure, we check if we have seen it before. If it has been seen before, we simply look up the indices where this hash was generated and see if the corresponding substring is a non-overlapping match.

Now that we can find a k length substring in O(n) time, we simply run a binary search to find the largest k for which we can find a non-overlapping substring match using our function.

Code in Python follows

A=23
MOD=10007 # use a random value in the real world

def hsh(text):
    return sum([ord(c)*A**(len(text)-i-1) for i,c in enumerate(text)])%MOD

def k_substring(text, k):
    substrings={}
    cur=hsh(text[:k])
    pwr=(A**(k-1))%MOD
    substrings[cur]=[0]
    for i in xrange(k,len(text)):
     to_remove=(ord(text[i-k])*pwr)%MOD
     to_add=ord(text[i])
     cur-=to_remove
     if cur<0:
      cur+=MOD
     cur=(cur*A)%MOD
     cur=(cur+to_add)%MOD
     if cur in substrings:
      lst=substrings[cur]
      for index in lst:
       if index+k<=i-k+1 and text[index:index+k]==text[i-k+1:i+1]:
        return index
      lst.append(i-k+1)
     else:
      substrings[cur]=[i-k+1]
    return -1

def longest_substring(text):
    hi,lo=len(text),0
    while hi>lo:
     mid=(hi+lo+1)>>1
     if k_substring(text,mid)==-1:
      hi=mid-1
     else:
      lo=mid
    index=k_substring(text,lo)
    return text[index:index+lo]

(Sorry if this is unclear. It is 6:30am here and I really need to go back to bed :) )

MAK