views:

96

answers:

1

I have recently came up against this code lacking any comment. It finds minimal cyclic shift of word (this code specifically returns its index in string) and its called Duval algorithm. Only info I found describes algorithm in few words and has cleaner code. I would appreciate any help in understanding this algorithm. I have always found text algorithms pretty tricky and rather hard to understand.

int minLexCyc(const char *x) {
    int i = 0, j = 1, k = 1, p = 1, a, b, l = strlen(x);
    while(j+k <= (l<<1)) {
        if ((a=x[(i+k-1)%l])>(b=x[(j+k-1)%l])) {
            i=j++;
            k=p=1;
        } else if (a<b) {
            j+=k; 
            k=1; 
            p=j-i;
        } else if (a==b && k!=p) {
            k++;
        } else {
            j+=p; 
            k=1;
        }
    }
    return i;
}
+3  A: 

First, I believe that your code has a bug in it. The last line should be return p;. I beleve that i holds the index of the lexicographically smallest cyclic shift, and p holds the smallest shift that matches. I also think that your stopping condition is too weak, i.e. you are doing too much checking after you have found a match, but I am not sure exactly what it should be.

Note that i and j only advance and that i is always less than j. We are looking for a string that matches the string starting at i, and we are trying to match it with a string that starts at j. We do this by comparing the k'th character of each string while increasing k (as long as they match). Note that we only change i if we determine that the string starting at j is lexicographically less than the string starting at j, and then we set i to j and reset k and p to their initial values.

I do not have time for a detailed analysis, but it looks like

  1. i = the start of the lexicographic smallest cyclic shift
  2. j = the start of the cyclic shift we are matching against the shift starting at i
  3. k = the character in strings i and j currently under consideration (the strings match in positions 1 to k-1
  4. p = the cyclic shift under consideration (i believe p stands for prefix)

Edit Going further

this section of code

    if ((a=x[(i+k-1)%l])>(b=x[(j+k-1)%l])) {
        i=j++;
        k=p=1;

Moves the start of the comparison to a lexicographically earlier string when we find one and reinitializes everything else.

this section

   } else if (a<b) {
        j+=k; 
        k=1; 
        p=j-i;

is the tricky part. We have found a mismatch that is lexicographically later than our reference string, so we skip to the end of the text matched so far, and start matching from there. We also increase p (our stride). Why can we skip over all the starting points between j and j + k? This is because the string starting with i is the lexicographically smallest seen, and if the tail of the current j string is greater then the string at i then any suffix of the string at j will be greater than the string at i.

Finally

    } else if (a==b && k!=p) {
        k++;
    } else {
        j+=p; 
        k=1;

this just checks that the string of length p starting at i repeats.

*further edit We do this by incrementing k until k == p, checking that the k'th character of the string starting at i equals the k'th character of the string starting at j. Once k reaches p we start scanning again at the next supposed occurrence of the string.

Even further edit to attempt to answer jethro's questions.

First: the k != p in else if (a==b && k!=p) Here we have a match in that the k'th and all previous characters in the strings starting at i and j are equal. The variable p represents the length that we think that the repeating string is. When k != p, actually k < p, so we are ensuring that the p characters at the string beginning at i are the same as the p characters of the string beginning at j. When k == p (the final else) we should be at a point where the string starting at j + k looks the same as the string starting at j, so we increase j by p and set k back to 1 and go back to comparing the two strings.

Second: Yes, I believe you are correct, it should return i. I was misunderstanding the meaning of "Minimum Cyclic Shift"

deinst
+1: Seems helpful :-) Perhaps you should also mention the cases when k is > 1 (substrings at i and j match exactly at the first k positions I believe).
Moron
I figured that that was unneccessary, but hey, what the heck, I need to learn to be clearer.
deinst
@deinst. Thanks a lot for your time and explantation.Why do you think last statement should be return p? We looking for cyclic shift so IMHO i is correct. I still don't understand what we use p for ( for instance why we check in the last else if (k!=p)?
jethro
No, i is the start of the cycle, p is the stride. run the algorithm on 'bacdbacd'. IIRC at the end i will be 1 (pointing at the first a) and p will be 4 (the number of chars to rotate to get a match).
deinst
@Jethro I've added an update. Yes, I believe that you are correct and it should return i. I was misunderstanding the meaning of Minimum cyclic shift.
deinst