views:

91

answers:

2

I need to find out how much percentage or chars does one string contains into another string. I've tried Levenshtein Distance but that algorithm returns how much char's are needed to be change for the strings to be equal. Can some one help? I need it in c# but that's not so important.

The answer code: public double LongestCommonSubsequence(string s1, string s2) { //if either string is empty, the length must be 0 if (String.IsNullOrEmpty(s1) || String.IsNullOrEmpty(s2)) return 0;

    int[,] num = new int[s1.Length, s2.Length];  //2D array
    char letter1;
    char letter2;

    //Actual algorithm
    for (int i = 0; i < s1.Length; i++)
    {
        letter1 = s1[i];
        for (int j = 0; j < s2.Length; j++)
        {
            letter2 = s2[j];

            if (letter1 == letter2)
            {
                if ((i == 0) || (j == 0))
                    num[i, j] = 1;
                else
                    num[i, j] = 1 + num[i - 1, j - 1];
            }
            else
            {
                if ((i == 0) && (j == 0))
                    num[i, j] = 0;
                else if ((i == 0) && !(j == 0))   //First ith element
                    num[i, j] = Math.Max(0, num[i, j - 1]);
                else if (!(i == 0) && (j == 0))   //First jth element
                    num[i, j] = Math.Max(num[i - 1, j], 0);
                else // if (!(i == 0) && !(j == 0))
                    num[i, j] = Math.Max(num[i - 1, j], num[i, j - 1]);
            }
        }//end j
    }//end i
    return (s2.Length - (double)num[s1.Length - 1, s2.Length - 1]) / s1.Length * 100; 
} //end LongestCommonSubsequence
A: 

Uhh... can't you just use the number of characters that need to change then?

(length(destination)-changed_character_count)/ length(source)

EDIT: based on the revised question, treat both strings as sets, compute the set intersection, and base the percentage off the size of that set and the source string as a set.

MSN
I need how much one string contains into another, for example "Ivan" in "This is Ivan Jovanov" is contained 100%.
Pece
@Pece: the Levenshtein distance would tell you that. That's why you compare the length of the destination string minus the size of the edits to the length of the source string. In your test case, it should end up being 100% because you don't actually delete any characters from the source string.
MSN
The problem in this is if I compare "Ivan" with "Ivaxxxn" is that if I use: "(length(destination)-changed_character_count)/ length(source)" it will return 100%
Pece
That's an additional constraint you should probably specify.
MSN
+2  A: 

It sounds like you might want the longest common subsequence which is the basis for diff algorithms. Unfortunately this problem is NP-hard which means there is no efficient (polynomial time) solution. The Wikipedia page has some suggestions.

Mark Byers
Here the problem only consider 2 strings, therefore it can be done in quadratic time.
Mgccl
Write now I'm testing this, so I'll write the results in a few minutes.
Pece
Yep the tests went well, thanks.I'll edit the Question with the c# algorithm.
Pece