views:

647

answers:

5

Hi,

I was going through some Amazon interview questions on CareerCup.com, and I came across this interesting question which I haven't been able to figure out how to do. I have been thinking on this since 2 days. Either I am taking a way off approach, or its a genuinely hard function to write.

Question is as follows:

Write a function in C that can find if a string is a sub-string of another. Note that a mismatch of one character should be ignored.

A mismatch can be an extra character: ’dog’ matches ‘xxxdoogyyyy’  
A mismatch can be a missing character: ’dog’ matches ‘xxxdgyyyy’ 
A mismatch can be a different character: ’dog’ matches ‘xxxdigyyyy’

The return value wasn't mentioned in the question, so I assume the signature of the function can be something like this:

char * MatchWithTolerance(const char * str, const char * substr);

If there is a match with the given rules, return the pointer to the beginning of matched substring within the string. Else return null.

Bonus

If someone can also figure out a generic way of making the tolerance to n instead of 1, then that would be just brilliant. In that case the signature would be:

char * MatchWithTolerance(const char * str, const char * substr, unsigned int tolerance = 1);

Thanks to all those who would attempt this and share their successful solution.

A: 

Is the problem to do this efficiently?

The naive solution is to loop over every substring of size substr in str, from left to right, and return true if the current substring if only one of the characters is different in a comparison.

Let n = size of str Let m = size of substr

There are O(n) substrings in str, and the matching step takes time O(m). Ergo, the naive solution runs in time O(n*m)

Stefan Kendall
I guess, efficiency wouldn't be the primary thing they are testing in the question. Still I would say, lets try to do it in O(n*m).
bits
That solution IS O(n*m)
Stefan Kendall
-1 This approach is not even close to the actual solution posted by IVlad.
bits
+3  A: 

This seems to work, let me know if you find any errors and I'll try to fix them:

int findHelper(const char *str, const char *substr, int mustMatch = 0)
{
    if ( *substr == '\0' )
        return 1;

    if ( *str == '\0' )
        return 0;

    if ( *str == *substr )
        return findHelper(str + 1, substr + 1, mustMatch);
    else
    {
        if ( mustMatch )
            return 0;

        if ( *(str + 1) == *substr )
            return findHelper(str + 1, substr, 1);
        else if ( *str == *(substr + 1) )
            return findHelper(str, substr + 1, 1);
        else if ( *(str + 1) == *(substr + 1) )
            return findHelper(str + 1, substr + 1, 1);
        else if ( *(substr + 1) == '\0' )
            return 1;
        else
            return 0;
    }
}

int find(const char *str, const char *substr)
{
    int ok = 0;
    while ( *str != '\0' )
        ok |= findHelper(str++, substr, 0);

    return ok;
}


int main()
{
    printf("%d\n", find("xxxdoogyyyy", "dog"));
    printf("%d\n", find("xxxdgyyyy", "dog"));
    printf("%d\n", find("xxxdigyyyy", "dog"));
}

Basically, I make sure only one character can differ, and run the function that does this for every suffix of the haystack.

IVlad
+1!! That works brilliantly. I am sure we can make this return pointer to the beginning of the matched string. And This can be made generic too. Thanks a lot! If I come across any unmatching scenarios, I will reply back.
bits
+3  A: 

This is slightly different than the earlier solution, but I was intrigued by the problem and wanted to give it a shot. Obviously optimize if desired, I just wanted a solution.

char *match(char *str, char *substr, int tolerance)
{
  if (! *substr) return str;
  if (! *str) return NULL;

  while (*str)
  {
    char *str_p;
    char *substr_p;
    char *matches_missing;
    char *matches_mismatched;

    str_p = str;
    substr_p = substr;

    while (*str_p && *substr_p && *str_p == *substr_p)
    {
      str_p++;
      substr_p++;
    }
    if (! *substr_p) return str;
    if (! tolerance)
    {
      str++;
      continue;
    }

    if (strlen(substr_p) <= tolerance) return str;

    /* missed due to a missing letter */
    matches_missing = match(str_p, substr_p + 1, tolerance - 1);
    if (matches_missing == str_p) return str;

    /* missed due to a mismatch of letters */
    matches_mismatched = match(str_p + 1, substr_p + 1, tolerance - 1);
    if (matches_mismatched == str_p + 1) return str;

    str++;
  }
  return NULL;
}
Brandon Horsley
+1 for actually attempting the problem.
IVlad
I haven't tried out your solution yet, but Non-recursive approach is supposedly very difficult. +1 for Attempting.
bits
+2  A: 

This is related to a classical problem of IT, referred to as Levenshtein distance. See Wikibooks for a bunch of implementations in different languages.

Pumbaa80
+1 for not writing code that's clearly going to be plagiarized and submit as his own.
Stefan Kendall
No, this has nothing to do with the levenshtein distance as far as I can see. This asks for substring matching where 1 character is allowed not to match. How do you propose we use the levenshtein distance for this?
IVlad
I should have been a little clearer. The Levenshtein implementations do not solve the original problem. But, looking at the algorithm you'll realize that only a slight change is needed to achieve fuzzy substring matching. Again, the internets has an implementation ready: http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/
Pumbaa80
@Stefan Why are you picking on me in every damn comment? I am not submitting this code anywhere! You didn't even read my question properly. I wrote, that I was going through interview questions and found this interesting. And I tried to solve it, but failed somehow. Why do you think that I am planning to plagiarize this? This means you don't even understand how interviews in real worls take place. No company would ask you to write that big a function on phone. They will only ask you to write such a code on face-2-face interview, where you clearly cannot copy-paste from Stack Overflow.
bits
@Pumbaa Thanks for your inputs. I appreciate it. +1
bits
@bits: There are such thing as automated source code submissions, or email-based code submissions.
Stefan Kendall
A: 

With arbitary no. of tolerance levels.

Worked for all the test cases I could think of. Loosely based on |\/|ad's solution.

#include<stdio.h>
#include<string.h>

report (int x) {
    if ( x )
        printf( "%s is a substring of %s for a tolerance[%d]\n",sstr,str[i],t[i] );
    else
        printf ( "%s is NOT a substring of %s for a tolerance[%d]\n",sstr,str[i],t[i] );
}

int find_with_tolerance (char *str, char *sstr, int tol) {

    if ( (*sstr) == '\0' ) //end of substring, and match
        return 1;

    if ( (*str) == '\0' ) //end of string
        if ( tol >= strlen(sstr) ) //but tol saves the day
            return 1;
        else    //there's nothing even the poor tol can do
            return 0;

    if ( *sstr == *str ) { //current char match, smooth
        return find_with_tolerance ( str+1, sstr+1, tol );
    } else {
        if ( tol <= 0 ) //that's it. no more patience
            return 0;
        for(int i=1; i<=tol; i++) {
            if ( *(str+i) == *sstr ) //insertioan of a foreign character
                return find_with_tolerance ( str+i+1, sstr+1, tol-i );
            if ( *str == *(sstr+i) ) //deal with dletion
                return find_with_tolerance ( str+1, sstr+i+1, tol-i );
            if ( *(str+i) == *(sstr+i)  ) //deal with riplacement
                return find_with_tolerance ( str+i+1, sstr+i+1, tol-i );
            if ( *(sstr+i) == '\0' ) //substr ends, thanks to tol & this loop
                return 1;
        }
        return 0; //when all fails
    }
}

int find (char *str, char *sstr, int tol ) {
    int w = 0;
    while (*str!='\0')
        w |= find_with_tolerance ( str++, sstr, tol );
    return (w) ? 1 : 0;
}

int main() {
    const int n=3; //no of test cases
    char *sstr = "dog"; //the substr
    char *str[n] = { "doox", //those cases
                    "xxxxxd",
                    "xxdogxx" };
    int t[] = {1,1,0}; //tolerance levels for those cases
    for(int i = 0; i < n; i++) {
        report( find ( *(str+i), sstr, t[i] ) );
    }
    return 0;
}
ignoramous