views:

978

answers:

5

I have recently come across an interesting question on strings. Suppose you are given following:

Input string1: "this is a test string"
Input string2: "tist"
Output string: "t stri"

So, given above, how can I approach towards finding smallest substring of string1 that contains all the characters from string 2?

A: 

Always going for the dumbest solution, there are n^2 substrings, try each and find the shortest. Runs in O(n^3), but I believe this can be done in O(n).

min-so-far = INFINITY
min-start = null
min-end = null

i := 1 to string1.length - 1
    j := i+1 to string1.length
        // check if string1(i, j) contains all characters in string2
        // and update min-so-far if it's lower than current best
        // update min-start to i, and min-end to j
Anurag
@Anurag, I see space complexity is O(1) and time complexity is O(n^3) if I am not wrong. For every substring of string1, we are going to determine if say m characters of string2 are in that substring or not. am I correct?
Rajendra Kumar Uppal
that's a good optimization Carl.. i just wanted to keep it as simple as possible, so used comments for the non-relevant code.
Anurag
@Rajendra, yepp you're right my oversight :) it's n^3
Anurag
+2  A: 

Here's an O(n) solution. The basic idea is simple: for each starting index, find the least ending index such that the substring contains all of the necessary letters. The trick is that the least ending index increases over the course of the function, so with a little data structure support, we consider each character at most twice.

In Python:

from collections import defaultdict

def smallest(s1, s2):
    assert s2 != ''
    d = defaultdict(int)
    nneg = [0]  # number of negative entries in d
    def incr(c):
        d[c] += 1
        if d[c] == 0:
            nneg[0] -= 1
    def decr(c):
        if d[c] == 0:
            nneg[0] += 1
        d[c] -= 1
    for c in s2:
        decr(c)
    minlen = len(s1) + 1
    j = 0
    for i in xrange(len(s1)):
        while nneg[0] > 0:
            if j >= len(s1):
                return minlen
            incr(s1[j])
            j += 1
        minlen = min(minlen, j - i)
        decr(s1[i])
    return minlen
@algorithmist, I have not worked in Python, but I can get that for...while loop doesn;t seem to O(n). Can you please tell your approach taking example given in the question, would appreciate that.
Rajendra Kumar Uppal
j can only increase len(s1) times, so the while loop does O(n) work in total.
@Rajendra: This algorithm does exactly what I described in my post, if that helps--`i` marks the tail of the substring and `j` marks the head. @algorithmist: nice work, coming up with code ever-so-slightly-faster than I came up with a description!
Rex Kerr
+5  A: 

You can do a histogram sweep in O(N+M) time and O(1) space where N is the number of characters in the first string and M is the number of characters in the second.

It works like this:

  • Make a histogram of the second string's characters (key operation is hist2[ s2[i] ]++).
  • Make a cumulative histogram of the first string's characters until that histogram contains every character that the second string's histogram contains (which I will call "the histogram condition").
  • Then move forwards on the first string, subtracting from the histogram, until it fails to meet the histogram condition. Mark that bit of the first string (before the final move) as your tentative substring.
  • Move the front of the substring forwards again until you meet the histogram condition again. Move the end forwards until it fails again. If this is a shorter substring than the first, mark that as your tentative substring.
  • Repeat until you've passed through the entire first string.
  • The marked substring is your answer.

Note that by varying the check you use on the histogram condition, you can choose either to have the same set of characters as the second string, or at least as many characters of each type. (Its just the difference between a[i]>0 && b[i]>0 and a[i]>=b[i].)

You can speed up the histogram checks if you keep a track of which condition is not satisfied when you're trying to satisfy it, and checking only the thing that you decrement when you're trying to break it. (On the initial buildup, you count how many items you've satisfied, and increment that count every time you add a new character that takes the condition from false to true.)

Rex Kerr
+1: This is much more readable than python. Would be nice if you included a proof/explanation of why it works too.
Moron
@Rex Kerr: I fail to see how that is O(1) space. Will your histograms not take up O(N+M) space if all character are unique (worst case)?
Mads Ravn
O(M) space can be done (rather than O(N+M)), as you don't really need to concern yourself with characters not present is s2. I agree though, that the space usage is O(1) seems incorrect and does not seem to match the description.
Moron
You can always bound the space required by the size of the character set. Since the character set is a constant size, it's called `O(1)`. The constant may, however, be large (larger than `M`). Big-O notation is useful for knowing what will happen when your strings get really large, not really small. For bytes, "O(1)" is a bit over 256. (P.S.--I'm afraid I don't have time to write a sufficiently detailed proof of time taken; just note that you have one pass through `M` to build the first histogram and then the head and tail of the substring pass at most once through `N`.)
Rex Kerr
@Rex: I think we all know what O(1) means. You are missing the point, and the proof request was not about why it is O(N), it was about why it is _correct_. If you like, I can add that to your post.
Moron
@Moron: You are completely right. I was just following the steps of the algorithm and did not take this (simple) optimization into account. @Rex Kerr: From a theoretical standpoint I believe you are wrong. If you allocate a constant amount of memory I can chose M large enough that your counters overflow, so we would need at least O(log_2(M)) space. In a more pragmatic use of the notation I would also consider O(1) to be a bit misleading as one usually associates this with a small fixed amount of memory. Can we settle for O(min(charset size, M)) :)
Mads Ravn
@Moron: Why not add your own answer about why it is/isn't correct, since both algorithmist and I have the same answer (as does the solution at the end of nvl's link)? @Mads: Good point--let's say `O(|set(M)|)` perhaps, where `|set(M)|` is the number of unique characters in `M`.
Rex Kerr
@Rex. I am not claiming it is incorrect. I gave it a +1 already (i.e. I think it is correct). I really don't see the point of having multiple answers saying the same thing in different ways or one answer supplementing the other. This site is meant to answer questions, not to drown the questioner with a lot of varied answers saying similar things. I suggested you add a proof to make your answer more complete (and better), not to question the correctness. You don't seem to get the point of this site... Anyway, I am done with this conversation.
Moron
A: 

Edit: apparently there's an O(n) algorithm (cf. algorithmist's answer). Obviously this have this will beat the [naive] baseline described below!

Too bad I gotta go... I'm a bit suspicious that we can get O(n). I'll check in tomorrow to see the winner ;-) Have fun!

Tentative algorithm:
The general idea is to sequentially try and use a character from str2 found in str1 as the start of a search (in either/both directions) of all the other letters of str2. By keeping a "length of best match so far" value, we can abort searches when they exceed this. Other heuristics can probably be used to further abort suboptimal (so far) solutions. The choice of the order of the starting letters in str1 matters much; it is suggested to start with the letter(s) of str1 which have the lowest count and to try with the other letters, of an increasing count, in subsequent attempts.

  [loose pseudo-code]
  - get count for each letter/character in str1  (number of As, Bs etc.)
  - get count for each letter in str2
  - minLen = length(str1) + 1  (the +1 indicates you're not sure all chars of 
                                str2 are in str1)
  - Starting with the letter from string2 which is found the least in string1,
    look for other letters of Str2, in either direction of str1, until you've 
    found them all (or not, at which case response = impossible => done!). 
    set x = length(corresponding substring of str1).
 - if (x < minLen), 
         set minlen = x, 
         also memorize the start/len of the str1 substring.
 - continue trying with other letters of str1 (going the up the frequency
   list in str1), but abort search as soon as length(substring of strl) 
   reaches or exceed minLen.  
   We can find a few other heuristics that would allow aborting a 
   particular search, based on [pre-calculated ?] distance between a given
   letter in str1 and some (all?) of the letters in str2.
 - the overall search terminates when minLen = length(str2) or when 
   we've used all letters of str1 (which match one letter of str2)
   as a starting point for the search
mjv
+1  A: 

I was just going through this discussion at joelonsoftware.com . Nice approach for finding required smallest substring.

N 1.1