ansaurus

Question

Given two strings, find the longest common bag of chars

Answer 1

+4 A:

Create a set of the characters present in a, and another of the characters present in b. Walk through each string and strike (e.g., overwrite with some otherwise impossible value) all the characters not in the set from the other string. Find the longest string remaining in each (i.e., longest string of only "unstruck" characters).
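
As a rough illustration of the strike-and-scan idea described above, here is a minimal Python sketch. The function name longest_unstruck_run is made up for this example, and instead of overwriting characters with an impossible value it simply resets the current run whenever it meets a character the other string lacks.

```python
def longest_unstruck_run(s, other):
    """Return the longest run of s made only of characters that occur in other."""
    allowed = set(other)          # characters present in the other string
    best = current = ""
    for ch in s:
        if ch in allowed:
            current += ch         # extend the current "unstruck" run
            if len(current) > len(best):
                best = current
        else:
            current = ""          # a "struck" character ends the run
    return best


a, b = "ABCDDEGF", "FPCDBDAX"
print("A longest:", longest_unstruck_run(a, b))
print("B longest:", longest_unstruck_run(b, a))
```

With the example strings this yields ABCDD for a and CDBDA for b. Note that it treats the problem in terms of sets of characters, not counts.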

Edit: Here's a solution that works roughly as noted above, but in a rather language-specific fashion (using C++ locales/facets):

#include <algorithm>   // std::fill_n
#include <string>
#include <vector>
#include <iostream>
#include <locale>
#include <sstream>
#include <memory>

struct filter : std::ctype<char> {
    filter(std::string const &a) : std::ctype<char>(table, false) {
        std::fill_n(table, std::ctype<char>::table_size, std::ctype_base::space);

        for (size_t i=0; i<a.size(); i++) 
            table[(unsigned char)a[i]] = std::ctype_base::upper;
    }
private:
    std::ctype_base::mask table[std::ctype<char>::table_size];
};

std::string get_longest(std::string const &input, std::string const &f) { 
    std::istringstream in(input);
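    // The locale takes ownership of the facet (its reference count defaults to 0)
    // and destroys it itself, so it must not be deleted manually.
    filter *filt = new filter(f);

    in.imbue(std::locale(std::locale(), filt));

    std::string temp, longest;

    while (in >> temp)
        if (temp.size() > longest.size())
            longest = temp;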
    return longest;
}

int main() { 
    std::string a = "ABCDDEGF",  b = "FPCDBDAX";
    std::cout << "A longest: " << get_longest(a, b) << "\n";
    std::cout << "B longest: " << get_longest(b, a) << "\n";
    return 0;
}

Edit2: I believe this implementation is O(N) in all cases (one traversal of each string). That's based on std::ctype<char> using a table for lookups, which is O(1). With a hash table, lookups would also have O(1) expected complexity, but O(N) worst case, so overall complexity would be O(N) expected, but O(N²) worst case. With a set based on a balanced tree, you'd get O(N lg N) overall.

Jerry Coffin 2010-08-21 05:14:47

What if there are very few (or no) impossible characters? The problem you have "left" is just as bad as the original, and you have not stated a solution for that part.

Ether 2010-08-21 05:17:09

@Ether: This can be dealt with about a dozen different ways. One is to convert the characters to some other type with a greater range, so you have "spare" values to use. Another is to use a slightly different algorithm, such as separating the string into sub-strings instead of just "striking out" the values you don't care about.

Jerry Coffin 2010-08-21 05:21:40

I would just sort and walk, advancing the second index on a non-match and storing and advancing both indexes on a match. Especially since both arrays are the same length.

Jason Coco 2010-08-21 05:21:59

@Jason: at least if I understand what you're getting at, I don't think that works. For example, both example strings contain "F", but it's not part of the longest substring (i.e., in both cases it's separate from any common character).

Jerry Coffin 2010-08-21 05:35:04

@Jerry Coffin: True. I had overlooked that F bit. Also, didn't see Jason Coco's comment.

dirkgently 2010-08-21 05:58:24

Can we do something like creating an array of ASCII char values for each of the two strings and then matching the two arrays for common ASCII values?

saurabh 2010-08-21 06:19:29

Say you have, after striking the chars (impossible being x):
X = xxBABAxx
Y = xAxBBxAx
The longest bag is 1, 'A' or 'B'. It seems your algorithm would return 'BB'.

ring0 2010-08-21 07:22:18

@ring0: yes, my algorithm will return "BABA" and "BB". At least as I read the question, that's what he wants (e.g., his example of "ABCDD").

Jerry Coffin 2010-08-21 07:33:20

@Jerry: yes, but the ABCDD letters and the CDBDA letters are both in the same "bag", as consecutive letters, while BB on one side is not found in any of the BABA "bags". In other terms, taking one bag of consecutive letters from string X and one bag from string Y, they "match" if at least one permutation of X is Y. This is how I understand the problem.

ring0 2010-08-21 07:40:34

@ring0: you could easily be right, and if so it would certainly require a different algorithm than I've used.

Jerry Coffin 2010-08-21 07:49:45

Have to -1 as you're neglecting the "frequency" components of the elements in the bag. A "bag" is a proper math term, meaning the same as "multiset" -- a set where each element carries a count/multiplicity/frequency, and according to my interpretation of the problem, these counts are required to match. I believe you're answering a slightly different question, where "bag" has been replaced with "set", which makes the problem much easier.

j_random_hacker 2010-08-21 16:06:23
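
The set-versus-bag distinction raised in the comment above can be shown in a few lines. This is only a sketch: it uses Python's collections.Counter as the multiset, and the helper names same_set and same_bag are made up for illustration.

```python
from collections import Counter


def same_set(x, y):
    """True if x and y use the same characters, ignoring how often each occurs."""
    return set(x) == set(y)


def same_bag(x, y):
    """True if x and y contain the same characters with the same counts."""
    return Counter(x) == Counter(y)


print(same_set("ABCD", "CDBDA"), same_bag("ABCD", "CDBDA"))
print(same_set("ABCDD", "CDBDA"), same_bag("ABCDD", "CDBDA"))
```

ABCD and CDBDA share a set but not a bag, because D occurs once in the first and twice in the second; ABCDD and CDBDA share both.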

@Jerry Coffin: You're right, I misread the substring part.

Jason Coco 2010-08-21 16:08:44

@j_random_hacker: that is, of course, your privilege. Given how many people have understood the question in different ways, I think if a downvote is merited, it should go to the original question. Even after four or five revisions, the goal does not seem entirely clear.

Jerry Coffin 2010-08-21 16:19:12

@Jerry: The original question could have been phrased more clearly, though I think the latest update clarifies things. I realise you answered before that, and it's annoying when an OP edits the question afterwards. If you answer the question as currently posed, I'll of course +1.

j_random_hacker 2010-08-21 16:28:12

Answer 2

A:

Here's my rather anti-Pythonic implementation that nevertheless leverages Python's wonderful built-in sets and strings.

a = 'ABCDDEGF'
b = 'FPCDBDAX'

best_solution = None
best_solution_total_length = 0

def try_expand(a, b, a_loc, b_loc):
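    # out of range checks (an end index equal to the length is still a valid slice end,
    # so only reject indexes that go past it; otherwise the last character is never used)
    if a_loc[0] < 0 or b_loc[0] < 0:
        return
    if a_loc[1] > len(a) or b_loc[1] > len(b):
        return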


    if set(a[a_loc[0] : a_loc[1]]) == set(b[b_loc[0] : b_loc[1]]):
        global best_solution_total_length, best_solution
        #is this solution better than anything before it?
        if (len(a[a_loc[0] : a_loc[1]]) + len(b[b_loc[0] : b_loc[1]])) > best_solution_total_length:
            best_solution = (a_loc, b_loc)
            best_solution_total_length = len(a[a_loc[0] : a_loc[1]]) + len(b[b_loc[0] : b_loc[1]])


    try_expand(a, b, (a_loc[0]-1, a_loc[1]), (b_loc[0], b_loc[1]))
    try_expand(a, b, (a_loc[0], a_loc[1]+1), (b_loc[0], b_loc[1]))
    try_expand(a, b, (a_loc[0], a_loc[1]), (b_loc[0]-1, b_loc[1]))
    try_expand(a, b, (a_loc[0], a_loc[1]), (b_loc[0], b_loc[1]+1))


for a_i in range(len(a)):
    for b_i in range(len(b)):
        # start the recursive expansion from identical letters in the two strings
        if a[a_i] == b[b_i]:
            # if substrings were expanded from this range before then there won't be an answer there
            if best_solution is None or best_solution[0][0] > a_i or best_solution[0][1] <= a_i or best_solution[1][0] > b_i or best_solution[1][1] <= b_i:
                try_expand(a, b, (a_i, a_i), (b_i, b_i))


print(a[best_solution[0][0] : best_solution[0][1]], b[best_solution[1][0] : best_solution[1][1]])

Forgot to mention that this is obviously a fairly brute-force approach, and I'm sure there's an algorithm that runs much, much faster.

Novikov 2010-08-21 06:38:49

Answer 3

+3 A:

Just a note to say that this problem will not admit a "greedy" solution in which successively larger bags are constructed by extending existing feasible bags one element at a time. The reason is that even if a length-k feasible bag exists, there need not be any feasible bag of length (k-1), as the following counterexample shows:

ABCD
CDAB

Clearly there is a length-4 bag (A:1, B:1, C:1, D:1) shared by the two strings, but there is no shared length-3 bag. This suggests to me that the problem may be quite hard.

j_random_hacker 2010-08-21 16:17:21
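
The counterexample above can be checked mechanically. The sketch below lists every length at which two strings share a bag; the function name common_bag_lengths is made up for illustration, and it assumes a bag is identified by the sorted characters of a contiguous substring.

```python
def common_bag_lengths(a, b):
    """Return every length k for which some length-k substring of a and some
    length-k substring of b contain the same characters with the same counts."""
    lengths = []
    for k in range(1, min(len(a), len(b)) + 1):
        # Sorting a window's characters gives a canonical key for its bag.
        bags_a = {"".join(sorted(a[i:i + k])) for i in range(len(a) - k + 1)}
        bags_b = {"".join(sorted(b[j:j + k])) for j in range(len(b) - k + 1)}
        if bags_a & bags_b:
            lengths.append(k)
    return lengths


print(common_bag_lengths("ABCD", "CDAB"))
```

For ABCD and CDAB it reports lengths 1, 2 and 4, with 3 missing, which is the gap that defeats a one-element-at-a-time extension.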

Mega-ouch! I was looking at a greedy approach but you showed it doesn't work. Lacking a greedy approach it looks like there can't be an answer under O(n!) runtime.

Loren Pechtel 2010-08-21 16:33:32

Well, there's always the O(n^4) brute-force approach of comparing every substring of A with every substring of B. And there could be a divide-and-conquer or dynamic programming approach I'm not seeing. Also I'm pretty sure a faster solution should be possible for small alphabets (e.g. binary). Would be nice to think about this more but I gotta do some real work now! :)

j_random_hacker 2010-08-21 19:03:29
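
One way to tighten the brute-force idea from the comment above is to compare windows length by length and keep a running character count per window, so no substring is recounted from scratch. This is only a sketch under that assumption, not a claim of optimality; window_bags and longest_common_bag are made-up names.

```python
from collections import Counter


def window_bags(s, k):
    """Map the bag of each length-k window of s to the first start index having it."""
    bags = {}
    counts = Counter(s[:k])
    bags.setdefault(frozenset(counts.items()), 0)
    for start in range(1, len(s) - k + 1):
        out_ch, in_ch = s[start - 1], s[start + k - 1]
        counts[out_ch] -= 1              # character leaving the window
        if counts[out_ch] == 0:
            del counts[out_ch]           # keep zero counts out of the key
        counts[in_ch] += 1               # character entering the window
        bags.setdefault(frozenset(counts.items()), start)
    return bags


def longest_common_bag(a, b):
    """Return one pair of equal-length substrings (from a, from b) sharing the
    largest common bag, or two empty strings if no character is shared."""
    for k in range(min(len(a), len(b)), 0, -1):
        bags_a, bags_b = window_bags(a, k), window_bags(b, k)
        common = bags_a.keys() & bags_b.keys()
        if common:
            key = min(common, key=lambda bag: bags_a[bag])  # earliest match in a
            i, j = bags_a[key], bags_b[key]
            return a[i:i + k], b[j:j + k]
    return "", ""


print(longest_common_bag("ABCDDEGF", "FPCDBDAX"))
```

On the example strings from Answer 1 this returns ABCDD and CDBDA. Trying the longest length first avoids relying on shorter bags existing, which matters given the ABCD/CDAB counterexample.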

Answer 4

+1 A:

Let's look at the problem like this. This solution should be more optimized and very easy to code, but read through the description, and you MUST read the code to get the idea; otherwise it will just sound crazy and complex.

THINK ABOUT THIS

Take the two example strings in your question as two sets of characters, i.e. {x,y,z}.

The resulting substring (set) will consist of characters common to both strings (sets), its entries will be contiguous, and the qualifying substring (set) will be the one with the highest number of entries.

Those are a few properties of the result, but they only work when used with the following algorithm/methodology.

we have two sets

a = { BAHYJIKLO }

b = { YTSHYJLOP }

Take

a U b = { - , - , H , Y , J , - , - , L , O }

b U a = {Y , - , - , H , Y , J , L , O , -}

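I have simply replaced the characters that did not qualify for the union set with a "-" or any other special/ignored character. (Strictly speaking this keeps only the characters common to both strings, so it is an intersection that preserves each string's order.)

Doing so gives two strings from which we can easily extract HYJ, LO, Y, HYJLO.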
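
The masking step above can be sketched in a few lines of Python. The helper names mask and runs are made up for this example, and it assumes the filler character "-" never occurs in the input strings. It only illustrates the masking and extraction; it does not settle whether the longest run answers the original question.

```python
def mask(s, other, filler="-"):
    """Replace every character of s that does not occur in other with filler."""
    common = set(other)
    return "".join(ch if ch in common else filler for ch in s)


def runs(masked, filler="-"):
    """Split a masked string into its non-empty runs of surviving characters."""
    return [piece for piece in masked.split(filler) if piece]


a, b = "BAHYJIKLO", "YTSHYJLOP"
candidates = runs(mask(a, b)) + runs(mask(b, a))
print(mask(a, b), mask(b, a))
print(candidates)
print(max(candidates, key=len))
```

For the two example strings the masked forms are --HYJ--LO and Y--HYJLO-, and the extracted runs are HYJ, LO, Y and HYJLO.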

String/substring comparisons and other processing take time, so I write these strings/substrings to a text file, separated by spaces or on different lines. That way, when I read the file I get each whole string at once, instead of needing a nested loop to locate a substring or manage temporary variables.

Once you have HYJ, LO, Y, HYJLO, I don't think it's a problem to find your desired result.

NOTE: if you process the strings and substrings with temporary variables and nested loops, first building a substring and then searching for it, the solution is going to be very costly. You have to use a file like this:

char a[20], b[20]; // a[20] & b[20] are the two strings
cin>>a; cin>>b;
int t=0;

open a temporary text file "file1" to write '(built-in-function works here)'
//a U b
for(int x=0; x<length(a); x++)
{
    t=0;

    // does a[x] occur anywhere in b?
    for(int y=0; y<length(b); y++)
       { if( a[x] == b[y]) t=1; }

    if(t == 1)
       { 
          write 'a[x]' to the file1 '(built-in-function works here)'
          t=0;
       }
    else
       write a 'space' to the file1 '(built-in-function works here)'
}

// keep the two masked strings apart
write a 'space' to the file1 '(built-in-function works here)'

//b U a
for(int x=0; x<length(b); x++)
{
    t=0;

    // does b[x] occur anywhere in a?
    for(int y=0; y<length(a); y++)
       { if( b[x] == a[y]) t=1; }

    if(t == 1)
       {
         write 'b[x]' to the file1 '(built-in-function works here)'
         t=0;
       }
    else
       write a 'space' to the file1 '(built-in-function works here)'
}
/*output in the file will be like this
_____FILE1.txt_____
  HYJ  LO Y  HYJLO 
*/
//load all words in an array of strings from file '(built-in-function works here)'

char *words[]={"HYJ","LO","Y","HYJLO"};
int size=0,index=0;

for( int x=0; x<length(words); x++)
    for( int y=0; y<length(words); y++)
    {
       if( x!=y && words[x] is a substring of words[y] ) // '(built-in-function works here)'
          {
               if( length(words[x]) > size )
               {
                     size = length(words[x]);
                     index = x;
               }
          }
    }

 cout<< words[index]; 
 //its the desired result.. its pretty old school but i think you get the idea

I wrote the code for this and it works; if you want it, give me your email and I will send it to you. By the way, I like this problem, and the complexity of this algorithm is 3n².

Junaid Saeed 2010-08-22 10:52:46

P.S. There is a lot of pseudo-code in there; that's where the built-in functions come in. I wrote the real code for TC++.

Junaid Saeed 2010-08-22 10:57:51

P.S. (part 2) I have done lexical analysis of C++ with just two loops using this file-based method, so the complexity of my solution for the lexical analysis of C++ came out to be 2n :)

Junaid Saeed 2010-08-22 10:59:36

P.S. (part 3) For interview questions it's a good idea to explain an algorithm or a technique instead of bombarding the interviewers with a list of built-in functions that solve the problem.

Junaid Saeed 2010-08-22 11:10:32

woohh.. that was lengthy

Junaid Saeed 2010-08-22 11:11:34

I think I got the last part wrong: once the words are in the file, the longest of them is the required result.

Junaid Saeed 2010-08-22 12:30:17
