ansaurus

Question

Give two strings s1 and s2 , give an algorithm to find all characters from S1 which are also in S2.

Answer 1

+1 A:

Sort both of the strings into the same order, and then linearly scan through both sequences at the same time comparing each item, if it isn't the same, increment the sequence with the lower value character. It should be O(N log N) at a guess.

Mr Samuel 2010-09-21 09:21:47

sorting adds more overhead and complexity. Plus it might be an expensive operation if the string is long.

Sagar V 2010-09-21 09:25:50

I think O(NlogN + MlogM) which is near enough the same thing especially if you define N > M (where N and M are the string lengths).

Chris 2010-09-21 09:25:56

@Sagar V: possibly but if the other option is having to scan through the whole string each time you want to find a character in it then it migth be quicker in the long run...

Chris 2010-09-21 09:27:11

Answer 2

+1 A:

Off the top of my head I'd split both strings into a character array and sort them. I'd then go through one array and for each character check if its in the other. Having both arrays sorted means that you only need to loop through each array once when getting your result. I'm not sure if the overhead of sorting makes it any more efficient though.

Things that I have considered might be relevant are:

1) duplicate characters in a string - dismissed because off the top of my head I couldn't think of a nice efficient way of filtering out duplicates.

2) Different cases might want to be considered equal (eg A and a are considered the same) - not asked for but a toupper or tolower will solve that.

Chris 2010-09-21 09:23:28

Answer 3

A:

In ruby, algorithm O(len(s1) + len(s2)) using a hash table

require 'jcode'
def toto(s1, s2)
    contains={}
    s2.each_char { |x| contains[x]=1 }
    res=""
    s1.each_char do |x|
        if contains[x]==1
            res<<=x
        end
    end
    return res
end

puts toto("abcde","cdef")

yogsototh 2010-09-21 09:23:36

Answer 4

A:

Use a hash map where the keys are the characters of the first string and the values are an int starting at zero. Then just look up each character from S2 and if the lookup succeeds then increment the value. Setting up the hash map is O(n) and each lookup is O(1). Worst case performance is about O(2n) which equivalent to O(n).

Just for reference, you could you the Boost unordered_map container in the real world.

Robert S. Barnes 2010-09-21 09:25:10

Answer 5

+4 A:

Use some sort of set data structure: Fill the set with each character of S1. Then check for each character in S2 if it is in the set.

Depending on the implementation, both insertion and lookup on the set do only cost O(1).

Gumbo 2010-09-21 09:25:11

Damn, you must have answered while I was doing my second answer. How can I not give you a +1 when you've said the same thing. :)

Chris 2010-09-21 09:33:34

@Joey: Thanks for the remark although you have deleted your comment.

Gumbo 2010-09-21 09:36:29

Answer 6

A:

var commonChars = S1.Intersect(S2);

EDIT: This is from System.Linq - and it is an algorithm - it's a single, one-line algorithm. :-)

Enigmativity 2010-09-21 09:25:26

That's not an algorithm, that's a library function. I doubt he'd get away with this in an interview :)

Tim Pietzcker 2010-09-21 09:31:52

Language / platform information would be useful

Tx3 2010-09-21 09:32:52

@`Tim Pietzcker` - I would hope at an interviewer would recognise that a developer knew the most efficient way to code the solution. Efficient for the developer, but not necessarily the code execution, as that is where the biggest cost for an employer is.

Enigmativity 2010-09-21 22:53:47

Most efficient way to code the solution? Sure. Does it demonstrate his power for creative thinking? No. Hence the tag "interview-questions".If someone whips up their own intersect function when one exists, however, they deserve to be whipped themselves.

Christian Mann 2010-09-22 02:45:31

Answer 7

A:

Another method suggests itself to me now... In pseudo code since its easier to explain:

for each (character in stringA)
{
    charHashTable[character] = 1

}
for each (character in stringB)
{
    if (charHashTable[character]==1)
        if (!OutputArray.Contains(character))
            OutputArray.Add(character)
}

I'm not 100% sure of the performance characteristics of putting stuff into hash arrays but it occured to me that this might be a better way to efficiently find if a character is in one array than sorting it and doing it that way. If anybody can comment on the performance benefits of this method over the sorting both strings one I'd be very interested.

Chris 2010-09-21 09:31:08

Sorting strings should be O(NlogN), whereas hash map inserts/lookups are O(1), so the hash maps are faster. Still, O() analysis only becomes meaningful for dominant (i.e. very large) values of N. If the keys are a small set of contiguous values, then direct array indexing is much faster and more space efficient than a hash lookup, which consists of generating a hash number from the key (some maths transformations, bit shifting, and/or XORing) + having a sparse table of values to keep collisions to a manageable level.

Tony 2010-09-22 07:24:22

Answer 8

A:

If you know the number of unique characters that the strings are made up of (e.g.: If using only lower case characters then a-z means 26 unique symbols) then have an array of that size using the symbols as the index. Iterate over each string and for each of it's characters increment the corresponding array location (say you find b then a[1]++). In the end count the number of array positions which have a value of 2. Of course this solution has more space complexity.

Edit: Count the number of array positions that have 2 or more as the value. (Thanks to Chris for pointing out the flaw in the earlier answer)

Sagar V 2010-09-21 09:31:08

Wouldn't that fail in the case that string a was "aa" and string b was "bb" since both would have a value of 2 in your result array?

Chris 2010-09-21 09:32:18

Oh Yes! Thanks for pointing that out. I will edit my answer.

Sagar V 2010-09-21 09:35:44

I'm still not sure that it works in the example I gave before. Both of them would have values of 2 for a and b but neither of them are common characters...

Chris 2010-09-22 14:24:23

Answer 9

+5 A:

It's reasonable to assume the set of characters is small and finite compared to the potential string length. So, handling 8-bit characters for example (in roughly-C++ pseudo-code):

bool in_a[256] = { false };
bool in_b[256] = { false };
for (int i = 0; i < a.size(); ++i)
    in_a[a[i]] = true;
for (int i = 0; i < b.size(); ++i)
    in_b[b[i]] = true;
// and in_a and in_b
for (int i = 0; i < b.size(); ++i)
    if (in_a[i] && in_b[i])
    {
        std::cout << i;
        if (isprint(i)) std::cout << '\'' << (char)i << '\'';
        std::cout << ' ';
    }
std::cout << '\n';

Note that using a hash table rather than an array is a huge waste of time (unless handling say a 32-bit character representation).

An implementation of the simplification suggested in Yordan's comment follows. This avoids the final loop from 0..255 and needs only one tracking array. The order of results is unsorted.

#include <vector>
#include <iostream>

int main()
{
    std::string a = "abdcdefg";
    std::string b = "byfdz";
    bool track[256] = { false };
    for (int i = 0; i < a.size(); ++i)
        track[a[i]] = true;
    for (int i = 0; i < b.size(); ++i)
        if (track[b[i]])
        {
            track[b[i]] = false; // don't match again
            std::cout << (int)b[i];
            if (isprint(b[i])) std::cout << " \'" << (char)(b[i]) << "\'";
            std::cout << ' ';
        }
    std::cout << '\n';
}

Tony 2010-09-22 02:41:10

+1: Good answer. For character strings, anything else is overkill. The `isprint` carefulness is appreciable. BTW: (1) for `in_a` and `in_b` you could use `std::bitset<256>`. (2) In the third `for` loop, why it is `i < b.size()`? Won't it be `i < in_a.size()`?

ArunSaha 2010-09-23 05:22:49

Thanks. 1) bitset<256> is great for saving space: best when the lifetime outlives the function and there are lots of such sets, but bool[] will typically be faster (most CPUs will be slower doing lots of bit-shifting, ORs and ANDs, while bool will typically occupy an 'independent' byte), and as in_a and in_b are effectively temporaries on the stack - and not particularly big - I've optimised performance over space. 2) `i` is used to index into both in_a and in_b, so the choice was arbitrary. It would be slightly safer to "nail" one to the other ala `bool in_b[sizeof in_a]`, but confusing?

Tony 2010-09-24 00:50:19

@Tony For just solving the problem, I do not think you need the second array ( in_b[] ). The loop over b[] can also check in in_a[] and print the result.

Yordan Pavlov 2010-09-25 08:29:19

Yordan: good idea - implementation added to the answer. Thanks.

Tony 2010-09-27 01:33:40

ansaurus

tags:

views:

answers:

Give two strings s1 and s2 , give an algorithm to find all characters from S1 which are also in S2.

related questions