ansaurus

Question

Algorithm to find first index where strings are different?

Answer 1

+1 A:

Use the set as you proposed, that's exactly the right thing to do.

Georg 2009-05-20 16:40:03

Answer 2

A:

int i = 0;
while(true)
{
    Set set = new Set();
    for(int j = 0; j < strings.length; j++)
    {
         if(i >= strings[j].length) return i;
         String chr = strings[j].charAt(i);
         if(set.hasElement(chr))
             break;
         else
             set.addElement(chr);
    }
    if(set.size() == strings.length)
        return i;
    i++;
}

Gotta check preconditions first.

EDIT: Using a set now. Changed langauge.

CookieOfFortune 2009-05-20 16:40:31

Nice "set.size() == strings.length" to check if you've made it all the way through the strings.So my initial instincts were correct?

Erik R. 2009-05-20 17:15:12

I believe that using a set and then checking the number of elements in it to be the easiest option. I'm actually going to edit this again to use a function on the inside.

CookieOfFortune 2009-05-20 17:18:41

Answer 3

+3 A:

This is untested, but here's my attempt. (I may be making it more complicated than I have to, but I think it's a different way to look at it.)

The basic idea is to compile groups of items that match at the first element, then find the max unique index for each group, checking elements at each successive index.

int FirstUniqueIndex<T>(IEnumerable<IEnumerable<T>> myArrayCollection)
{
    //just an overload so you don't have to specify index 0 all the time
    return FirstUniqueIndex(myArrayCollection, 0);
}

int FirstUniqueIndex<T>(IEnumerable<IEnumerable<T>> myArrayCollection, int StartIndex)
{
    /* Group the current collection by the element at StartIndex, and
     * return a collection of these groups. Additionally, we're only interested
     * in the groups with more than one element, so only get those.*/

    var groupsWithMatches = from var item in myArrayCollection //for each item in the collection (called "item")
                            where item.Length > StartIndex //that are long enough
                            group by item[StartIndex] into g //group them by the element at StartIndex, and call the group "g"
                            where g.Skip(1).Any() //only want groups with more than one element
                            select g; //add the group to the collection

    /* Now "groupsWithMatches" is an enumeration of groups of inner matches of
     * your original arrays. Let's process them... */

    if(groupsWithMatches.Any()) 
        //some matches were found - check the next index for each group
        //(get the maximum unique index of all the matched groups)
        return groupsWithMatches.Max(group => FirstUniqueIndex(group, StartIndex + 1));
    else
        //no matches found, all unique at this index
        return StartIndex;
}

And for the non-LINQ version of the above (I'll change it to use a List collection, but any collection will do). I'll even remove the lambda. Again untested, so try not to aim sharp implements in my direction.

int FirstUniqueIndex<T>(List<List<T>> myArrayCollection, int StartIndex)
{
    /* Group the current collection by the element at StartIndex, and
     * return a collection of these groups. Additionally, we're only interested
     * in the groups with more than one element, so only get those.*/

    Dictionary<T, List<List<T>>> groupsWithMatches = new Dictionary<T, List<List<T>>>();

    //group all the items by the element at StartIndex
    foreach(var item in myArrayCollection)
    {
        if(item.Count > StartIndex)
        {
            List<List<T>> group;
            if(!groups.TryGetValue(item[StartIndex], out group))
            {
                //new group, so make it first
                group = new List<List<T>>();
                groups.Add(item[StartIndex], group);
            }

            group.Add(Item);
        }
    }

    /* Now "groups" is an enumeration of groups of inner matches of
     * your original arrays. Let's get the groups with more than one item. */

    List<List<List<T>>> groupsWithMatches = new List<List<List<T>>>(groups.Count);

    foreach(List<List<T> group in groupsWithMatches)
    {
        if(group.Count > 1)
            groupsWithMatches.Add(group);
    }

    if(groupsWithMatches.Count > 0)
    {
        //some matches were found - check the next index for each group
        //(get the maximum unique index of all the matched groups)

        int max = -1;
        foreach(List<List<T>> group in groupsWithMatches)
        {
            int index = FirstUniqueIndex(group, StartIndex + 1);
            max = index > max ? index : max;
        }
        return max;
    }
    else
    {
        //no matches found, all unique at this index
        return StartIndex;
    }
}

lc 2009-05-20 16:57:48

Can I ask what language that is? Is the groupWithMatches some kind of predicate definition?It seems like recursion is probably not the way to go here, but perhaps that depends on language and compiler.

Erik R. 2009-05-20 17:20:33

C#, and yes, groupWithMatches is a LINQ query (basically a predicate definition). I'll revise my answer to explain the algorithm a little more. As for which way to go, I think it depends on the particular collection. This method is probably faster with a long collection where only a few items actually match each other, since it disregards items which don't match each other on each pass (instead of checking each item for uniqueness at every index).

lc 2009-05-20 17:27:42

I had the same idea but I didn't want to give a LINQ answer ... it's C#.

Daniel Brückner 2009-05-20 17:28:02

It's C#, and I like this approach a lot.

mquander 2009-05-20 17:28:48

very interesting solution...

Demi 2009-05-20 17:33:05

This is very academically interesting to me, but ultimately useless as I am programming in java and do not have these LINQ beasts to send into battle.

Erik R. 2009-05-20 18:07:18

Ok, I'll try to cook up a non-LINQ version then...

lc 2009-05-20 18:28:24

@Erik R.: A summary of the algorithm is: Start at index 0. For index N, separate all the strings into buckets based on their Nth character. If no bucket has multiple strings, then return N. Else, take each bucket with multiple strings individually and apply the function to that bucket on index N+1; return the maximum of all the results thereby obtained.

mquander 2009-05-20 18:45:22

@mquander: Thanks for that :)

lc 2009-05-20 18:47:17

Thanks, mquander. How is this any better than my instinct to use a set (java.util.HashSet in my case) to hold the results at each N length? It seems pretty similar.

Erik R. 2009-05-20 19:35:11

I think that if there's an performance advantage to this version, it's probably less about the lack of a set and more about the lack of the "substring" operation. Taking substrings is pretty slow compared to keeping track of an index and "ignoring" the earlier parts of the string. My instinct is that this solution will be fairly well faster thanks to that.

mquander 2009-05-20 20:24:11

That said, if you have to allocate intermediary state for "bucket" objects, which are fulfilling the same purpose that your hashtable does, this solution may be more awkward for you to implement. It's to LINQ's credit that it provides a way to process this in a purely functional manner without explicitly constructing in-between collections for the groups of strings.

mquander 2009-05-20 20:28:02

Answer 4

+1 A:

You should be able to do this without sorting, and with only looking at each character in each string once in the worst case.

here is a ruby script that puts the index to the console:

mystrings = ["apple", "banana", "cucumber", "banking"]
minlength = getMinLengthString(mystrings) #not defined here

char_set = {}

(0..minlength).each do |char_index|
  char_set[mystrings[0][char_index].chr] = 1
  (1..mystrings.length).each do |string_index|
    comparing_char = mystrings[string_index][char_index].chr
    break if char_set[comparing_char]
    if string_index == (mystrings.length - 1) then
      puts string_index
      exit
    else
      char_set[comparing_char] = 1
    end     
  end
  char_set.clear
end
puts minlength

the result is 3.

Here's the same general snippet in C#, if it is more legible for you:

string[] mystrings = { "apple", "banana", "cucumber", "banking" };

//defined elsewhere...
int minlength = GetMinStringLengthFromStringArray(mystrings);

Dictionary<char, int> charSet = new Dictionary<char, int>();

for (int char_index = 0; char_index < minlength; char_index++)
{
    charSet.Add(mystrings[0][char_index], 1);

    for (int string_index = 1; string_index < mystrings.Length; string_index++)
    {
        char comparing_char = mystrings[string_index][char_index];

        if (charSet.ContainsKey(comparing_char))
        {
             break;
        }
        else
        {
             if (string_index == mystrings.Length - 1)
             {
                  Console.Out.WriteLine("Index is: " + string_index.ToString());
                  return;
             }
             else
             {
                  charSet.Add(comparing_char, 1);
             }
        }
    }

    charSet.Clear();
}
Console.Out.WriteLine("Index is: " + minlength.ToString());

Demi 2009-05-20 17:14:18

Answer 5

+1 A:

Agree w/ gs, use of set is appropriate. Your p-code translated to python, lightly tested:

minlen = min( len( x ) for x in strings )
myset = set()
for i in range( minlen ):
    for s in strings:
        sub = s[:i+1]
        if sub in myset:
            break
        myset.add( sub )
    if len( myset ) == len( strings ):
        print i
        break
    myset.clear()

With each iteration through strings, you need to check for existence of a value against all previously encountered values. That says hash- or set-type structure to me.

John Pirie 2009-05-20 19:23:45

Answer 6

A:

Here's my solution in Python:

words = ["apple", "banana", "cucumber", "banking"]

for i in range(len(min(words))):
    d = defaultdict(int)
    for word in words:
        d[word[i]] += 1
    if max(d.values()) == 1:
        return i

I didn't write in anything to handle the case where no minimum index is found by the time you reach the end of the shortest word, but I'm sure you get the idea.

Kamil Kisiel 2009-05-20 21:12:11

Answer 7

+2 A:

Jason S 2009-05-20 21:31:01

ansaurus

tags:

views:

answers:

Algorithm to find first index where strings are different?

related questions