views:

566

answers:

7

I've got a collection of strings, and I need to know the first index where they all differ. I can think of two ways to do this: (the following pseudo code is just off the top of my head and may be heavily bug-laden)

First Way:

var minLength = [go through all strings finding min length];
var set = new set()
for(i=0;i<minlength;i++)
{
  for(str in strings)
  {
    var substring = str.substring(0,i);
    if(set.contains(substring))
      break; // not all different yet, increment i
    set.add(substring)
  }
  set.clear(); // prepare for next length of substring
}

This strikes me as gross because of the use of a set data structure where it seems like one should not be needed.

Second Way:

var minLength = [go through all strings finding min length];
strings.sort();
for(i=0;i<minlength;i++)
{
  boolean done = true;
  char last = null;
  for(str in strings)
  {
    char c = str[i];
    if(c == last)
    {
      // not all different yet, increment i
      done = false;
      break;
    }
    last = c;
  }
  if(done)
    return i;
}

But it annoys me that I have to run the sort first, because the sorting algorithm, by its very nature, has access to the information that I'm looking for.

Surely there must be a more efficient way than what I have listed above. Eventually I'd like to abstract it out to any type of array, but that will be trivial and it's simpler to think of it as a string problem.

Any help?

**UPDATE: I apparently didn't explain myself very well. If my strings are ["apple", "banana", "cucumber", "banking"], I want the function to return 3, because there were two strings ("banana" and "banking") that matched through index 0, 1, and 2, so 3 is the first index where they are all unique.

As Daniel mentioned below, a better way to state my needs is that: "I want to find index i where calling substring(0,i) on all my strings will result in all unique values."**

+1  A: 

Use the set as you proposed, that's exactly the right thing to do.

Georg
A: 
int i = 0;
while(true)
{
    Set set = new Set();
    for(int j = 0; j < strings.length; j++)
    {
         if(i >= strings[j].length) return i;
         String chr = strings[j].charAt(i);
         if(set.hasElement(chr))
             break;
         else
             set.addElement(chr);
    }
    if(set.size() == strings.length)
        return i;
    i++;
}

Gotta check preconditions first.

EDIT: Using a set now. Changed langauge.

CookieOfFortune
Nice "set.size() == strings.length" to check if you've made it all the way through the strings.So my initial instincts were correct?
Erik R.
I believe that using a set and then checking the number of elements in it to be the easiest option. I'm actually going to edit this again to use a function on the inside.
CookieOfFortune
+3  A: 

This is untested, but here's my attempt. (I may be making it more complicated than I have to, but I think it's a different way to look at it.)

The basic idea is to compile groups of items that match at the first element, then find the max unique index for each group, checking elements at each successive index.

int FirstUniqueIndex<T>(IEnumerable<IEnumerable<T>> myArrayCollection)
{
    //just an overload so you don't have to specify index 0 all the time
    return FirstUniqueIndex(myArrayCollection, 0);
}

int FirstUniqueIndex<T>(IEnumerable<IEnumerable<T>> myArrayCollection, int StartIndex)
{
    /* Group the current collection by the element at StartIndex, and
     * return a collection of these groups. Additionally, we're only interested
     * in the groups with more than one element, so only get those.*/

    var groupsWithMatches = from var item in myArrayCollection //for each item in the collection (called "item")
                            where item.Length > StartIndex //that are long enough
                            group by item[StartIndex] into g //group them by the element at StartIndex, and call the group "g"
                            where g.Skip(1).Any() //only want groups with more than one element
                            select g; //add the group to the collection

    /* Now "groupsWithMatches" is an enumeration of groups of inner matches of
     * your original arrays. Let's process them... */

    if(groupsWithMatches.Any()) 
        //some matches were found - check the next index for each group
        //(get the maximum unique index of all the matched groups)
        return groupsWithMatches.Max(group => FirstUniqueIndex(group, StartIndex + 1));
    else
        //no matches found, all unique at this index
        return StartIndex;
}


And for the non-LINQ version of the above (I'll change it to use a List collection, but any collection will do). I'll even remove the lambda. Again untested, so try not to aim sharp implements in my direction.

int FirstUniqueIndex<T>(List<List<T>> myArrayCollection, int StartIndex)
{
    /* Group the current collection by the element at StartIndex, and
     * return a collection of these groups. Additionally, we're only interested
     * in the groups with more than one element, so only get those.*/

    Dictionary<T, List<List<T>>> groupsWithMatches = new Dictionary<T, List<List<T>>>();

    //group all the items by the element at StartIndex
    foreach(var item in myArrayCollection)
    {
        if(item.Count > StartIndex)
        {
            List<List<T>> group;
            if(!groups.TryGetValue(item[StartIndex], out group))
            {
                //new group, so make it first
                group = new List<List<T>>();
                groups.Add(item[StartIndex], group);
            }

            group.Add(Item);
        }
    }

    /* Now "groups" is an enumeration of groups of inner matches of
     * your original arrays. Let's get the groups with more than one item. */

    List<List<List<T>>> groupsWithMatches = new List<List<List<T>>>(groups.Count);

    foreach(List<List<T> group in groupsWithMatches)
    {
        if(group.Count > 1)
            groupsWithMatches.Add(group);
    }

    if(groupsWithMatches.Count > 0)
    {
        //some matches were found - check the next index for each group
        //(get the maximum unique index of all the matched groups)

        int max = -1;
        foreach(List<List<T>> group in groupsWithMatches)
        {
            int index = FirstUniqueIndex(group, StartIndex + 1);
            max = index > max ? index : max;
        }
        return max;
    }
    else
    {
        //no matches found, all unique at this index
        return StartIndex;
    }
}
lc
Can I ask what language that is? Is the groupWithMatches some kind of predicate definition?It seems like recursion is probably not the way to go here, but perhaps that depends on language and compiler.
Erik R.
C#, and yes, groupWithMatches is a LINQ query (basically a predicate definition). I'll revise my answer to explain the algorithm a little more. As for which way to go, I think it depends on the particular collection. This method is probably faster with a long collection where only a few items actually match each other, since it disregards items which don't match each other on each pass (instead of checking each item for uniqueness at every index).
lc
I had the same idea but I didn't want to give a LINQ answer ... it's C#.
Daniel Brückner
It's C#, and I like this approach a lot.
mquander
very interesting solution...
Demi
This is very academically interesting to me, but ultimately useless as I am programming in java and do not have these LINQ beasts to send into battle.
Erik R.
Ok, I'll try to cook up a non-LINQ version then...
lc
@Erik R.: A summary of the algorithm is: Start at index 0. For index N, separate all the strings into buckets based on their Nth character. If no bucket has multiple strings, then return N. Else, take each bucket with multiple strings individually and apply the function to that bucket on index N+1; return the maximum of all the results thereby obtained.
mquander
@mquander: Thanks for that :)
lc
Thanks, mquander. How is this any better than my instinct to use a set (java.util.HashSet in my case) to hold the results at each N length? It seems pretty similar.
Erik R.
I think that if there's an performance advantage to this version, it's probably less about the lack of a set and more about the lack of the "substring" operation. Taking substrings is pretty slow compared to keeping track of an index and "ignoring" the earlier parts of the string. My instinct is that this solution will be fairly well faster thanks to that.
mquander
That said, if you have to allocate intermediary state for "bucket" objects, which are fulfilling the same purpose that your hashtable does, this solution may be more awkward for you to implement. It's to LINQ's credit that it provides a way to process this in a purely functional manner without explicitly constructing in-between collections for the groups of strings.
mquander
+1  A: 

You should be able to do this without sorting, and with only looking at each character in each string once in the worst case.

here is a ruby script that puts the index to the console:

mystrings = ["apple", "banana", "cucumber", "banking"]
minlength = getMinLengthString(mystrings) #not defined here

char_set = {}

(0..minlength).each do |char_index|
  char_set[mystrings[0][char_index].chr] = 1
  (1..mystrings.length).each do |string_index|
    comparing_char = mystrings[string_index][char_index].chr
    break if char_set[comparing_char]
    if string_index == (mystrings.length - 1) then
      puts string_index
      exit
    else
      char_set[comparing_char] = 1
    end     
  end
  char_set.clear
end
puts minlength

the result is 3.

Here's the same general snippet in C#, if it is more legible for you:

string[] mystrings = { "apple", "banana", "cucumber", "banking" };

//defined elsewhere...
int minlength = GetMinStringLengthFromStringArray(mystrings);

Dictionary<char, int> charSet = new Dictionary<char, int>();

for (int char_index = 0; char_index < minlength; char_index++)
{
    charSet.Add(mystrings[0][char_index], 1);

    for (int string_index = 1; string_index < mystrings.Length; string_index++)
    {
        char comparing_char = mystrings[string_index][char_index];

        if (charSet.ContainsKey(comparing_char))
        {
             break;
        }
        else
        {
             if (string_index == mystrings.Length - 1)
             {
                  Console.Out.WriteLine("Index is: " + string_index.ToString());
                  return;
             }
             else
             {
                  charSet.Add(comparing_char, 1);
             }
        }
    }

    charSet.Clear();
}
Console.Out.WriteLine("Index is: " + minlength.ToString());
Demi
+1  A: 

Agree w/ gs, use of set is appropriate. Your p-code translated to python, lightly tested:

minlen = min( len( x ) for x in strings )
myset = set()
for i in range( minlen ):
    for s in strings:
        sub = s[:i+1]
        if sub in myset:
            break
        myset.add( sub )
    if len( myset ) == len( strings ):
        print i
        break
    myset.clear()

With each iteration through strings, you need to check for existence of a value against all previously encountered values. That says hash- or set-type structure to me.

John Pirie
A: 

Here's my solution in Python:

words = ["apple", "banana", "cucumber", "banking"]

for i in range(len(min(words))):
    d = defaultdict(int)
    for word in words:
        d[word[i]] += 1
    if max(d.values()) == 1:
        return i

I didn't write in anything to handle the case where no minimum index is found by the time you reach the end of the shortest word, but I'm sure you get the idea.

Kamil Kisiel
+2  A: 
Jason S