tags:

views:

195

answers:

10

I have a number of string arrays. The string in every array are ordered the same way, according the same criteria. However, some string may be missing from some arrays, and there may be no array that has a complete set of strings. Moreover, the criteria used to compare the strings is not available to me: outside of the context of the array, I cannot tell which string should precede another.

I need a way to produce a complete set of strings, properly ordered. Or fail when the arrays do not have enough information for me to do so.

Is anyone familiar with this kind of problem? What is the proper algorithm?

Examples:

A B D
A C D

Can't order correctly, can't decide the order of B and C

A B D
A B C
A C D

This has enough information to order ABCD correctly.

A: 

How many sets will you have? Something that might work if they aren't too sparse is:

  1. Iterate over the first element of all arrays counting how common each string is.
  2. Whichever element is the most common is selected.
  3. Check each element of all arrays for said string.
  4. If it exists in any element other than the first of an array, the first element of its parent array is selected and goto step 3.
  5. Remove the selected element from the first element of all arrays/trees and add it to the end of your output list.
  6. Goto 1 until all elements are removed.

If you dumped the arrays into b-trees the element search could be pretty fast. You may also be able to further speed up step 3 by playing with the order of the arrays. I'm still thinking about that.

If the arrays are very sparse the only thing I can think of is to use some kind of supervised machine learning to try to determine the proper method for ordering.

Rick Minerich
A: 

How does this differ from a classic diff/merge algorithm? As long as both arrays are "sorted", regardless of the mechanism that they are sorted. They are sorted compared to each other, so you should be able to use the similarities to merge the differences. After that, it's basically arbitrary.

If you want to merge [A] with [B, C, D], then you don't have any idea where A will go (front? end? Between C and D? You don't know). In those cases you'll just have to come up with a convention (always in front, always at end, something like that).

If you want to merge [Q, X, Y, A, B] with [Q, X, W, B, D], You'll need to decide if the "W" comes in front of or behind the "Y, A". So, is [Q, X, Y, A, W, B, D] correct or is [Q, X, W, Y, A, B, D] correct. That you'll just have to make a call on and roll with it, frankly. Simply not enough information.

Will Hartung
A: 

Partial solution:

You need to determine an ordering. You could do that by asking you data to "vote" on how many times one symbol is preceding another. You will need a square matrix with the size equal to the number of symbols. Initialize it to all zeros, then scan all your input strings, adding 1 to M(a,b) for each successive pair (a,b). Then, you need to clean out this data, so you get an ordering that is transitive (if A comes before B and be comes before C, A must also come before C). For this, you need to iterate through all triplets of distinct symbols, and "harmonize" the respective triplets of inequalities, so they respect the transitivity.

If your data is not too noisy, this should take only one pass. If the data is too noisy, you might need to run multiple passes until convergence, or to employ some greedy approach: instead of iterating over all symbols in a random order, iterate over all triplets of inequalities in the order of the decreasing number of votes.

florin
+5  A: 

One possible way I can think of, though probably not the most efficient:

Im going to use your example to explain:

A B D
A B C
A C D

create a an array of the unique characters, so you would get (for example):

A B D C

Also you should probably have a enum to describe the possible relationships

enum Type
{
    Unknown = 0,
    Greater = 1,
    Equal = 2,
    Less = 3,
}

now, create a square matrix whose dimensions are the same as the above array, default everything to Type.Unknown.

0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0

Your array will serve as an index into the matrix when you are figuring out the ordering. To see what i mean, look here:

  A B D C
A 0 0 0 0
B 0 0 0 0
D 0 0 0 0
C 0 0 0 0

Go through and make the diagonal as Type.Equal

2 0 0 0
0 2 0 0
0 0 2 0
0 0 0 2

Now you need to populate the matrix, loop through each input array and get each character and the one after it. Use these 2 characters, find the index in the matrix for each (using your array) and update the matrix.

for(int i=0; i<len(arr)-1; i++)
{
    char c1 = arr[i], c2 = arr[i+1];
    int i1,i2;
    //get indices of characters in array and put in i1, and i2 respectively
    matrix[i1][i2] = Type.Less;
    matrix[i2][i1] = Type.Greater
}

you assign 2 locations in the grid everytime, because when one char is less than another, it also means the second char is greater than the first.

Now you would simply insert the chars into an array one at a time (Linked List would be the easiest, you will see why in a second)

Each time you insert a char, you would start at the begining of your return array, and iterate through, looking up the indexes in the first array, and checking the matrix for a Type.Greater or a Type.Less (Depends on which way you are comparing, curr char to array, or array to current char) and insert it if you encounter a value different than what you expected.

If you hit a Type.Unknown in the matix during your insertion, you know you don't have enough info to sort these arrays, if you hit a Type.Equal, you can ignore inserting this char (assuming you don't want duplicates in your result.)

And then you would output your return array

TJMonk15
You described in more detail the first part of my solution, but you did not mention what happens when in half the input strings A is before B and in the other half, B is before A. You need to break the ties somehow, using the other symbols and the need for transitivity.
florin
Well, it's impossible to have BA and AB at the same time. But yes, transitivity needs to be accounted for.
Arkadiy
Arkadiy, these days I'm working with a lot of dirty data, so I assumed your data is dirty too ;)
florin
A: 

The main issue is that, given some data, you need to determine the ordering function. After that it is just list sort.

Produce a list of the first elements of all of your arrays. Remove duplicates. So this:

A B
A C
B C
C C
C D

becomes (A, B, C). We then want a set of pairs that represents every value we know to be less than something else. ((A,B),(A,C),(B,C)) for this example.

Now, for each value we can find at the beginning of an array, take all the arrays that begin with that value as a group. Remove their first elements. That gets us:

B
C

,

C

and

C
D

Recurse on each of these lists of arrays.

Once we've finished recursing on all our subsets, we merge the pairs that are produced into what we came up with initially. We should end up with ((A,B),(A,C),(B,C),(C,D)). Depending on the properties of your values, you could expand this out into ((A,B), (A,C), (A,D), (B,C), (B,D), (C,D)).

Your less than function now just looks to see if the values it is to compare are in this set. If so, true. Otherwise, false.

Jay Kominek
+4  A: 

This sounds like a special case of the topological sorting problem.

Doug McClean
+2  A: 

1) tsort the data. If the partial orderings are inconsistent, then there will be a loop and the tsort will fail. Otherwise you have a sequence s(0) ... s(n-1).

2) For each pair s(i), s(i+1), search in the partial sequences for the same pair appearing adjacent to one another. If you find it in one of them, then keep going for the next i. If you don't find it, then fail, because the partial sequences don't order s(i) and s(i+1).

Why not? Because if they did order them, but they don't appear next to each other, then there must be something "in between them". But if there's something in between them in the partial sequences but not in the full sequence you got out of the tsort, then that interloper must in the full sequence be either before s(i), or after s(i+1). That contradicts the consistency of the full sequence with the partial sequences, which tsort already proved.

Step 2 is O(n*m), where n is the number of elements to order, and m is the total length of all the partial sequences. You might be able to do better, but this is pretty simple because you can grab a tsort off the shelf for step 1, and step 2 is an obvious bunch of nested loops. Note that you can stop searching each partial sequence if you find s(i) (or s(i+1) for that matter), because it sure can't occur again.

Steve Jessop
A: 

Doug McClean wins. The phrase that leapt into my mind was stratification, and this led to dead ends in a web search. (For example, in logic programming, it is useful to speak of stratified programs; these are programs which you can visualize as a stack of layers, each layer containing one or more predicates, and each predicate being extensional (the bottom layer) or following only from predicates in layers beneath it. You ideally want a form of this where every layer has one string at most.)

Your variation of the toposort algorithm would have you build a DAG (you imply that acyclicity is guaranteed), perhaps tracking nodes with zero incoming edges as you go (an easy optimization). If, when done, you have more than one "root" node, you're done, as you know you have no first node. Otherwise, you have exactly one; remove it from the DAG, checking whether each of its child nodes is now a "root" node; and keep going until you get more than one root node (failure) or you run out of nodes (success).

The algorithm I sketched above should be linear in the number of strings (basically, two iterations through them). It's arguably proportional to the square of the number of strings, if each node has many children, implying that each string appears many, many times in your list of lists, each with a different string immediately following it.

Paul Brinkley
A: 

I am tempted to say that in order to have enough information to complete the merge, every pair of x,y that ends up next to each other in the final merged array has to be present somewhere in the input arrays. In other words, transitivity may not come into the picture at all. Can someone produce a counterexample?

You can do it right here in the answer if you so prefer.

Arkadiy
Correct; only way to prove x<y is by seeing x,y in some array, or x<z<y for some z, which fails to meet your condition. However, I doubt this insight gets you much in this case.
Paul Brinkley
Yes, according to the Topological Sorting theory as pointed out by @Doug McClean, for the sorting to be unique I have to have every x,y pair present in the arrays. That should make my life easier.
Arkadiy
I think my answer contains a proof that your temptation is correct (and more concise than my answer). As Paul says, the proof is to consider what happens the partial sequences impose an order on x,y, without listing them together. If so, then they can't be together in the final order either.
Steve Jessop
A: 

How about a simple recursive routine?

Using your second example:

A B D
A B C
A C D

Create a hash table of all the unique orderings:

table = {A -> B, C
         B -> C, D
         C -> D}

Count up all of the unique values

countUnique = 4

Using a stack, calculate all the possible paths. You know you've got a solution when the stack length matches the number of unique strings. If you find more than one solution, then you have a circular reference somewhere. Please excuse my questionable pseudocode.

main()
{
  foreach (s in table.keys)
  {
    stack.push(s)
    searchForPath()
    stack.pop()
  }
}

searchForPath()
{
  if (table.hasKey(stack.peek))
  {
    // we're not at the end yet
    foreach (s in table[stack.peek])
    {
      // avoid infinite recursion
      if (stack.contains(s)
        continue

      stack.push(s)
      searchForPath()
      stack.pop()
    }
  }
  else
  {
    if (stack.size == countUnique)
    {
      // solution found
      captureSolution()
    }
  }
}