views:

800

answers:

8

This came up in a real-world situation, and I thought I would share it, as it could lead to some interesting solutions. Essentially, the algorithm needs to diff two lists, but let me give you a more rigorous definition of the problem.

Mathematical Formulation

Suppose you have two lists, L and R each of which contain elements from some underlying alphabet S. Moreover, these lists have the property that the common elements that they have appear in order: that is to say, if L[i] = R[i*] and L[j] = R[j*], and i < j then i* < j*. The lists need not have any common elements at all, and one or both may be empty. [Clarification: You may assume no repetitions of elements.]

The problem is to produce a sort of "diff" of the lists, which may be viewed as new list of ordered pairs (x,y) where x is from L and y is from R, with the following properties:

  1. If x appears in both lists, then (x,x) appears in the result.
  2. If x appears in L, but not in R, then (x,NULL) appears in the result.
  3. If y appears in R, but not in L, then (NULL,y) appears in the result.

and finally

  • The result list has "the same" ordering as each of the input lists: it shares, roughly speaking, the same ordering property as above with each of the lists individually (see example).

Examples

L = (d)
R = (a,b,c)
Result = ((NULL,d), (a,NULL), (b,NULL), (c,NULL))

L = (a,b,c,d,e)  
R = (b,q,c,d,g,e)
Result = ((a,NULL), (b,b), (NULL,q), (c,c), (d,d), (NULL,g), (e,e))

Does anyone have any good algorithms to solve this? What is the complexity?

+1  A: 

Diffing ordered lists can be done in linear time by traversing both lists and matching as you go. I will try to post some psuedo Java code in an update.

Since we don't know the ordering algorithm and can't determine any ordering based on less than or greater than operators, we must consider the lists unordered. Also, given how the results are to be formatted you are faced with scanning both lists (at least until you find a match and then you can bookmark and start from there again). It will still be O(n^2) performance, or yes more specifically O(nm).

Mike Pone
That requires you to know the ordering algorithm. The second example (b,q,c,d,g,e) has a mystery ordering. (note the "q" and "g")
Jason S
Yes, please note that the letters just stand for arbitrary elements.
Jake
OK, so build custom less than, and greater than operators that take the mystery ordering into consideration. If we have an ordered list I must assume that we know how to order it or else we cannot consider it ordered.
Mike Pone
@Mike I just meant "ordered" in the mathematical sense: consider the list a n-tuple. I do not know anything about any partial or total order imposed on the underlying alphabet `S`.
Jake
@Mike: Since you realised that you can build your own "custom" less-than, what makes you think that this still takes O(nm) time? Why can't you sort both lists according to your "custom" less-than, merge them in O(n) time, then reorder the output list according to the mystery ordering of either L or R with an O(log n+m) sort?
j_random_hacker
A: 

This is a pretty simple problem since you already have an ordered list.

//this is very rough pseudocode
stack aList;
stack bList;
List resultList;
char aVal;
char bVal;

while(aList.Count > 0 || bList.Count > 0)
{
  aVal = aList.Peek; //grab the top item in A
  bVal = bList.Peek; //grab the top item in B

  if(aVal < bVal || bVal == null)
  {
     resultList.Add(new Tuple(aList.Pop(), null)));
  }
  if(bVal < aVal || aVal == null)
  {
     resultList.Add(new Tuple(null, bList.Pop()));
  }
  else //equal
  {
     resultList.Add(new Tuple(aList.Pop(), bList.Pop()));
  }
}

Note... this code WILL NOT compile. It is just meant as a guide.

EDIT Based on the OP comments

If the ordering algorithm is not exposed, then the lists must be considered unordered. If the lists are unordered, then the algorithm has a time complexity of O(n^2), specifically O(nm) where n and m are the number of items in each list.

EDIT Algorithm to solve this

L(a,b,c,d,e) R(b,q,c,d,g,e)

//pseudo code... will not compile
//Note, this modifies aList and bList, so make copies.
List aList;
List bList;
List resultList;
var aVal;
var bVal;

while(aList.Count > 0)
{
   aVal = aList.Pop();
   for(int bIndex = 0; bIndex < bList.Count; bIndex++)
   {
      bVal = bList.Peek();
      if(aVal.RelevantlyEquivalentTo(bVal)
      {
         //The bList items that come BEFORE the match, are definetly not in aList
         for(int tempIndex = 0; tempIndex < bIndex; tempIndex++)
         {
             resultList.Add(new Tuple(null, bList.Pop()));
         }
         //This 'popped' item is the same as bVal right now
         resultList.Add(new Tuple(aVal, bList.Pop()));

         //Set aVal to null so it doesn't get added to resultList again
         aVal = null;

         //Break because it's guaranteed not to be in the rest of the list
         break;
      }
   }
   //No Matches
   if(aVal != null)
   {
      resultList.Add(new Tuple(aVal, null));
   }
}
//aList is now empty, and all the items left in bList need to be added to result set
while(bList.Count > 0)
{
   resultList.Add(new Tuple(null, bList.Pop()));
}

The result set will be

L(a,b,c,d,e) R(b,q,c,d,g,e)

Result ((a,null),(b,b),(null,q),(c,c),(d,d),(null,g),(e,e))

Nope. Same comment as Mike Pone's answer; that requires you to know the ordering algorithm. The second example (b,q,c,d,g,e) has a mystery ordering. (note the "q" and "g")
Jason S
This doesn't work because the elements are not necessarily numbers and may not be compared with < or >.
Jake
you substitute whatever comparison is necessary. On the 'ordering algorithm part' if you don't know HOW the objects are ordered, then they cannot be considered 'ordered lists' from a programming perspective. I.E. given two order lists of "my favorite movies" and "his favorite movies" the algorithm would have to treat them as unordered lists.
@Devin You are correct, I am mixing up math and computer science terminology. I shall fix.
Jake
The trouble, again, comes when you try to use < and >. You may only test elements for equality.
Jake
If you may only test for equality, then it is O(m*n) there are ways to speed up the algorithm so that's the worst case, but the complexity will always be that.
and you can substitute "WouldAppearPriorTo" in place of "<", IF you knew the ordering algorithm
A: 

No real tangible answer, only vague intuition. Because you don't know the ordering algorithm, only that the data is ordered in each list, it sounds vaguely like the algorithms used to "diff" files (e.g. in Beyond Compare) and match sequences of lines together. Or also vaguely similar to regexp algorithms.

There can also be multiple solutions. (never mind, not if there are not repeated elements that are strictly ordered. I was thinking too much along the lines of file comparisons)

Jason S
A: 

I don't think you have enough information. All you've asserted is that elements that match match in the same order, but finding the first matching pair is an O(nm) operation unless you have some other ordering that you can determine.

Yuliy
+3  A: 

The worst case, as defined and using only equality, must be O(n*m). Consider the following two lists:

A[] = {a,b,c,d,e,f,g}

B[] = {h,i,j,k,l,m,n}

Assume there exists exactly one match between those two "ordered" lists. It will take O(n*m) comparisons since there does not exist a comparison which removes the need for other comparisons later.

So, any algorithm you come up with is going to be O(n*m), or worse.

Brian
Is there a way to take an intersection of two lists in less than O(n^2)? If so, we can make this faster.
Jake
No, there is not. If you have two lists with n and m items and the size of the intersection is 1. Then you will need an average of 0.5*n*m comparisons to find the intersection, even if you know in advance the size of the intersection.
Brian
Please see Mark Ransom's post.
Jake
His solution won't work, as it requires a hashing function. The existence of an equality check does not guarantee the existence of a hashing function. Further, it does not guarantee that the hashing function isn't something stupid like "return 1;".
Brian
-1. Sort each list independently, by whatever order you like, in O(log n) time. (This will destroy the "structure" guaranteed by Jake's 2nd paragraph, but that doesn't matter.) Then use an O(n) list merge to find the intersection -- essentially, compare the top two elements, output the lesser appropriately, remove it from the top of its list, rinse, repeat.
j_random_hacker
Addendum: since output is required to be in "the same order" as the input, you'll actually need to buffer the output list and reorder it using the relation defined by L[i] < L[i+1] for all i. An efficient implementation would use an integer array X of size n, setting X[S[i]] = i as the ith symbol is read from L, and an additional O(log n) sorting pass.
j_random_hacker
j_random_hacker: The author specifically said numbers may not be compared with < . I assume > is also disallowed.
Brian
@Brian: Where does he say that? All I can see is that the particular ordering shared by L and R is not known to us. We are not prohibited from inventing our own orderings (which can always be done for any set of values that can be represented on a computer -- e.g. just use its binary representation, interpreted as an integer).
j_random_hacker
j_random_hacker: He says that in response to devinb's answer, as a comment. He implies it in response to Mike Pone's answer. The "sort stuff, then apply matching algorithm" strategy idea is obvious, but has already been rejected. And even if it wasn't, take note that my post explicitely states that I am suggesting a solution using only equality, so whether Jake states it or not is not relevant to my answer's accuracy.
Brian
@Brian: You're right, the OP forbids < in comments to devinb's answer. (OTOH I don't take the response to Mike Pone's answer as forbidding invention of ad hoc ordering systems.) I've dropped the -1 since, as you say, you explicitly assume this up front. But I think the OP is confused -- forbidding the invention of ad-hoc orderings is an absurd restriction, since an ordering can be trivially constructed for *any* representable set of values (even from a countably infinite set) and it's clearly beneficial to do so here.
j_random_hacker
j_random_hacker: An ordering cannot be constructed if you are taking the view of a purist. Jake is treating this as a math problem as much as a programming problem. And if you are working with a compiled class written by someone else that lacks <, adding it is nontrivial...especially if all your class function is allowed to promise (programming by contract is in C++0x, IIRC) is the existence of equality.
Brian
j_random_hacker
+1  A: 

There is a way to do this in O(n), if you're willing to make a copy of one of the lists in a different data structure. This is a classic time/space tradeoff.

Create a hash map of the list R, with the key being the element and the value being the original index into the array; in C++, you could use unordered_map from tr1 or boost.

Keep an index to the unprocessed portion of list R, initialized to the first element.

For each element in list L, check the hash map for a match in list R. If you do not find one, output (L value, NULL). If there is a match, get the corresponding index from the hash map. For each unprocessed element in list R up to the matching index, output (NULL, R value). For the match, output (value, value).

When you have reached the end of list L, go through the remaining elements of list R and output (NULL, R value).

Edit: Here is the solution in Python. To those who say this solution depends on the existence of a good hashing function - of course it does. The original poster may add additional constraints to the question if this is a problem, but I will take an optimistic stance until then.

def FindMatches(listL, listR):
    result=[]
    lookupR={}
    for i in range(0, len(listR)):
        lookupR[listR[i]] = i
    unprocessedR = 0
    for left in listL:
        if left in lookupR:
            for right in listR[unprocessedR:lookupR[left]]:
                result.append((None,right))
            result.append((left,left))
            unprocessedR = lookupR[left] + 1
        else:
            result.append((left,None))
    for right in listR[unprocessedR:]:
        result.append((None,right))
    return result

>>> FindMatches(('d'),('a','b','c'))
[('d', None), (None, 'a'), (None, 'b'), (None, 'c')]
>>> FindMatches(('a','b','c','d','e'),('b','q','c','d','g','e'))
[('a', None), ('b', 'b'), (None, 'q'), ('c', 'c'), ('d', 'd'), (None, 'g'), ('e','e')]
Mark Ransom
The efficient speed of a hashmap is dependent upon the existence of a good hashing function. Jake has not even promised that one exists. If one *does* exist, you could just as easily sort them by their hashcode and do a standard ordered comparsion, though of course that's O(nlogn).
Brian
Now I'm curious - why did adding a working code sample deserve a -1 vote?
Mark Ransom
A: 

This is exactly like sequence alignment, you can use the Needleman-Wunsch algorithm to solve it. The link includes the code in Python. Just make sure you set the scoring so that a mismatch is negative and a match is positive and an alignment with a blank is 0 when maximizing. The algorithm runs in O(n * m) time and space, but the space complexity of this can be improved.

Scoring Function

int score(char x, char y){
 if ((x == ' ') || (y == ' ')){
  return 0;
 }
 else if (x != y){
  return -1;
 }
 else if (x == y){
  return 1;
 }
 else{
  puts("Error!");
  exit(2);
 }
}

Code

#include <stdio.h>
#include <stdbool.h>

int max(int a, int b, int c){
 bool ab, ac, bc;
 ab = (a > b);
 ac = (a > c);
 bc = (b > c);
 if (ab && ac){
  return a;
 }
 if (!ab && bc){
  return b;
 }
 if (!ac && !bc){
  return c;
 }
}

int score(char x, char y){
 if ((x == ' ') || (y == ' ')){
  return 0;
 }
 else if (x != y){
  return -1;
 }
 else if (x == y){
  return 1;
 }
 else{
  puts("Error!");
  exit(2);
 }
}


void print_table(int **table, char str1[], char str2[]){
 unsigned int i, j, len1, len2;
 len1 = strlen(str1) + 1;
 len2 = strlen(str2) + 1;
 for (j = 0; j < len2; j++){
  if (j != 0){
   printf("%3c", str2[j - 1]);
  }
  else{
   printf("%3c%3c", ' ', ' ');
  }
 }
 putchar('\n');
 for (i = 0; i < len1; i++){
  if (i != 0){
   printf("%3c", str1[i - 1]);
  }
  else{
   printf("%3c", ' ');
  }
  for (j = 0; j < len2; j++){
   printf("%3d", table[i][j]);
  }
  putchar('\n');
 }
}

int **optimal_global_alignment_table(char str1[], char str2[]){
 unsigned int len1, len2, i, j;
 int **table;
 len1 = strlen(str1) + 1;
 len2 = strlen(str2) + 1;
 table = malloc(sizeof(int*) * len1);
 for (i = 0; i < len1; i++){
  table[i] = calloc(len2, sizeof(int));
 }
 for (i = 0; i < len1; i++){
  table[i][0] += i * score(str1[i], ' ');
 }
 for (j = 0; j < len1; j++){
  table[0][j] += j * score(str1[j], ' ');
 }
 for (i = 1; i < len1; i++){
  for (j = 1; j < len2; j++){
   table[i][j] = max(
    table[i - 1][j - 1] + score(str1[i - 1], str2[j - 1]),
    table[i - 1][j] + score(str1[i - 1], ' '),
    table[i][j - 1] + score(' ', str2[j - 1])
   );
  }
 }
 return table;
}

void prefix_char(char ch, char str[]){
 int i;
 for (i = strlen(str); i >= 0; i--){
  str[i+1] = str[i];
 } 
 str[0] = ch;
}

void optimal_global_alignment(int **table, char str1[], char str2[]){
 unsigned int i, j;
 char *align1, *align2;
 i = strlen(str1);
 j = strlen(str2);
 align1 = malloc(sizeof(char) * (i * j));
 align2 = malloc(sizeof(char) * (i * j));
 align1[0] = align2[0] = '\0';
 while((i > 0) && (j > 0)){
  if (table[i][j] == (table[i - 1][j - 1] + score(str1[i - 1], str2[j - 1]))){
   prefix_char(str1[i - 1], align1);
   prefix_char(str2[j - 1], align2);
   i--;
   j--;
  }
  else if (table[i][j] == (table[i - 1][j] + score(str1[i-1], ' '))){
   prefix_char(str1[i - 1], align1);
   prefix_char('_', align2);
   i--;
  }
  else if (table[i][j] == (table[i][j - 1] + score(' ', str2[j - 1]))){
   prefix_char('_', align1);
   prefix_char(str2[j - 1], align2);
   j--;
  }
 }
 while (i > 0){
  prefix_char(str1[i - 1], align1);
  prefix_char('_', align2);
  i--;
 }
 while(j > 0){
  prefix_char('_', align1);
  prefix_char(str2[j - 1], align2);
  j--;
 }
 puts(align1);
 puts(align2);
}

int main(int argc, char * argv[]){
 int **table;
 if (argc == 3){
  table = optimal_global_alignment_table(argv[1], argv[2]);
  print_table(table, argv[1], argv[2]);
  optimal_global_alignment(table, argv[1], argv[2]);
 }
 else{
  puts("Reqires to string arguments!");
 }
 return 0;
}

Sample IO

$ cc dynamic_programming.c && ./a.out aab bba
__aab
bb_a_
$ cc dynamic_programming.c && ./a.out d abc
___d
abc_
$ cc dynamic_programming.c && ./a.out abcde bqcdge
ab_cd_e
_bqcdge
Nixuz
A: 

SELECT distinct l.element, r.element
FROM LeftList l
OUTER JOIN RightList r
ON l.element = r.element
ORDER BY l.id, r.id

Assumes the ID of each element is its ordering. And of course, that your lists are contained in a Relational Database :)

Josh Smeaton