I have sorted lists of inputs:

let x = [2; 4; 6; 8; 8; 10; 12]
let y = [-8; -7; 2; 2; 3; 4; 4; 8; 8; 8;]

I want to write a function which behaves similarly to an SQL INNER JOIN. In other words, I want to return the cartesian product of x and y containing only the items shared by both lists:

join(x, y) = [2; 2; 4; 4; 8; 8; 8; 8; 8; 8]

I've written a naive version as follows:

let join x y =
    [for x' in x do
        for y' in y do
            yield (x', y')]
    |> List.choose (fun (x, y) -> if x = y then Some x else None)

It works, but this runs in O(x.length * y.length). Since both my lists are sorted, I think it's possible to get the results I want in O(min(x.length, y.length)).

How can I find common elements in two sorted lists in linear time?

+7  A: 

I can't help you with the F#, but the basic idea is to use two indices, one for each list. Choose the item in each list at the current index for that list. If the two items are the same value, then add that value to your result set and increment both indices. If the items have different values, increment just the index for the list containing the lesser of the two values. Repeat the comparison until one of your lists is empty and then return the result set.
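A quick Python sketch of the two-index walk described above (my illustration, not code from this answer; `intersect` is a made-up name). Note that, per the comments, it records each match once and advances both indices, so it does not yet produce the join-style repetition the question asks for:

```python
def intersect(xs, ys):
    """Walk two sorted lists with one index each; collect equal values."""
    i, j, result = 0, 0, []
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            # same value: add it to the result and advance both indices
            result.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1  # advance past the lesser value
        else:
            j += 1
    return result

print(intersect([2, 4, 6, 8, 8, 10, 12], [-8, -7, 2, 2, 3, 4, 4, 8, 8, 8]))
# [2, 4, 8, 8]
```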

tvanfosson
I think this approach might need some modification considering he wants a cartesian-product-style repetition of elements in the result set.
mquander
Easily modified though, whenever you find a match, find out the range of the matches in both lists, and produce the cartesian result, before moving on past the matches.
Lasse V. Karlsen
@mquander: she, not he ;)
Juliet
I was thinking sets, but you could adjust for lists. Simply keep a placeholder for an element where you find a match, go forward on one list as long as you keep finding matches, then back up and increment the other list forward one. Repeat this until all matches are consumed.
tvanfosson
...of course, in the worst case this is still O(n*m), but I don't think that can be helped.
tvanfosson
Note that indexing for F# lists is itself an O(n) operation - since they're singly linked lists.
Pavel Minaev
The data example that he gives is *NOT* a Cartesian Product (which is the equivalent of a SQL CROSS JOIN), rather, it is a SQL INNER JOIN, which is like an Intersection, except that it is still a bijection (like a Cartesian Product).
RBarryYoung
To restate: he *calls* it a Cartesian product, but if you look at the data example, it is NOT a Cartesian product.
RBarryYoung
*sigh*, my bad, I miscounted the matches in the inputs and outputs...
RBarryYoung
@tvanfosson - Assuming O(1) indexing, I don't see how your original algorithm is ever as bad as O(nm). You always increment at least one index. Since the algorithm ends once either index goes out of bounds, you can never take more than O(n+m) steps. Also, in your second (comment) algorithm, I believe you never need to backtrack, since the lists are sorted?
Joren
The original isn't. I was commenting on the alteration to produce the cartesian product, which has backtracking and, in the case where all elements in each list are the same, is O(n*m).
tvanfosson
It's a List. Lists have lousy indexed access performance, so any solution using indices will be truly awful.
Daniel
@tvanfosson: As for O(nm), that's a lower bound for this problem -- look up my answer as to why.
Daniel
I misunderstood the question before. You're right, O(nm) is definitely a lower bound on the worst case. On the other hand, in the best case list A contains only x and list B only y != x. Since the lists are sorted, you can conclude A contains only x from seeing that the first element is x and that the last element is x. Same for B, but with y. From there you easily conclude that since x != y the solution is the empty set. You did only 4 list accesses: best case is O(1).
Joren
No. I don't think O(n*m) is a lower bound on the algorithm as I described it in my comments. In the worst case -- x == y -- yes, it is O(n*m), but that's the biggest, not the smallest, number of elements in the product. Taking advantage of the fact that the lists are sorted gives you better performance in virtually every other case, as you don't have to visit elements that you know are too small.
tvanfosson
@tvanfosson: the big-O of an algorithm is the worst case by definition.
Daniel
@Joren: there's no "best case is O(1)". When I say the lower bound of the algorithm is O(n*m), this implies that no algorithm can solve the problem with a Big-O better than O(n*m). When I say a particular algorithm is O(n*m), I'm saying the algorithm never performs worse than n*m. The best-case is the Big Omega.
Daniel
@Daniel -- your usage of "lower bound" is a little strange to me. Normally, O(g(n)) means that g(n) provides an upper bound on the algorithm's running time f(n); that is, f(n) <= g(n) as n tends toward infinity. I assumed that when you said O(n*m) was a lower bound, you were saying the algorithm is bounded below by n*m, i.e., the algorithm is θ(n*m). What it seems like you are saying is that n*m is a lower bound on g(n) (which seems redundant). Is that correct?
tvanfosson
I, too, was somewhat sloppy with my semantics -- I should have said "I don't think n*m is lower bound on the algorithm" instead of "I don't think O(n*m) is a lower bound on the algorithm." The algorithm is bounded from below by g(n,m) = n+m and from above by g(n,m) = n*m.
tvanfosson
+6  A: 

O(min(n,m)) time is impossible: Take two lists [x;x;...;x;y] and [x;x;...;x;z]. You have to browse both lists till the end to compare y and z.

Even O(n+m) is impossible. Take [1;1;...;1] (n times) and [1;1;...;1] (m times). Then the resulting list must have n*m elements, so you need at least O(n m) (more precisely, Omega(n m)) time to create such a list.
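That second point can be checked directly. A small Python sketch (mine, not part of the original answer) with two all-equal lists; merely writing out the result already costs n*m steps:

```python
n, m = 5, 7
xs, ys = [1] * n, [1] * m

# every pair (x, y) matches, so the join must contain all n * m values
result = [x for x in xs for y in ys if x == y]
assert len(result) == n * m  # output size alone forces Omega(n*m) work
```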

Without the cartesian product (a simple merge), this is quite easy. OCaml code (I don't know F#, but it should be reasonably close; compiled but not tested):

let rec merge a b = match (a,b) with
   ([], xs) -> xs
|  (xs, []) -> xs
|  (x::xs, y::ys) -> if x <= y then x::(merge xs (y::ys))
                else y::(merge (x::xs) ys);;

(Edit: I was too late)

So your O(n m) code is the best possible in the worst case. However, IIUIC it always performs n*m operations, which is not optimal.

My approach would be

1) write a function

group : 'a list -> ('a * int) list

that counts the number of same elements:

group [1;1;1;1;1;2;2;3] == [(1,5);(2,2);(3,1)]

2) use it to merge both lists using similar code as before (there you can multiply those coefficients)

3) write a function

ungroup : ('a * int) list -> 'a list

and compose those three.

This has complexity O(n+m+x) where x is the length of resulting list. This is the best possible up to constant.

Edit: Here you go:

let group x =
  let rec group2 l m =
    match l with
    | [] -> []
    | a1::a2::r when a1 == a2 -> group2 (a2::r) (m+1)
    | x::r -> (x, m+1)::(group2 r 0)
  in group2 x 0;;

let rec merge a b = match (a,b) with
   ([], xs) -> []
|  (xs, []) -> []
|  ((x, xm)::xs, (y, ym)::ys) -> if x == y then (x, xm*ym)::(merge xs ys)
                           else  if x <  y then merge xs ((y, ym)::ys)
                                           else merge ((x, xm)::xs) ys;;

let rec ungroup a =
  match a with
    [] -> []
  | (x, 0)::l -> ungroup l
  | (x, m)::l -> x::(ungroup ((x,m-1)::l));;

let crossjoin x y = ungroup (merge (group x) (group y));;



# crossjoin [2; 4; 6; 8; 8; 10; 12] [-8; -7; 2; 2; 3; 4; 4; 8; 8; 8;];;
- : int list = [2; 2; 4; 4; 8; 8; 8; 8; 8; 8]
sdcvvc
I cannot say about F#, but it is certainly not impossible in general.
RBarryYoung
Could you elaborate? You *are* aware these are lists, not arrays?
sdcvvc
Consider two lists containing identical elements, of length *n* and *m*, say. Then the inner join must return *nm* elements. How, then, can the worst case complexity be better than *O(nm)*?
Stephan202
The question is worded wrongly. If you look at the expected output, this isn't join at all. It's an intersection of two lists (as in, no discarding of duplicate elements).
Pavel Minaev
Sorry, the above comment is wrong - this is indeed a cartesian product (in disguise), so your explanation as to why it is O(m*n) in worst case is absolutely correct.
Pavel Minaev
Ah rats, I miscounted the 8's in the example output... You're right Pavel, that makes it a standard INNER JOIN that cross-products on values with multiple matches.
RBarryYoung
+1, +answer: I like your explanation, and the code is very readable :)
Juliet
+1  A: 

I don't know F#, but I can provide a functional Haskell implementation, based on the algorithm outlined by tvanfosson (further specified by Lasse V. Karlsen).

import Data.List

join :: (Ord a) => [a] -> [a] -> [a]
join l r = gjoin (group l) (group r)
  where
    gjoin [] _ = []
    gjoin _ [] = []
    gjoin l@(lh@(x:_):xs) r@(rh@(y:_):ys)
      | x == y    = replicate (length lh * length rh) x ++ gjoin xs ys
      | x < y     = gjoin xs r
      | otherwise = gjoin l ys

main :: IO ()
main = print $ join [2, 4, 6, 8, 8, 10, 12] [-8, -7, 2, 2, 3, 4, 4, 8, 8, 8]

This prints [2,2,4,4,8,8,8,8,8,8]. In case you're not familiar with Haskell, some references to the documentation:

Stephan202
This isn't tail recursive though, and `++` is itself O(n).
Pavel Minaev
The output of the algorithm is fundamentally *O(nm)*. Since `++` is used to concatenate the pieces which make up the output, its *O(n)* complexity does not increase the overall complexity. Furthermore, can you explain why you think this code is not tail recursive? It looks fine to me.
Stephan202
`gjoin` calls itself recursively in a non-tail-call position (right side of `++`).
Pavel Minaev
In the case `x == y`, the value of `gjoin xs ys` is not evaluated until the left-hand argument of `++` is completely evaluated. So where's the issue?
Stephan202
A: 

The problem with what he wants is that it obviously has to re-traverse the list.

In order to get 8,8,8 to show up twice, the function has to loop through the second list a bit. The worst-case scenario (two identical lists) will still yield O(x * y).

As a note, this is not utilizing external functions that loop on their own.

for (int i = 0; i < shorterList.Length; i++)
{
    if (shorterList[i] > longerList[longerList.Length - 1])
        break;
    // j must restart from 0: matches for shorterList[i] can sit at
    // indices in longerList smaller than i
    for (int j = 0; j < longerList.Length && longerList[j] <= shorterList[i]; j++)
    {
        if (shorterList[i] == longerList[j])
            retList.Add(shorterList[i]);
    }
}
JustLoren
No, it doesn't "obviously" have to re-traverse the list. You can just count the number of equal elements in both lists once you run into the first matching pair.
Pavel Minaev
Which means that you have to traverse the second list in order to find all the things that match with your currently selected element in the first list. Once you have all the matches, you move to the second element in the first list, and traverse a portion of the 2nd list again.
JustLoren
You don't have to traverse the _entire_ second list - they are sorted. Well, okay, you will have to traverse the entire list if the input is two identical lists of all equal values, like `[1;1;...1]` - but then you still end up traversing both lists to the end once (to calculate the number of 1s), since after you do that, the remaining of both is an empty list. So it will still be O(min(m,n)).
Pavel Minaev
... er, sorry, O(max(m,n)).
Pavel Minaev
Hrm, so you're saying count the matches then do math? if 8 shows up in the first list twice, then the second list three times, do 2 * 3 and add the 8 to the list six times?
JustLoren
Pretty much. Note that I actually misunderstood the question, as it stands - sdcvvc got it right in his answer above: you still need O(m*n) because you end up with m*n pairs in the worst case regardless of everything else, so just creating them will be O(m*n).
Pavel Minaev
@JustLoren: only minor parts of the list have to be ever retraversed.
Daniel
+2  A: 

The following is also tail-recursive (so far as I can tell), but the output list is consequently reversed:

let rec merge xs ys acc =
    match (xs, ys) with
    | ((x :: xt), (y :: yt)) ->
        if x = y then
            let rec count_and_remove_leading zs acc =
                match zs with
                | z :: zt when z = x -> count_and_remove_leading zt (acc + 1)
                | _ -> (acc, zs)
            let rec replicate_and_prepend zs n =
                if n = 0 then
                    zs
                else
                    replicate_and_prepend (x :: zs) (n - 1)
            let xn, xt = count_and_remove_leading xs 0
            let yn, yt = count_and_remove_leading ys 0
            merge xt yt (replicate_and_prepend acc (xn * yn))
        else if x < y then
            merge xt ys acc
        else
            merge xs yt acc
    | _ -> acc

let xs = [2; 4; 6; 8; 8; 10; 12]
let ys = [-8; -7; 2; 2; 3; 4; 4; 8; 8; 8;]
printf "%A" (merge xs ys [])

Output:

[8; 8; 8; 8; 8; 8; 4; 4; 2; 2]

Note that, as sdcvvc says in his answer, this is still O(x.length * y.length) in worst case, simply because the edge case of two lists of repeating identical elements would require the creation of x.length * y.length values in the output list, which is by itself inherently an O(m*n) operation.

Pavel Minaev
@Pavel: beautiful code, but the result is wrong. I'm not looking for an intersection, I'm looking for a cartesian product, but constraining the output such that x' = y' for each (x', y') in x*y. Note that 8 appears twice in the first list and three times in the second, so it should appear 2*3=6 times in the output.
Juliet
Nevermind... and check the updated version.
Pavel Minaev
A: 

I think this is O(n) for the intersect/join step, though the full thing traverses each list twice:

// list unique elements and their multiplicity (also reverses sorting)
// e.g. pack y = [(8, 3); (4, 2); (3, 1); (2, 2); (-8, 1); (-7, 1)]
// we assume xs is ordered
let pack xs = Seq.fold (fun acc x ->
    match acc with
    | (y,ny) :: tl -> if y=x then (x,ny+1) :: tl else (x,1) :: acc
    | [] -> [(x,1)]) [] xs

let unpack px = [ for (x,nx) in px do for i in 1 .. nx do yield x ]

// for lists of (x,nx) and (y,ny), returns list of (x,nx*ny) when x=y
// assumes inputs are sorted descending (from pack function)
// and returns results sorted ascending
let intersect_mult xs ys =
    let rec aux rx ry acc =
        match (rx,ry) with
        | (x,nx)::xtl, (y,ny)::ytl -> 
            if x = y then aux xtl ytl ((x,nx*ny) :: acc)
            elif x < y then aux rx ytl acc
            else aux xtl ry acc
        | _,_ -> acc
    aux xs ys []

let inner_join x y = intersect_mult (pack x) (pack y) |> unpack

Now we test it on your sample data

let x = [2; 4; 6; 8; 8; 10; 12]
let y = [-8; -7; 2; 2; 3; 4; 4; 8; 8; 8;]

> inner_join x y;;
val it : int list = [2; 2; 4; 4; 8; 8; 8; 8; 8; 8]

EDIT: I just realized this is the same idea as the earlier answer by sdcvvc (after the edit).

Gabriel
+1  A: 

I think it can be done simply by using hash tables. The hash tables store the frequencies of the elements in each list. These are then used to create a list where the frequency of each element e is the frequency of e in X multiplied by the frequency of e in Y. This has a complexity of O(n+m).

(EDIT: Just noticed that this can be worst case O(n^2), after reading comments on other posts. Something very much like this has already been posted. Sorry for the duplicate. I'm keeping the post in case the code helps.)

I don't know F#, so I'm attaching Python code. I'm hoping the code is readable enough to be converted to F# easily.

def join(x, y):
    x_count = dict()
    y_count = dict()

    for elem in x:
        x_count[elem] = x_count.get(elem, 0) + 1
    for elem in y:
        y_count[elem] = y_count.get(elem, 0) + 1

    answer = []
    for elem in x_count:
        if elem in y_count:
            answer.extend([elem] * (x_count[elem] * y_count[elem]))
    return answer

A=[2, 4, 6, 8, 8, 10, 12]
B=[-8, -7, 2, 2, 3, 4, 4, 8, 8, 8]
print join(A,B)
MAK
A: 

You can't get O(min(x.length, y.length)), because the output may be bigger than that. Suppose all elements of x and y are equal, for instance. Then the output size is the product of the sizes of x and y, which gives a lower bound on the efficiency of the algorithm.

Here's the algorithm in F#. It is not tail-recursive, which can easily be fixed. The trick is the mutual recursion. Also note that I may invert the order of the lists given to prod to avoid unnecessary work.

let rec prod xs ys = 
    match xs with
    | [] -> []
    | z :: zs -> reps xs ys ys
and reps xs ys zs =
    match zs with
    | [] -> []
    | w :: ws -> if  xs.Head = w then w :: reps xs ys ws
                 else if xs.Head > w then reps xs ys ws
                 else match ys with
                      | [] -> []
                      | y :: yss -> if y < xs.Head then prod ys xs.Tail else prod xs.Tail ys

The original algorithm in Scala:

def prod(x: List[Int], y: List[Int]): List[Int] = x match {
  case Nil => Nil
  case z :: zs => reps(x, y, y)
}

def reps(x: List[Int], y: List[Int], z: List[Int]): List[Int] = z match {
  case w :: ws if x.head == w => w :: reps(x, y, ws)
  case w :: ws if x.head > w => reps(x, y, ws)
  case _ => y match {
    case Nil => Nil
    case y1 :: ys if y1 < x.head => prod(y, x.tail)
    case _ => prod(x.tail, y)
  }
}
Daniel