views: 1017

answers: 5

Can somebody please demonstrate for me a more efficient Cartesian product algorithm than the one I am using currently (assuming there is one)? I've looked around SO and Googled a bit but can't see anything obvious, so I could be missing something.

foreach (int i in @is) { // "is" is a C# keyword, hence the @ prefix
   foreach (int j in js) {
      // Pair i and j
   }
}

This is a highly simplified version of what I do in my code. The two integers are lookup keys which are used to retrieve one or more objects, and all the objects from the two lookups are paired together into new objects.

This small block of code, in a much larger and more complex system, becomes a major performance bottleneck as the dataset it operates over scales. Some of this could likely be mitigated by improving the data structures used to store the objects and the lookups involved, but the main issue, I feel, is still the computation of the Cartesian product itself.

Edit

So here is some more background on my specific usage of the algorithm, in response to Marc's comment, to see if there are any tricks that I can use. The overall system is a SPARQL query engine which processes SPARQL queries over sets of Graph data. SPARQL is a pattern-based language, so each query consists of a series of patterns which are matched against the Graph(s). In the case where two subsequent patterns have no common variables (they are disjoint), it is necessary to compute the Cartesian product of the solutions produced by the two patterns to get the set of possible solutions for the overall query. There may be any number of patterns, and I may have to compute Cartesian products multiple times, which can lead to an exponential expansion in possible solutions if the query is composed of a series of disjoint patterns.

Judging from the existing answers, I doubt there are any tricks that could apply.

Update

So I thought I would post an update on what I implemented in order to minimise the need to compute Cartesian products and thus optimise the query engine generally. Note that it's not always possible to completely eliminate the need for products, but it's nearly always possible to optimise so that the two sets being joined are much smaller.

Since each BGP (Basic Graph Pattern), which is a set of Triple Patterns, is executed as a block (in essence), the engine is free to reorder patterns within a BGP for optimal performance. For example, consider the following BGP:

?a :someProperty ?b .
?c :anotherProperty ?d .
?b a :Class .

Executed as-is, the query requires a Cartesian product, since the results of the first pattern are disjoint from those of the second, so the result of the first two patterns is the Cartesian product of their individual results. This will contain far more results than we actually need, since the third pattern restricts the possible results of the first pattern, but we don't apply this restriction until afterwards. But if we reorder like so:

?b a :Class .
?a :someProperty ?b .
?c :anotherProperty ?d .

We'll still need a Cartesian product to get the final results, since the 2nd and 3rd patterns are still disjoint, but by reordering we restrict the size of the results of the 2nd pattern, meaning the size of our Cartesian product will be much smaller.

There are various other optimisations we make, but I'm not going to post them here as it starts to get into a fairly detailed discussion of SPARQL engine internals. If anyone is interested in further details, just leave a comment or send me a tweet @RobVesse.

+17  A: 

The complexity of the Cartesian product is O(n^2); there is no shortcut.

In specific cases, the order in which you iterate over your two axes is important. For example, if your code is visiting every slot in an array, or every pixel in an image, then you should try to visit the slots in their natural memory order. An image is typically laid out in 'scanlines', so the pixels on any given Y are adjacent in memory. Therefore, you should iterate over Y on the outer loop and X on the inner.
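
For illustration, here is a minimal C# sketch of that iteration order, assuming a row-major byte buffer; the buffer, dimensions and method names are invented for the example:

using System;

class ScanlineDemo
{
    // Sums a row-major image buffer; pixel (x, y) lives at index y * width + x.
    static long SumPixels(byte[] pixels, int width, int height)
    {
        long sum = 0;
        // Y on the outer loop, X on the inner: the inner loop walks adjacent
        // memory, so each cache line is fully used before it is evicted.
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                sum += pixels[y * width + x];
        return sum;
    }

    static void Main()
    {
        byte[] image = new byte[640 * 480];
        Console.WriteLine(SumPixels(image, 640, 480));
    }
}

Swapping the loops (X outer, Y inner) computes the same sum but strides through memory by width bytes on every access, which defeats the cache.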

Whether you need a Cartesian product, or whether there is a more efficient algorithm, depends on the problem you are solving.

Will
+1 for encouraging data locality!
Drew Hall
Exactly; the Cartesian product output is O(n^2), which means just writing the output to memory costs O(n^2) operations, hence no algorithm can be faster.
ldog
+6  A: 

You can't really change the performance of a nested loop without some additional knowledge, and that would be use-specific. If you have n items in is and m items in js, it is always going to be O(n*m).

You can change the look of it though:

var qry = from i in @is   // "is" is a C# keyword, hence the @ prefix
          from j in js
          select /* something involving i/j */;

This is still O(n*m), but adds only the nominal extra overhead of LINQ (you won't notice it in normal usage).

What are you doing in your case? There may be tricks...

One thing definitely to avoid is anything that forces the cross-join to buffer: the foreach approach is fine and doesn't buffer, but if you add each item to a List<>, then beware the memory implications. The same goes for OrderBy etc. (if used inappropriately).
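
To make the non-buffering point concrete, here is a minimal sketch of a lazily streaming cross-join helper (the helper name is invented; this is not from Marc's answer):

using System.Collections.Generic;

static class CrossJoin
{
    // Yields each pair as it is produced; nothing is buffered.
    // Note: 'second' is enumerated once per item of 'first', so it should
    // be a re-enumerable collection rather than a one-shot stream.
    public static IEnumerable<KeyValuePair<T1, T2>> Pairs<T1, T2>(
        IEnumerable<T1> first, IEnumerable<T2> second)
    {
        foreach (T1 i in first)
            foreach (T2 j in second)
                yield return new KeyValuePair<T1, T2>(i, j);
    }
}

Consumed with foreach, this never materialises the product in memory; calling ToList() on the result would reintroduce the full O(n*m) buffer.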

Marc Gravell
if you're working with C# 4.0, or PLINQ, and have a multicore machine, you could add an .AsParallel(), as in `from i in is.AsParallel()`
Rubens Farias
added more background for my case but I doubt there are any tricks that could apply
RobV
using the LINQ approach wouldn't be that useful in my case since there are some checks and function calls involved in deciding whether to actually do the inner loop
RobV
You can of course mix LINQ and checks..., but point taken. Note Rubens point - I like that ;-p
Marc Gravell
at least one of the calls, though, is completely unrelated to the pairing: it's a timeout check used to limit execution time and stop queries running for too long, so it can't really be folded into a query. Plus some of the checks themselves involve LINQ queries, which would lead to a horribly complex query. Rubens has a point but I'm not able to move to C# 4.0 anytime soon
RobV
+1  A: 

Additional information has been added to the question.

Duplicates can be avoided if you record those you've already computed, so you don't compute them again; the assumption is that the cost of such bookkeeping (a hashmap or a simple list) is less than the cost of computing a duplicate.
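
As a rough sketch of that bookkeeping, assuming the pairing work is expensive enough to outweigh the hashing (keys1, keys2 and ComputePair are hypothetical names):

using System.Collections.Generic;

class DedupSketch
{
    // Only compute each (i, j) pair once; worthwhile only when ComputePair
    // costs more than the hash lookup used to record it.
    static void PairAll(IEnumerable<int> keys1, List<int> keys2)
    {
        var seen = new HashSet<KeyValuePair<int, int>>();
        foreach (int i in keys1)
            foreach (int j in keys2)
                if (seen.Add(new KeyValuePair<int, int>(i, j))) // false if already recorded
                    ComputePair(i, j);
    }

    static void ComputePair(int i, int j) { /* hypothetical expensive pairing work */ }
}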

The C# runtime is really very fast, but for extreme heavy lifting, you might want to drop into native code.

You might also notice the inherent parallelism of this problem. If the computation of one product pair does not impact the computation of any other, you could straightforwardly use multiple processor cores to do the work in parallel. Look at ThreadPool.QueueUserWorkItem.
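
A hedged sketch of that approach, partitioning the outer loop across ThreadPool work items; keys1, keys2 and ComputePair are hypothetical, and ComputePair is assumed to touch no shared mutable state:

using System.Threading;

class ParallelProductSketch
{
    // One work item per outer-loop row; a counter tracks completion.
    static void PairAll(int[] keys1, int[] keys2)
    {
        if (keys1.Length == 0) return;
        int pending = keys1.Length;
        using (ManualResetEvent done = new ManualResetEvent(false))
        {
            foreach (int i in keys1)
            {
                int iCopy = i; // capture a copy, not the loop variable
                ThreadPool.QueueUserWorkItem(delegate
                {
                    foreach (int j in keys2)
                        ComputePair(iCopy, j); // assumed thread-safe
                    if (Interlocked.Decrement(ref pending) == 0)
                        done.Set();
                });
            }
            done.WaitOne(); // block until every row has been processed
        }
    }

    static void ComputePair(int i, int j) { /* hypothetical pairing work */ }
}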

Will
in my case the duplicates are necessary, some sort of parallelism may work though
RobV
A: 

If cache locality (or the local memory required to maintain the js) is a problem, you can make your algorithm more cache-friendly by bisecting the input arrays recursively. Something like:

const int IMIN = 16, JMIN = 16; // base-case thresholds; values here are assumed, tune for your cache

void CartProd(int[] @is, int istart, int ilen, int[] js, int jstart, int jlen) {
  if (ilen <= IMIN && jlen <= JMIN) { // base case: both ranges are small
    for (int a = istart; a < istart + ilen; a++) {
      for (int b = jstart; b < jstart + jlen; b++) {
        // pair @is[a] and js[b]
      }
    }
    return;
  }
  if (ilen > IMIN && jlen > JMIN) { // divide into 4 quadrants
    int ilen2 = ilen >> 1;
    int jlen2 = jlen >> 1;
    CartProd(@is, istart, ilen2,                js, jstart, jlen2);
    CartProd(@is, istart + ilen2, ilen - ilen2, js, jstart, jlen2);
    CartProd(@is, istart + ilen2, ilen - ilen2, js, jstart + jlen2, jlen - jlen2);
    CartProd(@is, istart, ilen2,                js, jstart + jlen2, jlen - jlen2);
    return;
  }
  if (ilen > IMIN) { // only the i range is large: bisect it
    int ihalf = ilen >> 1;
    CartProd(@is, istart, ihalf,                js, jstart, jlen);
    CartProd(@is, istart + ihalf, ilen - ihalf, js, jstart, jlen);
    return;
  }
  // only the j range is large: bisect it
  int jhalf = jlen >> 1;
  CartProd(@is, istart, ilen, js, jstart, jhalf);
  CartProd(@is, istart, ilen, js, jstart + jhalf, jlen - jhalf);
}

Note that this access pattern will automatically take fairly good advantage of every level of the cache hierarchy; this kind of technique is called cache-oblivious algorithm design.

comingstorm
+1  A: 

I can't propose anything better than O(n^2), because that's the size of the output, and hence the algorithm can't be any faster.

What I can suggest is reconsidering whether you need to compute the product at all. For example, you may not even need the product set P if you are only going to query whether certain elements belong to it. You only need the information about the sets it's composed of.

Indeed (pseudocode):

bool IsInSet(pair (x,y), CartesianProductSet P)
{
   return IsInHash(x,P.set[1]) && IsInHash(y,P.set[2])
}

where the Cartesian product is "calculated" like this:

// Cartesian product of A and B is
P.set[1]=A; P.set[2]=B;

If you implement the sets as hashes, then a lookup in a Cartesian product of m sets is just m hash lookups, which you get almost for free. Construction of the Cartesian product and each IsInSet lookup take O(m) time, where m is the number of sets you multiply, and m is typically much less than n, the size of each set.
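
A runnable C# take on the pseudocode above, assuming integer elements and using HashSet<T> for the hash lookups (the class and member names are illustrative):

using System.Collections.Generic;

class CartesianProductSet
{
    private readonly HashSet<int> setA;
    private readonly HashSet<int> setB;

    // "Computing" the product just stores the two factor sets: O(m) work
    // for m sets, never the O(n^2) enumeration of actual pairs.
    public CartesianProductSet(IEnumerable<int> a, IEnumerable<int> b)
    {
        setA = new HashSet<int>(a);
        setB = new HashSet<int>(b);
    }

    // (x, y) is in A x B exactly when x is in A and y is in B.
    public bool IsInSet(int x, int y)
    {
        return setA.Contains(x) && setB.Contains(y);
    }
}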

Pavel Shved
unfortunately it is necessary to have the entire product set, so your idea won't work in my case, though it is nice
RobV
@RobV, then, oops, you've been n-squared! :-)
Pavel Shved