views: 115
answers: 5
I have a set of trajectories, each made up of points, with coordinates associated with each point. I store these in a 3D array indexed as (trajectory, point, param). I want to find the set of r trajectories that has the maximum accumulated distance over the possible pairwise combinations of those trajectories. My first attempt, which I think is working, looks like this:

import itertools
import numpy

max_dist = 0
for h in itertools.combinations ( xrange(num_traj), r):
    # accumulate the distance over every pair in this subset
    accum = 0.
    for (m,l) in itertools.combinations (h, 2):
        for ( i, j ) in itertools.izip ( range(k), range(k) ):
            A = [ (my_mat[m, i, z] - my_mat[l, j, z])**2 \
                    for z in xrange(k) ]
            A = numpy.array( numpy.sqrt (A) ).sum()
            accum += A
    if max_dist < accum:
        max_dist = accum
        selected_trajectories = h

This takes forever, as num_traj can be around 500-1000, and r can be around 5-20. k is arbitrary, but can typically be up to 50.

Trying to be super-clever, I have put everything into two nested list comprehensions, making heavy use of itertools:

chunk = [[ numpy.sqrt((my_mat[m, i, :] - my_mat[l, i, :])**2).sum() \
        for ((m,l),i) in \
        itertools.product ( itertools.combinations(h,2), range(k) ) ]\
        for h in itertools.combinations(range(num_traj), r) ]

Apart from being quite unreadable (!!!), it is also taking a long time. Can anyone suggest any ways to improve on this?

+3  A: 

Rather than recalculating the distance between each pair of trajectories on demand, you can start by calculating the distances between all pairs once. Store those in a dictionary and look them up as needed.

This way your inner loop for (i,j) ... is replaced with a constant-time lookup.
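A minimal sketch of that precomputation, assuming the array layout from the question (`my_mat`, `num_traj` are its names; the sizes here are made up for illustration):

```python
import itertools
import numpy as np

# Toy sizes for illustration; the question's are larger.
num_traj, k, params = 6, 4, 3
rng = np.random.RandomState(0)
my_mat = rng.rand(num_traj, k, params)

# Precompute every pairwise trajectory distance once. The question's
# formula reduces to the sum of absolute coordinate differences,
# since sqrt((a - b)**2) == abs(a - b).
pair_dist = {}
for m, l in itertools.combinations(range(num_traj), 2):
    pair_dist[(m, l)] = np.abs(my_mat[m] - my_mat[l]).sum()

def accumulated_distance(h):
    # Constant-time lookups replace the inner distance loops.
    return sum(pair_dist[pair] for pair in itertools.combinations(h, 2))
```

The dictionary holds C(num_traj, 2) entries, so for 1000 trajectories that is about half a million floats, which fits comfortably in memory.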

mathmike
This has been a major source of speeding up! Thanks!
Jose
+2  A: 

You can ditch the square root in the distance calculation: a larger distance also has a larger squared distance, so comparisons still work, although that only yields a constant speedup. (Strictly, this holds when comparing individual distances; once several square-rooted terms are summed, dropping the roots can change which subset comes out on top.)

patros
+1  A: 

It's likely to take forever anyhow, as your algorithm is roughly O( C(N, r) * r^2 ), where C(N, r) is N choose r. For small r (or small N) this might be okay, but if you absolutely need the true maximum, as opposed to using an approximation heuristic, you should try branch and bound with different strategies. That might work for smaller r, and it saves you quite a bit on unnecessary recalculations.
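One way such a branch and bound might look, as a sketch under assumed names (`dist` is a precomputed symmetric matrix of pairwise trajectory distances, and the bound is deliberately crude):

```python
import itertools
import numpy as np

def best_subset(dist, r):
    """Exhaustive search with pruning: maximize the summed pairwise
    distance of an r-subset. `dist` is a symmetric (n, n) matrix."""
    n = dist.shape[0]
    max_edge = dist.max()          # best-case value of any single pair
    best_score = [-np.inf]
    best_choice = [None]

    def recurse(start, chosen, score):
        m = r - len(chosen)        # picks still to make
        if m == 0:
            if score > best_score[0]:
                best_score[0], best_choice[0] = score, tuple(chosen)
            return
        # Optimistic bound: each future pick contributes at most max_edge
        # to each of its (len(chosen) + m - 1) possible partners.
        if score + m * (len(chosen) + m - 1) * max_edge <= best_score[0]:
            return                 # prune: even the best case cannot win
        for c in range(start, n - m + 1):
            gain = sum(dist[c, j] for j in chosen)
            recurse(c + 1, chosen + [c], score + gain)

    recurse(0, [], 0.0)
    return best_choice[0], best_score[0]
```

A tighter bound (e.g. using each candidate's own largest distances instead of the global maximum) would prune more aggressively; this only shows the shape of the approach.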

Larry
+2  A: 

Here are a few points of interest and suggestions in addition to what everyone else has mentioned. (By the way, mathmike's suggestion of generating a look-up table of all pair distances is one that you should put in place immediately: it replaces the repeated per-pair distance computation with an O(1) lookup.)

First, the lines

for ( i, j ) in itertools.izip ( range(k), range(k) ):
    A = [ (my_mat[m, i, z] - my_mat[l, j, z])**2 \
        for z in xrange(k) ]

can be replaced with

for i in xrange(k):
    A = [ (my_mat[m, i, z] - my_mat[l, i, z])**2 \
        for z in xrange(k) ]

because i and j are always the same in every loop. There's no need to use izip at all here.

Second, regarding the line

A = numpy.array( numpy.sqrt (A) ).sum()

Are you sure this is how you want to calculate it? Possibly so, but it just struck me as odd because if this was more of a Euclidean distance between vectors then the line would be:

A = numpy.sqrt (numpy.array( A ).sum())

or just

A = numpy.sqrt(sum(A))

because I would think that converting A to a numpy array just to use numpy's sum function would be slower than using the built-in Python sum function, but I could be wrong. Also, if it truly is a Euclidean distance that you want, then you will be doing fewer sqrts this way.
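For what it's worth, the whole inner double loop collapses to one vectorized expression, because the question's sqrt-of-a-square is just an absolute value (the toy data below is made up):

```python
import numpy as np

rng = np.random.RandomState(0)
k, params = 50, 3
traj_m = rng.rand(k, params)   # stand-ins for my_mat[m] and my_mat[l]
traj_l = rng.rand(k, params)

# Loop version, as in the question: per point, sqrt each squared
# difference, sum, then accumulate over points.
loop_total = 0.0
for i in range(k):
    loop_total += np.sqrt((traj_m[i] - traj_l[i]) ** 2).sum()

# Vectorized equivalent: sqrt((a - b)**2) is abs(a - b).
vec_total = np.abs(traj_m - traj_l).sum()
```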

Third, do you realize how many potential combinations you may be trying to iterate over? For the worst case with num_traj = 1000 and r = 20, that is approximately 3.39E41 combinations by my estimate. That's quite intractable with your current method. Even for the best case of num_traj = 500 and r = 5, that's about 2.55E11 combinations, which is quite a few, but not impossible. This is the real problem you're having here, because with mathmike's advice in place, the first two points that I have mentioned aren't very important.

What can you do then? Well, you will need to be a little more clever. It isn't clear to me yet what would be a great method to use for this; I'm guessing that you will need to make the algorithm heuristic in some way. One thought I had was to try a dynamic-programming sort of approach with a heuristic. For each trajectory, you could find the sum or mean of its distances to every other trajectory and use that as a fitness measure. Some of the trajectories with the lowest fitness measures could be dropped before moving on to trios. You could then do the same thing with trios: find the sum or mean of the accumulated distances of all trios (among the remaining trajectories) that each trajectory is involved in, and use that as the fitness measure to decide which ones to drop before moving on to foursomes. It doesn't guarantee the optimal solution, but it should be quite good, and I believe it greatly lowers the time complexity.
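A sketch of the first pruning round described above (the names and sizes are invented; `dist` is again a precomputed pairwise-distance matrix):

```python
import numpy as np

def prune_by_fitness(dist, keep):
    """Keep the `keep` trajectories with the highest mean distance
    to every other trajectory; the rest are dropped from the search."""
    n = dist.shape[0]
    fitness = dist.sum(axis=1) / (n - 1)   # mean distance to the others
    return np.argsort(fitness)[-keep:]     # indices of the top scorers

rng = np.random.RandomState(0)
pts = rng.rand(100, 2)
dist = np.abs(pts[:, None, :] - pts[None, :, :]).sum(-1)
survivors = prune_by_fitness(dist, 20)
# Only C(20, r) subsets remain to enumerate instead of C(100, r).
```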

Justin Peel
+1  A: 
Denis