ansaurus

Question

More efficient way to count intersections?

Answer 1

A:

You can probably start by combining the two functions to create both results at once. Also there's no need to make a list of the combinations before looping, as it is already a generator, and that might save you some time.

def mask_connectivity_matrix_and_tracks(tracks,masks,masks_coords_list):
    connect_mat=zeros((len(masks),len(masks)))
    vox_tracks_img=zeros((xdim,ydim,zdim,len(masks)))
    for track in tracks:
        cur=[]
        for count,mask_coords in enumerate(masks_coords_list):
            if any(set(track) & set(mask_coords)):
                cur.append(count)
                for x,y,z in track:
                    vox_tracks_img[x,y,z,count] += 1
            for x,y in itertools.combinations(cur,2):
                connect_mat[x,y] += 1

Also, this will probably never be "fast" as in "finished before we die", so the best way is to eventually compile it with Cython as a c module for python.

Tor Valamo 2009-12-15 22:19:13

Answer 2

A:

If you stored the each mask set of points: (1,2,3), (1,2,4), (1,3,1) as a dictionary like this: {1: [{2: set([3, 4])}, {3: set([1])}]}, you might end up being able to check for matches faster...but maybe not.

Brian 2009-12-15 22:22:01

Answer 3

A:

A minor optimization (same big-O, sligthly smaller multiplier) can be had by removing redundant operations:

don't call set so many times on each track and mask: call it once per track and once per mask, to set up auxiliary "parallel" lists of sets, then work on those
if any(someset): is semantically the same as if someset: but a bit slower

Won't make a dramatic difference, but might minutely help.

Alex Martelli 2009-12-15 22:24:29

Answer 4

A:

Lame to suggest yet another incremental improvement that might be made, I know, but:

Sets of small integers can be modeled as bit-vectors using Python's long ints. Suppose you replace each tuple with a small integer id, then convert each track and each set of mask-coords into a set of those small ids. You could represent those sets as long ints, making the intersection operation a bit faster (but not asymptotically faster).

Jason Orendorff 2009-12-15 22:31:58

Answer 5

+1 A:

OK, I think I finally have something that will reduce the complexity. This code should really fly compared to what you've got.

It seems like first you need to know which tracks coincide with which masks, the incidence matrix.

import numpy
from collections import defaultdict

def by_point(sets):
    d = defaultdict(list)
    for i, s in enumerate(sets):
        for pt in s:
            d[pt].append(i)
    return d

def calc(xdim, ydim, zdim, mask_coords_list, tracks):
    masks_by_point = by_point(mask_coords_list)
    tracks_by_point = by_point(tracks)

    a = numpy.zeros((len(mask_coords_list), len(tracks)), dtype=int)
    for pt, maskids in masks_by_point.iteritems():
        for trackid in tracks_by_point.get(pt, ()):
            a[maskids, trackid] = 1
    m = numpy.matrix(a)

The adjacency matrix you're looking for is m * m.T.

The code you have so far computes the upper triangle only. You can use triu to grab just that half.

    am = m * m.T  # calculate adjacency matrix
    am = numpy.triu(am, 1)  # keep only upper triangle
    am = am.A  # convert matrix back to array

The voxel calculation can use the incidence matrix too.

    vox_tracks_img = numpy.zeros((xdim, ydim, zdim, len(mask_coords_list)), dtype=int)
    for trackid, track in enumerate(tracks):
        for x, y, z in track:
            vox_tracks_img[x, y, z, :] += a[:,trackid]
    return am, vox_tracks_img

For me this runs in under a second for data sets having hundreds of masks and tracks.

If you have many points that appear in masks but are not on any tracks, it might be worthwhile to delete the entries for those points from masks_by_point before entering the loop.

Jason Orendorff 2009-12-15 23:06:21

Answer 6

+2 A:

Linearize the voxel coordinates, and put them into two scipy.sparse.sparse.csc matrices.

Let v be the number of voxels, m the number of masks, and t the number of tracks.
Let M be the mask csc matrix, size (m x v), where a 1 at (i,j) means mask i overlaps voxel j.
Let T be the track csc matrix, size (t x v), where a 1 at (k,j) means track k overlaps voxel j.

Overlap = (M * T.transpose() > 0)  # track T overlaps mask M  
Connected = (Overlap * Overlap.tranpose() > 0) # Connected masks
Density[mask_idx] = numpy.take(T, nonzero(Overlap[mask_idx, :])[0], axis=0).sum(axis=0)

I might be wrong on the last one, and I'm not sure css_matrices can be operated on by nonzero & take. You might need to pull out each column in a loop and convert it to a full matrix.

I ran some experiments trying to simulate what I thought was a reasonable amount of data. The code below takes about 2 minutes on a 2-year old MacBook. If you use csr_matrices, it takes about 4 minutes. There is probably a tradeoff depending on how long each track is.

from numpy import *
from scipy.sparse import csc_matrix

nvox = 1000000
ntracks = 300000
nmask = 100

# create about 100 entries per track
tcoords = random.uniform(0, ntracks, ntracks * 100).astype(int)
vcoords = random.uniform(0, nvox, ntracks * 100).astype(int)
d = ones(ntracks * 100)
T = csc_matrix((d,  vstack((tcoords, vcoords))), shape=(ntracks, nvox), dtype=bool)

# create around 10000 entries per mask
mcoords = random.uniform(0, nmask, nmask * 10000).astype(int)
vcoords = random.uniform(0, nvox, nmask * 10000).astype(int)
d = ones(nmask * 10000)
M = csc_matrix((d, vstack((mcoords, vcoords))), shape=(nmask, nvox), dtype=bool)

Overlap = (M * T.transpose()).astype(bool) # mask M overlaps track T
Connected = (Overlap * Overlap.transpose()).astype(bool) # mask M1 and M2 are connected
Density = Overlap * T.astype(float) # number of tracks overlapping mask M summed across voxels

thouis 2009-12-16 01:46:14

If the dtype of the matrices is set to bool, I don't think the "> 0" bits are necessary anymore.

thouis 2009-12-16 01:52:00

actually, not true. at least for sparse matrices, multipltication promotes them to a byte. (?) I hope that doesn't mean there are wraparound issues, as well.

thouis 2009-12-16 03:06:04

Thanks for this, sped me up to under one minute with the average track length around 10 and the average mask size around 500.

jbrown 2009-12-21 20:33:04

Glad to help. Did you validate against the slow method to make sure wraparound wasn't an issue? I'm guessing not, as numpy is pretty solid, but I like to be careful about these things (and I'd like to know the answer for future reference).

thouis 2009-12-22 02:48:27

Actually, this did end up causing wraparound. I wasn't familiar with that concept until I found negative numbers in the Connected matrix and had to figure out where they came from. I'm not sure what the upper limits are on max size for csc_matrix multiplication, but there do appear to be limits - my T matrix was 140000x2000000 and my M matrix was 150 x 2000000.

jbrown 2010-07-30 22:55:10

ansaurus

tags:

views:

answers:

More efficient way to count intersections?

related questions