views: 421
answers: 9

In a 2D pixel array, I need an efficient algorithm that will select p% of pixels that are the most spread out.

This can be done adaptively by selecting points, then repeatedly adjusting the positions of points that are too close together. But this isn't efficient since it would require many iterations and distance calculations.

It doesn't have to be perfect, it just needs to avoid point clusters as much as can be done efficiently.

A: 

How about this:

  1. Compute the sum of the distances from each point to each other point. So point A has a sum distance of dist(A, B) + dist(A, C) + dist(A, D) + ...
  2. Sort these summed distances.
  3. Remove points that have the smallest distance sums until you reach your desired percentage (sketched in code below).

This may be accurate enough, but if not, you could always replace step 3 with:

"Remove the point that has the smallest sum, and if you need to remove more points to reach your desired percentage, then go back to step 1."

Wait. Now I'm wondering. Are you trying to find the points that are most spread out from a given set of points...or trying, from a given array, to find the points that would be the most spread out? That's totally different...and still really hard.

Beska
Thanks for the answer, but I see two potential problems: (1) calculation of each point-point distance has an order-n-squared running time, which isn't efficient, and (2) the distance calculations involve floating-point arithmetic, which is relatively slow compared to integer operations.
To answer your question, I'm trying to find the p% points that would be the most spread out, or at least not clustered.
Oh yeah, no question...if you're looking at a large data set, you're asking for some serious computation time.
Beska
I'm still not 100% sure what you're asking for...do you have a populated array, that has some "pixels" set in it, and by removing some of these pixels you're trying to maximize the spread? Or are you trying to find the most spread out formation of X pixels in an array of size Y,Z? (This answer was to the former possibility...my other answer is to the latter possibility).
Beska
A: 

How about calculating a "density" value for each pixel to start with, based on its proximity to all the other pixels. Then, repeatedly remove the most "dense" pixel until you're below p% remaining in the list.

You'd need to do the distance calculation to determine density between any given two points at most twice. First time would be when you build the original list - each pixel would need to be compared with each other pixel. Second would be when you remove a pixel from the list - you'd have to calculate the removed one against each one remaining in the list. This is to account for density values changing as each pixel is removed - for example, 2 pixels directly next to each other would have a very very high value, but once one is removed, the remaining one might have a very low value.

Some quick pseudocode (note that in this example, higher density areas have a low number)

For I = 0 to MaxPixel
    For J = I + 1 to MaxPixel
        PixelDensity[I], PixelDensity[J] += DistanceBetween(Pixels[I], Pixels[J])

While PixelDensity.Count > TargetCount
    RemovePixel = IndexOfSmallest(PixelDensity)
    ForEach I in PixelDensity
        PixelDensity[I] -= DistanceBetween(Pixels[I], Pixels[RemovePixel])
    PixelDensity.Remove(RemovePixel)

If memory is less of a concern than calculation time, you can also store the distance between any two points in a simple 2d array. Also, instead of simple distance, it might be helpful to make the distance calculation exponential - that would avoid something like having two points almost on top of each other, but far away from everything else, and having both make it through.
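
Here's a rough C++ rendering of the pseudocode above (the Point struct and dist() helper are assumptions on my part, and it uses plain Euclidean distance rather than the exponential variant suggested above):

#include <algorithm>
#include <cmath>
#include <vector>

struct Point { double x, y; };

double dist(const Point &a, const Point &b)
{
  return std::hypot(a.x - b.x, a.y - b.y);
}

// Repeatedly drop the pixel with the smallest summed distance (the densest
// spot), updating the remaining sums incrementally instead of recomputing.
std::vector<Point> ThinByDensity(std::vector<Point> pts, size_t target)
{
  std::vector<double> density(pts.size(), 0.0);
  for (size_t i = 0; i < pts.size(); ++i)
    for (size_t j = i + 1; j < pts.size(); ++j)
    {
      double d = dist(pts[i], pts[j]);
      density[i] += d;
      density[j] += d;
    }

  while (pts.size() > target)
  {
    size_t worst = std::min_element(density.begin(), density.end()) - density.begin();
    for (size_t i = 0; i < pts.size(); ++i)
      density[i] -= dist(pts[i], pts[worst]);
    pts.erase(pts.begin() + worst);
    density.erase(density.begin() + worst);
  }
  return pts;
}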

matthock
It sounds like it might work, but the iterations and floating-point calculations might make it slow.
Yeah, I'm not really sure what size data set you're dealing with. Is there possibly a fixed-precision 'filter' type operation you can run when deciding what points to compare? For example, you can say "If these points are over Z distance apart on either the X or Y axis, they're far enough apart to ignore and not do the floating point calculations". Could save you some computation time with next to no effect on accuracy.
matthock
A: 

An iterative mine-sweeper flood-fill approach would be straightforward to visualise.

  1. For each cell, find the two nearest points and record the product of those two distances.
  2. The cells with the highest products are the ones furthest from the existing points.
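
As a hedged sketch of the scoring step (the Point struct, dist() helper, and the idea of scoring each cell against an already-chosen set are my reading of the answer, not something it states explicitly):

#include <cmath>
#include <limits>
#include <vector>

struct Point { double x, y; };

double dist(const Point &a, const Point &b)
{
  return std::hypot(a.x - b.x, a.y - b.y);
}

// Score a cell by the product of its distances to the two nearest points
// already chosen; a high score means the cell is far from everything chosen.
// Assumes at least two points have been chosen so far.
double Score(const Point &cell, const std::vector<Point> &chosen)
{
  double d1 = std::numeric_limits<double>::max();
  double d2 = std::numeric_limits<double>::max();
  for (const Point &p : chosen)
  {
    double d = dist(cell, p);
    if (d < d1) { d2 = d1; d1 = d; }
    else if (d < d2) { d2 = d; }
  }
  return d1 * d2;
}

Greedily adding the highest-scoring cell and rescoring until p% of cells are chosen would be one way to turn this into a full selection, though the answer doesn't spell that part out.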
Will
It sounds like it might work, but the iterations and distance calculations might make it slow.
+1  A: 

You want a Poisson Disk distribution, but it's tricky. Doing a search turns up lots of academic papers about how to do it efficiently: http://people.csail.mit.edu/thouis/JonesPoissonPreprint.pdf
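
For illustration only, here's a naive dart-throwing sketch of the idea in C++ (the papers above describe far more efficient methods; the minimum-distance radius and the give-up threshold are arbitrary assumptions):

#include <cmath>
#include <random>
#include <vector>

struct Point { double x, y; };

// Accept random pixels only if they keep a minimum distance to everything
// already accepted; stop when enough points are placed or we give up.
std::vector<Point> DartThrow(int width, int height, size_t wanted, double minDist)
{
  std::mt19937 rng{std::random_device{}()};
  std::uniform_real_distribution<double> ux(0, width), uy(0, height);

  std::vector<Point> accepted;
  int failures = 0;
  while (accepted.size() < wanted && failures < 10000)
  {
    Point cand{ux(rng), uy(rng)};
    bool ok = true;
    for (const Point &p : accepted)
      if (std::hypot(cand.x - p.x, cand.y - p.y) < minDist) { ok = false; break; }
    if (ok) { accepted.push_back(cand); failures = 0; }
    else    { ++failures; }
  }
  return accepted;
}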

Ned Batchelder
It sounds complex, and the floating-point calculations may be slow.
yeah, it sure does!
Ned Batchelder
A: 

Ooh! How about this!

(In a very hand-wavey way, since I don't know whether your matrix is square or whatever...I'll assume it is.)

Say you have a 1000x1000 array that you want to place 47 points into (I'm picking 47 because it's an awkward number that won't fit "nicely").

You take the ceil(sqrt(47))...that will give you a value (7). So we make a 7x7 square, fill it with 47 pixels (some are blank), and imagine placing that in the corner of the array.

Now, translate each of those pixels to a new location by scaling its position in the small (7x7) array up to the large array (1000x1000). A simple equation should do this for you...for the X coordinate, for example:

xBigArrayIndex = xSmallArrayIndex * 1000 / 7;

Then your pixels will be super spread out! And it's nice and fast.

The only downside is that this only works perfectly when your square is ideally spaced to begin with...if you fill it naively (starting at the top left, going across, etc.) you will end up with a slightly sub-ideal spread...since the translated pixels won't quite reach the bottom right of the large array. But maybe this is good enough? And if not, perhaps it's a smaller subset of the problem that is easier to deal with?
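
Here's a quick C++ sketch of that mapping (using the naive row-by-row fill described above; the function and type names are mine):

#include <cmath>
#include <vector>

struct Cell { int x, y; };

// Place count pixels in a small ceil(sqrt(count)) square, then stretch that
// square over a bigWidth x bigHeight array.
std::vector<Cell> SpreadByScaling(int count, int bigWidth, int bigHeight)
{
  int side = (int)std::ceil(std::sqrt((double)count));
  std::vector<Cell> cells;
  for (int i = 0; i < count; ++i)
  {
    int xSmall = i % side;
    int ySmall = i / side;
    cells.push_back({xSmall * bigWidth / side, ySmall * bigHeight / side});
  }
  return cells;
}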

Beska
It's efficient. But for higher densities and non-square arrays, the points would form clusters. The array I'm using is in the shape of a long narrow strip, roughly 30 pixels wide (this can change) and thousands of pixels long. (This is the best answer so far...)
A: 

You can use a convex hull algorithm: exclude the points the algorithm finds and repeat until you reach your p% criterion, or

execute the convex hull steps and check which points lie on and inside the hull until the 100% - p% criterion is met.

Some convex hull demos are here: http://www.cs.unc.edu/~snoeyink/demos/

and there's some more info here: http://softsurfer.com/Archive/algorithm_0109/algorithm_0109.htm
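
Here's a C++ sketch of the first variant (the monotone-chain hull is a standard algorithm; the "peel off hull layers and keep the outermost points until you have p%" loop is my reading of the suggestion, not something it states explicitly):

#include <algorithm>
#include <vector>

struct Pt { int x, y; };

// Cross product of OA and OB; positive means a left turn.
long long cross(const Pt &o, const Pt &a, const Pt &b)
{
  return (long long)(a.x - o.x) * (b.y - o.y) - (long long)(a.y - o.y) * (b.x - o.x);
}

// Andrew's monotone chain convex hull, returned counter-clockwise.
std::vector<Pt> ConvexHull(std::vector<Pt> p)
{
  int n = (int)p.size();
  if (n < 3) return p;
  std::sort(p.begin(), p.end(), [](const Pt &a, const Pt &b) {
    return a.x < b.x || (a.x == b.x && a.y < b.y);
  });
  std::vector<Pt> h(2 * n);
  int k = 0;
  for (int i = 0; i < n; ++i)                     // lower hull
  {
    while (k >= 2 && cross(h[k - 2], h[k - 1], p[i]) <= 0) --k;
    h[k++] = p[i];
  }
  for (int i = n - 2, t = k + 1; i >= 0; --i)     // upper hull
  {
    while (k >= t && cross(h[k - 2], h[k - 1], p[i]) <= 0) --k;
    h[k++] = p[i];
  }
  h.resize(k - 1);
  return h;
}

// Peel hull layers, keeping the outermost points, until the target count
// is collected.
std::vector<Pt> PeelHulls(std::vector<Pt> pts, size_t target)
{
  std::vector<Pt> picked;
  while (picked.size() < target && !pts.empty())
  {
    for (const Pt &hp : ConvexHull(pts))
    {
      picked.push_back(hp);
      for (size_t i = 0; i < pts.size(); ++i)     // drop hp from the pool
        if (pts[i].x == hp.x && pts[i].y == hp.y)
        {
          pts.erase(pts.begin() + i);
          break;
        }
    }
  }
  return picked;
}

The last layer may overshoot the target slightly; trimming it is left out for brevity.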

MoreThanChaos
+1  A: 

Thanks to everyone for the answers!

The best solution appears to be using "prebuilt building blocks": n x n arrays with already-selected cells, and covering the pixel array with these.

For example, a 4 x 4 array with 12.5% coverage would be:

0 0 1 0
0 0 0 0
1 0 0 0
0 0 0 0

With 6.25% coverage:

0 0 0 0
0 1 0 0
0 0 0 0
0 0 0 0

To get a % coverage between these, just alternate between these blocks according to a running tally of the overall actual % coverage so far. To cover a width that's not a multiple of 4, use some 3 x 3 blocks. To cover a larger area more efficiently, just use larger blocks.

This covers the whole array efficiently with no distance calculations or floating-point arithmetic.
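
Here's a C++ sketch of the tiling (the two 4 x 4 patterns are the ones above; treating the target as a fraction, assuming dimensions that are multiples of 4, and the exact form of the running tally are my own guesses at the details):

#include <vector>

// Two prebuilt 4 x 4 blocks: 2/16 = 12.5% and 1/16 = 6.25% coverage.
static const int kDense[4][4]  = {{0,0,1,0},{0,0,0,0},{1,0,0,0},{0,0,0,0}};
static const int kSparse[4][4] = {{0,0,0,0},{0,1,0,0},{0,0,0,0},{0,0,0,0}};

// Tile a width x height array (both assumed to be multiples of 4) and pick
// whichever block keeps the running coverage closest to the target fraction.
std::vector<std::vector<int>> CoverWithBlocks(int width, int height, double targetFraction)
{
  std::vector<std::vector<int>> grid(height, std::vector<int>(width, 0));
  long long selected = 0, visited = 0;
  for (int by = 0; by < height; by += 4)
    for (int bx = 0; bx < width; bx += 4)
    {
      double soFar = visited ? (double)selected / visited : 0.0;
      const int (*block)[4] = (soFar < targetFraction) ? kDense : kSparse;
      for (int y = 0; y < 4; ++y)
        for (int x = 0; x < 4; ++x)
        {
          grid[by + y][bx + x] = block[y][x];
          selected += block[y][x];
        }
      visited += 16;
    }
  return grid;
}

Targets outside the 6.25%-12.5% range, widths that aren't multiples of 4, and the 3 x 3 blocks mentioned above would need extra block patterns and cases.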

A: 

The "most spread out" selection of pixels is the set whose Delaunay triangulation consists of equilateral triangles. The set of points which leads to this triangulation is found by splitting the pixel array into a set of boxes, where each box is sqrt(3) longer than it is wide. Each box contributes 5 pixels to the final pixel set (one at each corner, plus a center node at the center of the box). The trick is to find how many rows and columns of boxes will give you this 1:sqrt(3) ratio. Without going through the derivation, here's how you get that:

std::vector<PixelIndices> PickPixels(int width, int height, float percent)
{
  int total_pixels = width*height;
  int desired_pixel_count = (int)(total_pixels*percent);

  // split the region up into "boxes" with 4 corner nodes and a center node.
  // each box is sqrt(3) times taller than it is wide.
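  //
  // Derivation: with C columns and R rows of boxes, each box is width/C wide
  // and height/R tall, so the 1:sqrt(3) shape gives R = 0.577*C*height/width.
  // The corner nodes form an (R+1) x (C+1) grid and each box adds one center
  // node, for 2*R*C + R + C + 1 pixels in total. Setting that equal to the
  // desired count and substituting R gives the quadratic a*C^2 + b*C + c = 0
  // that is solved for C (num_columns) below.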

  // calculate how many columns of boxes
  float a = 1.155*height/(float)width;
  float b = .577*height/(float)width + 1;
  float c = 1 - desired_pixel_count;
  int num_columns = (int)((-b + sqrt(b*b -4*a*c))/(2*a));

  // Now calculate how many rows
  int num_rows = (int)(.577*height*num_columns/(float)width);

  // total number of pixels
  int actual_pixel_count = 2*num_rows*num_columns + num_rows + num_columns + 1;

  std::cout << "  Total pixels: " << total_pixels << std::endl;
  std::cout << "       Percent: " << percent << std::endl;
  std::cout << "Desired pixels: " << desired_pixel_count << std::endl;
  std::cout << " Actual pixels: " << actual_pixel_count << std::endl;
  std::cout << "   Number Rows: " << num_rows << std::endl;
  std::cout << "Number Columns: " << num_columns << std::endl;

  // Pre-allocate space for the pixels
  std::vector<PixelIndices> results;
  results.reserve(actual_pixel_count);

  // Now get the pixels; the integer math below is approximate and untested
  for (int row = 0; row <= num_rows; row++)
  {
    int row_index = row*(height-1)/num_rows;

    // Top of box
    for (int col = 0; col <= num_columns; col++)
    {
      int col_index = col*(width-1)/num_columns;
      results.push_back(PixelIndices(row_index, col_index));
    }

    // Middle of box: the center node of each box, as described above
    if (row != num_rows)
    {
      for (int col = 0; col < num_columns; col++)
      {
        int center_row = row_index + (height-1)/(2*num_rows);
        int center_col = col*(width-1)/num_columns + (width-1)/(2*num_columns);
        results.push_back(PixelIndices(center_row, center_col));
      }
    }
  }

  return results;
}

Instead of using integer division to find each index, you could speed this up by computing the spacing between adjacent points in a row or column once and then stepping by that offset.

Darryl
A: 

You might try Wang tiles:
http://en.wikipedia.org/wiki/Wang%5Ftile
(See the pages linked there to Cohen's paper, and Kopf's paper. I'm a new user so can't post all the links).

These tie together the prebuilt-tiles idea and the even-distribution requirement usually addressed with Poisson-disk patterns. Wang tiles can avoid the periodicity effects that are almost surely an issue with more direct use of prebuilt tiles.

thouis