The last question I asked concerned how to bin data by x co-ordinate. The solution was simple and elegant, and I'm ashamed I didn't see it. This question may be harder (or I may just be blind).

I started out with about 140,000 data points and split them into 70 groups equally spaced along the x axis. I then took the average position (x_avg, y_avg) of each group and plotted them; a nice curve appeared. Unfortunately there are two problems. First, the edges are much less populated than the center of the graph; second, some areas change more than others and thus need better resolution.

I thus have two specific questions and a general invitation to throw suggestions:

Does matlab have a built-in way of splitting a matrix into either a fixed number of smaller matrices or smaller matrices of a fixed size?

Is there an algorithm (or a matlab function, but I find that unlikely) to determine the boundaries required to bin regions of interest more finely?

More generally, is there a better way of condensing tens of thousands of data points into a neat trend?

+1  A: 

I've never used matlab, but from looking at your previous question I suspect you're looking for something along the lines of a k-d tree or a variation.

Clarification: Since there seems to be some confusion about this, I think a pseudocode example is in order.

// Some of this shamelessly borrowed from the wikipedia article
function kdtree(points, lower_bound, upper_bound) {
    // lower_bound and upper_bound are the boundaries of your bucket
    if (points is empty) {
        return nil
    }
    // It's a trivial exercise to control the minimum size of a partition as well
    else {
        // Sort the points list and choose the median element
        select median from points.x

        node.location = median

        // Note <= on the left partition so the minimum point isn't dropped
        node.left  = kdtree(select from points where lower_bound <= points.x <= median, lower_bound, median)
        node.right = kdtree(select from points where median < points.x <= upper_bound, median, upper_bound)

        return node
    }
}

kdtree(points, -inf, inf)

// or alternatively

kdtree(points, min(points.x), max(points.x))
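
Since the y component is ignored here anyway, the same idea can be sketched directly in matlab as a recursive median split on the sorted x values, collecting each median as a bin edge. This is only an illustration of the partitioning idea above, not tested code; `medianEdges` and `minPoints` are hypothetical names of my choosing.

```matlab
% medianEdges.m -- hypothetical helper: recursively split x at the median,
% returning the split points (candidate bin edges) in sorted order.
% minPoints controls the smallest bucket that will still be split.
function edges = medianEdges(x, minPoints)
    x = sort(x(:)).';                 % work on a sorted row vector
    if numel(x) <= minPoints
        edges = [];                   % bucket is small enough: stop splitting
        return
    end
    m = median(x);
    left  = x(x <= m);
    right = x(x > m);
    if isempty(right)
        edges = [];                   % all remaining values tie at the median
        return
    end
    edges = [medianEdges(left, minPoints), m, medianEdges(right, minPoints)];
end

% Usage sketch: roughly 70 density-adaptive buckets from the full data set
% allEdges = [min(x), medianEdges(x, ceil(numel(x)/70)), max(x)];
```

Because every split is at a median, dense regions of x get many narrow buckets and sparse regions get a few wide ones, which is the "near optimal boundaries" property mentioned above.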
Kevin Loney
I don't think I need a k-d tree; I'm not sorting in the y direction. Maybe a simple binary tree might work, but I can't see how.
Alex R
k-d trees are binary trees with the addition of a partitioning plane attached to each node. They also scale regardless of the number of dimensions you have, so you can simply ignore the y component, sort on the x component, and the planes you get should be near-optimal boundaries for your buckets.
Kevin Loney
Even if I have a binary tree, I don't see what I'd do to determine boundaries.
Alex R
+2  A: 

It sounds like you want to use bins that vary in size depending on the density of x values. I think you can still use the function HISTC as in the answer to your previous post; you would just have to give it a different set of edges.

I don't know if this is exactly what you want, but here's one suggestion: instead of splitting the x axis into 70 equally spaced groups, split the sorted x data into 70 equal groups and determine the edge values. I think this code should work:

% Start by assuming x and y are row vectors of data:

nBins = 70;
nValues = length(x);
[xsort,index] = sort(x);  % Sort x in ascending order
ysort = y(index);         % Sort y the same way as x
binEdges = [xsort(1:ceil(nValues/nBins):nValues) xsort(nValues)+1];
nBins = length(binEdges)-1;  % Recompute, in case nValues isn't a multiple of nBins

% Bin the data and get the averages as in previous post (using ysort instead of y):

[h,whichBin] = histc(xsort,binEdges);

binMean = zeros(1,nBins);  % Preallocate the vector of bin averages
for i = 1:nBins
    flagBinMembers = (whichBin == i);
    binMembers = ysort(flagBinMembers);
    binMean(i) = mean(binMembers);
end

This should give you bins that vary in size with the data density.


UPDATE: Another version...

Here's another idea I came up with after a few comments. With this code, you set a threshold (maxDelta) for the difference between neighboring data points in x. Any x values that differ from their larger neighbor by an amount greater than or equal to maxDelta are forced to be in their own bin (all by their lonesome). You still choose a value for nBins, but the final number of bins will be larger than this value when spread-out points are relegated to their own bins.

% Start by assuming x and y are row vectors of data:

maxDelta = 10; % Or whatever suits your data set!
nBins = 70;
nValues = length(x);
[xsort,index] = sort(x);  % Sort x in ascending order
ysort = y(index);         % Sort y the same way as x

% Create bin edges:

edgeIndex = false(1,nValues);
edgeIndex(1:ceil(nValues/nBins):nValues) = true;
edgeIndex = edgeIndex | ([0 diff(xsort)] >= maxDelta);
nBins = sum(edgeIndex);
binEdges = [xsort(edgeIndex) xsort(nValues)+1];

% Bin the data and get the y averages:

[h,whichBin] = histc(xsort,binEdges);

binMean = zeros(1,nBins);  % Preallocate the vector of bin averages
for i = 1:nBins
    flagBinMembers = (whichBin == i);
    binMembers = ysort(flagBinMembers);
    binMean(i) = mean(binMembers);
end

I tested this on a few small sample sets of data and it seems to do what it's supposed to. Hopefully it will work for your data set too, whatever it contains! =)
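
Either version leaves you with `binEdges` and `binMean`; a minimal sketch for visualizing the resulting trend is to plot each bin's mean y value against the midpoint of that bin's edges:

```matlab
% Plot the binned trend: mean y per bin versus the bin centers
binCenters = (binEdges(1:end-1) + binEdges(2:end))/2;  % midpoints of adjacent edges
plot(binCenters, binMean, '.-');
xlabel('x (bin center)');
ylabel('mean y in bin');
```

Since `binEdges` has one more element than `binMean`, the midpoint vector lines up one-to-one with the bin averages.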

gnovice
I was considering that, and I will try it. My two concerns are these: I fear losing behavior at the edges, where the data is not very dense, and there may still be sections in which I need a higher concentration.
Alex R
I would try the above code with a bunch of different values for nBins. If you make nBins larger, it will cause all of your bins to shrink in size. This would give you a higher concentration of bins where you have a lot of data. As far as "losing behavior at the edges", I'm not sure what you mean.
gnovice
Near the min and max of the x coordinate, there are far fewer data points. This will pull the mean away from the extrema, and if there is interesting behavior near the extrema, it may not be evident.
Alex R