views:

347

answers:

4

Say I had some data, for which I want to fit a parametrized model over it. My goal is to find the best value for this model parameter.

I'm doing model selection using a AIC/BIC/MDL type of criterion which rewards models with low error but also penalizes models with high complexity (we're seeking the simplest yet most convincing explanation for this data so to speak, a la Occam's razor).

Following the above, this is an example of the sort of things I get for three different criteria (two are to be minimized, and one to be maximized):

aic-bic fit

Visually you can easily see the elbow shape and you would pick a value for the parameter somewhere in that region. The problem is that I'm doing do this for large number of experiments and I need a way to find this value without intervention.

My first intuition was to try to draw a line at 45 degrees angle from the corner and keep moving it until it intersect the curve, but that's easier said than done :) Also it can miss the region of interest if the curve is somewhat skewed.

Any thoughts on how to implement this, or better ideas?

Here's the samples needed to reproduce one of the plots above:

curve = [8.4663 8.3457 5.4507 5.3275 4.8305 4.7895 4.6889 4.6833 4.6819 4.6542 4.6501 4.6287 4.6162 4.585 4.5535 4.5134 4.474 4.4089 4.3797 4.3494 4.3268 4.3218 4.3206 4.3206 4.3203 4.2975 4.2864 4.2821 4.2544 4.2288 4.2281 4.2265 4.2226 4.2206 4.2146 4.2144 4.2114 4.1923 4.19 4.1894 4.1785 4.178 4.1694 4.1694 4.1694 4.1556 4.1498 4.1498 4.1357 4.1222 4.1222 4.1217 4.1192 4.1178 4.1139 4.1135 4.1125 4.1035 4.1025 4.1023 4.0971 4.0969 4.0915 4.0915 4.0914 4.0836 4.0804 4.0803 4.0722 4.065 4.065 4.0649 4.0644 4.0637 4.0616 4.0616 4.061 4.0572 4.0563 4.056 4.0545 4.0545 4.0522 4.0519 4.0514 4.0484 4.0467 4.0463 4.0422 4.0392 4.0388 4.0385 4.0385 4.0383 4.038 4.0379 4.0375 4.0364 4.0353 4.0344];
plot(1:100, curve)

EDIT

I accepted the solution given by Jonas. Basically, for each point p on the curve, we find the one with the maximum distance d given by:

point-line-distance

+1  A: 

First, a quick calculus review: the first derivative f' of each graph represents the rate at which the function f being graphed is changing. The second derivative f'' represents the rate at which f' is changing. If f'' is small, it means that the graph is changing direction at a modest pace. But if f'' is large, it means the graph is rapidly changing direction.

You want to isolate the points at which f'' is largest over the domain of the graph. These will be candidate points to select for your optimal model. Which point you pick will have to be up to you, since you haven't specified exactly how much you value fitness versus complexity.

John Feminella
The idea is definitely a valid one, but the initial problem remains; how do you specify that threshold value after which you decide that the rate of change is too slow or too fast.. As I described before, I have a large number of experiments which makes it hard to set a general value for all cases.
Amro
As a general rule, you could just pick the largest `f''`.
John Feminella
This is how we tried to do it before. However, taking two derivatives on somewhat noisy data turned out to be not robust enough for our application.
Jonas
+2  A: 

So one way of solving this would be two fit two lines to the L of your elbow. But since there are only a few points in one portion of the curve (as I mentioned in the comment), line fitting takes a hit unless you detect which points are spaced out and interpolate between them to manufacture a more uniform series and then use RANSAC to find two lines to fit the L - a little convoluted but not impossible.

So here's a simpler solution - the graphs you've put up look the way they do thanks to MATLAB's scaling (obviously). So all I did was minimize the distance from the "origin" to your points using the scale information.

Please note: The origin estimation can be improved dramatically, but I'll leave that to you.

Here's the code:

%% Order
curve = [8.4663 8.3457 5.4507 5.3275 4.8305 4.7895 4.6889 4.6833 4.6819 4.6542 4.6501 4.6287 4.6162 4.585 4.5535 4.5134 4.474 4.4089 4.3797 4.3494 4.3268 4.3218 4.3206 4.3206 4.3203 4.2975 4.2864 4.2821 4.2544 4.2288 4.2281 4.2265 4.2226 4.2206 4.2146 4.2144 4.2114 4.1923 4.19 4.1894 4.1785 4.178 4.1694 4.1694 4.1694 4.1556 4.1498 4.1498 4.1357 4.1222 4.1222 4.1217 4.1192 4.1178 4.1139 4.1135 4.1125 4.1035 4.1025 4.1023 4.0971 4.0969 4.0915 4.0915 4.0914 4.0836 4.0804 4.0803 4.0722 4.065 4.065 4.0649 4.0644 4.0637 4.0616 4.0616 4.061 4.0572 4.0563 4.056 4.0545 4.0545 4.0522 4.0519 4.0514 4.0484 4.0467 4.0463 4.0422 4.0392 4.0388 4.0385 4.0385 4.0383 4.038 4.0379 4.0375 4.0364 4.0353 4.0344];
x_axis = 1:numel(curve);
points = [x_axis ; curve ]'; %' - SO formatting

%% Get the scaling info
f = figure(1);
plot(points(:,1),points(:,2));
ticks = get(get(f,'CurrentAxes'),'YTickLabel');
ticks = str2num(ticks);
aspect = get(get(f,'CurrentAxes'),'DataAspectRatio');
aspect = [aspect(2) aspect(1)];    
close(f);   

%% Get the "origin"
O = [x_axis(1) ticks(1)];

%% Scale the data - now the scaled values look like MATLAB''s idea of
% what a good plot should look like
scaled_O = O.*aspect;
scaled_points = bsxfun(@times,points,aspect);

%% Find the closest point
del = sum((bsxfun(@minus,scaled_points,scaled_O).^2),2);
[val ind] = min(del);
best_ROC = [ind curve(ind)];

%% Display
plot(x_axis,curve,'.-');
hold on;
plot(O(1),O(2),'r*');
plot(best_ROC(1),best_ROC(2),'k*');

Results:

results

ALSO for the Fit(maximize) curve you'll have to change to origin to [x_axis(1) ticks(end)].

Jacob
an interesting idea, my only problem is the fact that you have to plot and use MATLAB's automatic scaling to find the origin.. This would not scale well especially since I have over a million of those curves to process on the fly. The xaxis is always guaranteed to start at 1, but I have no bound on the yaxis... and to answer you question, the shape is almost arbitrary although I expect a quick rise/fall in the beginning of the curve
Amro
So I suppose the problem is to figure out MATLAB's automatic scaling ... will check it out.
Jacob
+1  A: 

The point of information theoretic model selection is that it already accounts for the number of parameters. Therefore, there is no need to find an elbow, you need only find the minimum.

Finding the elbow of the curve is only relevant when using fit. Even then the method that you choose to select the elbow is in a sense setting a penalty for the number of parameters. To select the elbow you will want to minimize the distance from the origin to the curve. The relative weighting of the two dimensions in the distance calculation will create an inherent penalty term. Information theoretic criterion set this metric based on the number of parameters and the number of data samples used to estimate the model.

Bottom line recommendation: Use BIC and take the minimum.

KennyMorton
But finding finimum is unoptimal. I.e. if you will take a look into BIC curve, the minimum will be computed for the element at postion 100. But difference between complexity of 20 and 100 is quite small - there is too small gain in further complexion of the model.
Gacek
So you are saying that BIC has an inadequate penalty for model complexity (which my be true, although in my opinion it is not). My argument is that choosing the elbow of the curve through some method creates an ad hoc unknown penalty term. If you don't like the answer provided by BIC or AIC or any other IC based method, you would be better off developing a penalty term and using that. Just my opinion.
KennyMorton
in a way I agree with you because when using AIC and BIC, there is no absolute baseline against which we compare, they are rather used in a relative manner to compare models against each other (when all other things are equal) and by looking for this L-shape, we are effectively introducing an ad hoc penalty term which may uneven the playing field..The only problem with this is that 100 was chosen becuase of computational costs, and the true upper bound for the complexity is somthing near 10000
Amro
+4  A: 

A quick way of finding the elbow is to draw a line from the first to the last point of the curve and then find the data point that is farthest away from that line. This is of course somewhat dependent on the number of points you have in the flat part of the line, but if you test the same number of parameters each time, it should come out reasonably ok.

curve = [8.4663 8.3457 5.4507 5.3275 4.8305 4.7895 4.6889 4.6833 4.6819 4.6542 4.6501 4.6287 4.6162 4.585 4.5535 4.5134 4.474 4.4089 4.3797 4.3494 4.3268 4.3218 4.3206 4.3206 4.3203 4.2975 4.2864 4.2821 4.2544 4.2288 4.2281 4.2265 4.2226 4.2206 4.2146 4.2144 4.2114 4.1923 4.19 4.1894 4.1785 4.178 4.1694 4.1694 4.1694 4.1556 4.1498 4.1498 4.1357 4.1222 4.1222 4.1217 4.1192 4.1178 4.1139 4.1135 4.1125 4.1035 4.1025 4.1023 4.0971 4.0969 4.0915 4.0915 4.0914 4.0836 4.0804 4.0803 4.0722 4.065 4.065 4.0649 4.0644 4.0637 4.0616 4.0616 4.061 4.0572 4.0563 4.056 4.0545 4.0545 4.0522 4.0519 4.0514 4.0484 4.0467 4.0463 4.0422 4.0392 4.0388 4.0385 4.0385 4.0383 4.038 4.0379 4.0375 4.0364 4.0353 4.0344];
% get coordinates of all the points
nPoints = length(curve);
allCoord = [1:nPoints;curve]'; %' SO formatting
% pull out first point
firstPoint = allCoord(1,:);
% get vector between first and last point - this is the line
lineVec = allCoord(end,:) - firstPoint;
% normalize the line vector
lineVecN = lineVec / sqrt(sum(lineVec.^2));
% find the distance from each point to the line:
% vector between all points and first point
vecFromFirst = bsxfun(@minus,allCoord,firstPoint);

% To calculate the distance to the line, we split vecFromFirst
% into two components, one that is parallel to the line and one that is perpendicular
% Then, we take the norm of the part that is perpendicular to the line and get the
% distance.
% We find the vector parallel to the line by projecting vecFromFirst onto the line
% The perpendicular vector is vecFromFirst - vecFromFirstParallel
% We project vecFromFirst by taking the scalar product of the vector with the 
% unit vector that points in the direction of the line (this gives us the length
% of the projection of vecFromFirst onto the line). If we multiply the scalar product
% by the unit vector, we have vecFromFirstParallel
scalarProduct = dot(vecFromFirst,repmat(lineVecN,nPoints,1),2);
vecFromFirstParallel = scalarProduct * lineVecN;
vecToLine = vecFromFirst - vecFromFirstParallel;
% distance to line is the norm of vecToLine
distToLine = sqrt(sum(vecToLine.^2,2));
% plot the distance to the line
figure('Name','distance from curve to line'),plot(distToLine)

% now all you need is to find the maximum
[maxDist,idxOfBestPoint] = max(distToLine);

% plot
figure,plot(curve)
hold on
plot(allCoord(idxOfBestPoint,1),allCoord(idxOfBestPoint,2),'or')
Jonas
Thank you I really like this solution! I have to admit I am having a hard time following how you computed the point-line distance?
Amro
I agree that this was not a well-commented line. I have tried to describe it a bit better. Who'd have thought that geometry would come in so handy eventually?
Jonas
@Jonas That's a neat trick, would've never thought of that sort of way of finding elbows
JudoWill
thanks for the explanation. I eventually found this page that refreshed my geometry: http://en.wikipedia.org/wiki/Vector_projection
Amro