Consider the set of non-decreasing surjective (onto) functions from (-inf,inf) to [0,1]. (Typical CDFs satisfy this property.) In particular, for any real number x, 0 <= f(x) <= 1. The logistic function is perhaps the best-known example.
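For concreteness, here is that canonical example in Mathematica:

```mathematica
(* The logistic function: non-decreasing, with range (0,1),
   approaching 0 as x -> -Infinity and 1 as x -> Infinity. *)
logistic[x_] := 1/(1 + Exp[-x])
```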

We are now given some constraints in the form of a list of x-values and for each x-value, a pair of y-values that the function must lie between. We can represent that as a list of {x,ymin,ymax} triples such as

constraints = {{0, 0, 0}, {1, 0.00311936, 0.00416369}, {2, 0.0847077, 0.109064}, 
 {3, 0.272142, 0.354692}, {4, 0.53198, 0.646113}, {5, 0.623413, 0.743102}, 
 {6, 0.744714, 0.905966}}

Graphically that looks like this:

constraints on a cdf

We now seek a curve that respects those constraints. For example:

fitted cdf

Let's first try a simple interpolation through the midpoints of the constraints:

mids = ({#1, Mean[{#2,#3}]}&) @@@ constraints
f = Interpolation[mids, InterpolationOrder->0]

Plotted, f looks like this:

interpolated cdf

That function is not surjective. Also, we'd like it to be smoother. We can increase the interpolation order, but then it violates the constraint that its range stay within [0,1]:

interpolated cdf with higher interpolation order
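For reference, a plot like the one above can be reproduced with something along these lines (InterpolationOrder -> 3 is my guess; any order above 1 shows the effect):

```mathematica
(* Cubic interpolation through the same midpoints; near steep segments the
   curve overshoots below 0 or above 1, violating the range constraint. *)
fsmooth = Interpolation[mids, InterpolationOrder -> 3];
Plot[fsmooth[x], {x, 0, 6}, PlotRange -> All]
```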

The goal, then, is to find the smoothest function that satisfies the constraints:

  1. Non-decreasing.
  2. Tends to 0 as x approaches negative infinity and tends to 1 as x approaches infinity.
  3. Passes through a given list of y-error-bars.

The first example I plotted above seems like a good candidate, but I produced it with Mathematica's FindFit function assuming a lognormal CDF. That works well in this specific example, but in general there need not be a lognormal CDF that satisfies the constraints.

A: 

You can try to fit a Bézier curve through the midpoints. Specifically, I think you want a C2-continuous curve.

jeffamaphone
Thanks jeff. I think the tricky part is what I've now listed as criterion 2 in my revised version of the question, or what unutbu lists as criterion 5: the function should tend to 1 as x goes to infinity.
dreeves
+4  A: 
unutbu
Thanks for this answer! I didn't know about Monotone Cubic Interpolation. If anyone knows of a Mathematica implementation of that, please post it as an answer. You're right that the criteria do not uniquely determine a CDF; thanks for clarifying what the criteria are! (Btw, your #4 is actually subsumed by #3 -- note the first degenerate error bar in my list of example constraints.)
dreeves
As for your proposed hack for criterion 5: I tried that with Mathematica's normal cubic interpolation (not enforcing monotonicity, though that didn't seem to be an issue) and didn't find it satisfying. I don't want the interpolating function to exceed 1 for any x. True, x can be made big enough that it doesn't matter in practice, but then the function doesn't approach 1 quickly enough.
dreeves
What if we stated the problem as "find the smoothest possible non-decreasing function from (-inf,inf) to [0,1] such that it passes through a given list of error bars"? Maybe I'll edit the question to put it that way.
dreeves
I've now reworked the question, thanks in part to your answer.
dreeves
Wow, your transformation idea in Comment 2 is really clever. I hadn't seen that when I wrote my solution (now posted as an answer and perhaps sufficient for my immediate purposes). Thanks again for all the help on this!
dreeves
+2  A: 

I have found a solution that gives reasonable results for a variety of inputs. I start by fitting a model -- once to the low ends of the constraints, and again to the high ends. I'll refer to the mean of these two fitted functions as the "ideal function". I use this ideal function to extrapolate to the left and to the right of where the constraints end, as well as to interpolate between any gaps in the constraints. I compute values for the ideal function at regular intervals, including all the constraints, from where the function is nearly zero on the left to where it's nearly one on the right. At the constraints, I clip these values as necessary to satisfy the constraints. Finally, I construct an interpolating function that goes through these values.

My Mathematica implementation follows.
First, a couple of helper functions:

(* Distance from x to the nearest member of list l. *)
listdist[x_, l_List] := Min[Abs[x - #] & /@ l]

(* Return a value x for the variable var such that expr/.var->x is at least (or
   at most, if dir is -1) t. *)
invertish[expr_, var_, t_, dir_:1] := Module[{x = dir},
  While[dir*(expr /. var -> x) < dir*t, x *= 2];
  x]
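As a quick sanity check of invertish (a hypothetical usage I'm adding, not from the original post), applied to the logistic function:

```mathematica
(* Find an x at which the logistic function is at least 1 - 10^-6:
   invertish doubles x until the threshold is crossed. *)
invertish[1/(1 + Exp[-z]), z, 1 - 10^-6]      (* returns 16 *)
(* And an x at which it is at most 10^-6 (dir = -1 searches leftward): *)
invertish[1/(1 + Exp[-z]), z, 10^-6, -1]      (* returns -16 *)
```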

And here's the main function:

(* Return a non-decreasing interpolating function that maps from the
   reals to [0,1] and that is as close as possible to expr[var] without
   violating the given constraints (a list of {x,ymin,ymax} triples).
   The model, expr, will have free parameters, params, so first do a
   model fit to choose the parameters to satisfy the constraints as well
   as possible. *)
cfit[constraints_, expr_, params_, var_] := 
Block[{x,xlist,bots,tops,loparams,hiparams,lofit,hifit,xmin,xmax,gap,aug,bests},
  xlist = First /@ constraints;
  bots = Most /@ constraints; (* bottom points of the constraints *)
  tops = constraints /. {x_, _, ymax_} :> {x, ymax}; (* top points *)
  (* fit a model to the lower bounds of the constraints, and 
     to the upper bounds *)
  loparams = FindFit[bots, expr, params, var];
  hiparams = FindFit[tops, expr, params, var];
  lofit[z_] = (expr /. loparams /. var -> z);
  hifit[z_] = (expr /. hiparams /. var -> z);
  (* find x-values where the fitted function is very close to 0 and to 1 *)
  {xmin, xmax} = {
    Min@Append[xlist, invertish[expr /. hiparams, var, 10^-6, -1]],
    Max@Append[xlist, invertish[expr /. loparams, var, 1-10^-6]]};
  (* the smallest gap between x-values in constraints *)
  gap = Min[(#2 - #1 &) @@@ Partition[Sort[xlist], 2, 1]];
  (* augment the constraints to fill in any gaps and extrapolate so there are 
     constraints everywhere from where the function is almost 0 to where it's 
     almost 1 *)
  aug = SortBy[Join[constraints, Select[Table[{x, lofit[x], hifit[x]}, 
                                              {x, xmin,xmax, gap}], 
                                        listdist[#[[1]],xlist]>gap&]], First];
  (* pick a y-value from each constraint that is as close as possible to 
     the mean of lofit and hifit *)
  bests = ({#1, Clip[(lofit[#1] + hifit[#1])/2, {#2, #3}]} &) @@@ aug;
  Interpolation[bests, InterpolationOrder -> 3]]

For example, we can fit to a lognormal, normal, or logistic function:

g1 = cfit[constraints, CDF[LogNormalDistribution[mu,sigma], z], {mu,sigma}, z]
g2 = cfit[constraints, CDF[NormalDistribution[mu,sigma], z], {mu,sigma}, z]
g3 = cfit[constraints, 1/(1 + c*Exp[-k*z]), {c,k}, z]
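The three fits can be compared on one plot; something like the following (my reconstruction, not from the original answer) produces the figure below:

```mathematica
Plot[{g1[x], g2[x], g3[x]}, {x, 0, 10}]
```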

Here's what those look like for my original list of example constraints:

constrained fit to lognormal, normal, and logistic function

The normal and logistic are nearly on top of each other, and the lognormal is the blue curve.

These are not quite perfect. In particular, they aren't quite monotone. Here's a plot of the derivatives:

Plot[{g1'[x], g2'[x], g3'[x]}, {x, 0, 10}]

the derivatives of the fitted functions

That reveals some lack of smoothness as well as the slight non-monotonicity near zero. I welcome improvements on this solution!

dreeves
I noticed the obvious smoothness problems go away when I use the option Method->"Spline" in the Interpolation call. I don't yet know how to fix the monotonicity problem though.
dreeves
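That one-line change, as a sketch (using the same bests list as in the answer above):

```mathematica
(* Spline-based interpolation smooths the derivative, though it does not
   by itself enforce monotonicity. *)
Interpolation[bests, Method -> "Spline", InterpolationOrder -> 3]
```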
Actually, just setting InterpolationOrder->1 (linear interpolation) in the Interpolation call solves the monotonicity problem but of course makes the curve much less smooth. It's still roughly the same shape though and could be improved to an arbitrary degree by increasing the sampling granularity of the fitted function before interpolating.
dreeves
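A minimal sketch of that last idea, assuming the variables from inside cfit (lofit, hifit, xmin, xmax, gap) are in scope; the factor of 4 is an arbitrary choice, and clipping at the constraint locations is omitted for brevity:

```mathematica
(* Sample the mean of the two fitted curves on a grid 4x finer than the
   smallest constraint spacing, clip to [0,1], and interpolate linearly;
   linear interpolation through non-decreasing samples is itself
   non-decreasing. *)
fine = Table[{x, Clip[(lofit[x] + hifit[x])/2, {0, 1}]},
             {x, xmin, xmax, gap/4}];
Interpolation[fine, InterpolationOrder -> 1]
```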