ansaurus

Question

Representing continuous probability distributions

Answer 1

+1 A:

Is there anything that stops you from employing a mini-language for this?

By that I mean, define a language that lets you write f = x + y and evaluates f for you just as written. And similarly for g = x * z, h = y(x), etc. ad nauseam. (The semantics I'm suggesting call for the evaluator to draw a random number from each innermost PDF appearing on the RHS at evaluation time, and not to try to understand the composite form of the resulting PDFs. This may not be fast enough...)
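A minimal sketch of those sample-at-evaluation semantics, written in Python purely for illustration. The helper names (`uniform`, `normal`, `combine`, `estimate_mean`) are hypothetical and not from any library; each distribution is modeled as a zero-argument function that returns one fresh draw.

```python
import random

# Hypothetical leaf distributions: each is a zero-argument sampler.
def uniform(low, high):
    return lambda: random.uniform(low, high)

def normal(mu, sigma):
    return lambda: random.gauss(mu, sigma)

def combine(op, *dists):
    """Build a new sampler that draws once from every input, then applies op.

    No attempt is made to work out the composite PDF; the expression is
    simply re-evaluated on fresh draws every time it is sampled.
    """
    return lambda: op(*(d() for d in dists))

def estimate_mean(dist, n=100_000):
    """Monte Carlo estimate of the mean of a sampler."""
    return sum(dist() for _ in range(n)) / n

random.seed(0)  # fixed seed so that runs are repeatable
x = uniform(0.0, 1.0)
y = normal(2.0, 0.5)
f = combine(lambda a, b: a + b, x, y)  # f = x + y
g = combine(lambda a, b: a * b, x, y)  # g = x * y
```

Calling `f()` yields one sample of x + y. Any statistic has to be estimated from many such calls, which is where the speed concern comes from.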

Assuming that you understand the precision limits you need, you can represent a PDF fairly simply with a histogram or spline (the former being a degenerate case of the latter). If you need to mix analytically defined PDFs with experimentally determined ones, you'll have to add a type mechanism.

A histogram is just an array, the contents of which represent the incidence in a particular region of the input range. You haven't said if you have a language preference, so I'll assume something C-like. You need to know the bin structure (uniform sizes are easy, but not always best), including the high and low limits and possibly the normalization:

struct histogram_struct {
  int bins; /* Assumed to be uniform */
  double low;
  double high;
  /* double normalization; */
  /* double *errors; */ /* if used, initialize with enough space,
                         * and store _squared_ errors
                         */
  double contents[];
};
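As a rough illustration of how such a structure gets used, here is a Python sketch of a uniform-bin histogram with a fill operation and a normalization step. The class and method names are made up for this example, and out-of-range values are simply dropped, which is an assumption rather than something the struct above specifies.

```python
from dataclasses import dataclass, field

@dataclass
class Histogram:
    bins: int      # number of bins, assumed uniform
    low: float     # lower edge of the first bin
    high: float    # upper edge of the last bin
    contents: list = field(default_factory=list)

    def __post_init__(self):
        if not self.contents:
            self.contents = [0.0] * self.bins

    def bin_width(self):
        return (self.high - self.low) / self.bins

    def fill(self, value, weight=1.0):
        """Add weight to the bin containing value."""
        if not (self.low <= value < self.high):
            return  # assumption: no underflow/overflow bins, so drop it
        index = int((value - self.low) / self.bin_width())
        index = min(index, self.bins - 1)  # guard against rounding at the top edge
        self.contents[index] += weight

    def density(self):
        """Return bin heights scaled so that the histogram integrates to 1."""
        total = sum(self.contents)
        if total == 0:
            raise ValueError("cannot normalize an empty histogram")
        width = self.bin_width()
        return [c / (total * width) for c in self.contents]
```

Dividing by both the total count and the bin width is what turns raw counts into a piecewise-constant estimate of the PDF.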

This kind of thing is very common in scientific analysis software, and you might want to use an existing implementation.

dmckee 2008-12-28 08:52:01

I do indeed want to write a mini-language. But I want the underlying semantics to be something more efficient than Monte Carlo.

Paul Johnson 2008-12-28 08:55:28

Sorry, I wasn't clear. Do you really need to know what the composite PDFs are, or do you simply need to be able to draw numbers from them efficiently? The latter case really only requires a good interpreter for PDF composition inside your code.

dmckee 2008-12-28 08:58:10

I want to know what the composite PDFs are. Ultimately I need to be able to determine things like P(x<y).

Paul Johnson 2008-12-28 09:24:50

> you can represent a PDF fairly simply with a histogram

Yes. Do you know of any algorithms for doing this?

Paul Johnson 2008-12-28 09:27:24

::casts about for a way to save this idea:: Uh, er, um. How about a cluster. Yeah, you need a cluster...

dmckee 2008-12-28 09:28:32

Answer 2

+1 A:

Autonomous mobile robotics deals with similar issues in localization and navigation, in particular Markov localization and the Kalman filter (sensor fusion). See "An experimental comparison of localization methods continued", for example.

Another approach you could borrow from mobile robots is path planning using potential fields.

eed3si9n 2008-12-28 08:57:03

Answer 3

A:

If you want some fun, try representing them symbolically like Maple or Mathematica would do. Maple uses directed acyclic graphs, while Mathematica uses a list/Lisp-like approach (I believe, but it's been a loooong time since I even thought about this).

Do all your manipulations symbolically, then at the end push through numerical values. (Or just find a way to launch one of them from a shell and do the computations there.)

Paul.

Paul W Homer 2008-12-28 18:23:23

Answer 4

+1 A:

A couple of responses:

1) If you have empirically determined PDFs, then either you have histograms or you have an approximation to a parametric PDF. A PDF is a continuous function and you don't have infinite data...

2) Let's assume that the variables are independent, so p(x,y) = p(x)p(y). If you make the PDFs discrete, then P(f(x,y) meets your target) is the sum of p(x)p(y) over all the combinations of x and y such that f(x,y) meets your target.
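The summation in point 2 can be sketched in a few lines of Python. This is only an illustration with hypothetical function names; it uses two fair dice as the discrete distributions. For binned continuous PDFs the keys would be bin centers and the values bin probabilities, and the result would still need re-binning.

```python
from collections import defaultdict

def combine_independent(px, py, f):
    """Distribution of f(x, y) for independent discrete x and y.

    px and py map each value to its probability.
    """
    out = defaultdict(float)
    for x, p in px.items():
        for y, q in py.items():
            out[f(x, y)] += p * q  # independence: p(x, y) = p(x) * p(y)
    return dict(out)

def prob(px, py, predicate):
    """P(predicate(x, y)) for independent discrete x and y."""
    return sum(p * q
               for x, p in px.items()
               for y, q in py.items()
               if predicate(x, y))

die = {k: 1 / 6 for k in range(1, 7)}
total = combine_independent(die, die, lambda a, b: a + b)  # sum of two dice
p_less = prob(die, die, lambda a, b: a < b)                # P(x < y)
```

The cost is proportional to the product of the two table sizes, so this stays cheap as long as the number of bins per variable is modest.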

If you are going to fit the empirical PDFs to standard PDFs, e.g. the normal distribution, then you can use already-determined functions to figure out the sum, etc.

If the variables are not independent, then you have more trouble on your hands and I think you have to use copulas.

I think that defining your own mini-language, etc., is overkill. You can do this with arrays...

af 2008-12-28 20:06:53

Answer 5

+1 A:

Some initial thoughts:

First, Mathematica has a nice facility for doing this with exact distributions.

Second, representation as histograms (i.e., empirical PDFs) is problematic since you have to make choices about bin size. That can be avoided by storing a cumulative distribution instead, i.e., an empirical CDF. (In fact, you then retain the ability to recreate the full data set of samples that the empirical distribution is based on.)

Here's some ugly Mathematica code to take a list of samples and return an empirical CDF, namely a list of value-probability pairs. Run the output of this through ListPlot to see a plot of the empirical CDF.

empiricalCDF[t_] := Flatten[{{#[[2,1]],#[[1,2]]},#[[2]]}&/@Partition[Prepend[Transpose[{#[[1]], Rest[FoldList[Plus,0,#[[2]]]]/Length[t]}&[Transpose[{First[#],Length[#]}&/@ Split[Sort[t]]]]],{Null,0}],2,1],1]
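For readers without Mathematica, here is a simplified Python sketch of the same idea. It is not a line-for-line translation: it returns one (value, cumulative probability) pair per distinct sample value, whereas the one-liner above appears to also emit the extra corner points needed to draw the steps. The function name is hypothetical.

```python
from collections import Counter

def empirical_cdf(samples):
    """Return sorted (value, P(X <= value)) pairs, one per distinct value."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    cdf = []
    running = 0
    for value in sorted(counts):
        running += counts[value]        # number of samples <= value
        cdf.append((value, running / n))
    return cdf
```

Because the jump at each value is count / n, the original multiset of samples can be rebuilt from the returned pairs as long as n is known.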

Finally, here's some information on combining discrete probability distributions:

http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter7.pdf

dreeves 2008-12-29 05:57:57

I agree with your point about bin size, but the volume of historical data is large, so I'm going to stick with bins. Useful to see the stuff about empirical CDFs though. I'd already read the chapter you point to. Thanks.

Paul Johnson 2008-12-29 19:18:30

Answer 6

+1 A:

I think the histograms or the list of 1/N area regions is a good idea. For the sake of argument, I'll assume that you'll have a fixed N for all distributions.

Use the paper you linked in edit 4 to generate the new distribution. Then, approximate it with a new N-element distribution.

If you don't want N to be fixed, it's even easier. Take each convex polygon (trapezoid or triangle) in the new generated distribution and approximate it with a uniform distribution.

Mr Fooz 2008-12-30 14:57:39

Answer 7

+1 A:

Another suggestion is to use kernel densities. Especially if you use Gaussian kernels, then they can be relatively easy to work with... except that the distributions quickly explode in size without care. Depending on the application, there are additional approximation techniques like importance sampling that can be used.
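To make the size explosion concrete, here is a small Python sketch under stated assumptions: each kernel density is treated as an equal-weight mixture of Gaussians with a shared bandwidth, and the two variables are independent. It relies on the standard fact that the sum of two independent Gaussians is Gaussian with the means and variances added. The function names are illustrative only.

```python
def kde_as_mixture(samples, bandwidth):
    """Represent a Gaussian KDE as a list of (weight, mean, variance)."""
    weight = 1.0 / len(samples)
    return [(weight, s, bandwidth ** 2) for s in samples]

def add_independent_mixtures(a, b):
    """Mixture for X + Y, where X ~ a and Y ~ b are independent.

    Every component of a pairs with every component of b, so the
    result has len(a) * len(b) components.
    """
    return [(wa * wb, ma + mb, va + vb)
            for wa, ma, va in a
            for wb, mb, vb in b]

def mixture_mean(mix):
    """Mean of a mixture: the weighted average of the component means."""
    return sum(w * m for w, m, _ in mix)
```

Adding a 1,000-kernel density to another 1,000-kernel density this way gives a million components, which is why some pruning or re-approximation step is needed after each operation.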

Mr Fooz 2008-12-30 15:02:41
