I have a problem involving a collection of continuous probability distribution functions, most of which are determined empirically (e.g. departure times, transit times). What I need is some way of taking two of these PDFs and doing arithmetic on them. E.g. if I have two values x taken from PDF X, and y taken from PDF Y, I need to get the PDF for (x+y), or any other operation f(x,y).
An analytical solution is not possible, so what I'm looking for is some representation of PDFs that allows such things. An obvious (but computationally expensive) solution is monte-carlo: generate lots of values of x and y, and then just measure f(x, y). But that takes too much CPU time.
I did think about representing the PDF as a list of ranges where each range has a roughly equal probability, effectively representing the PDF as the union of a list of uniform distributions. But I can't see how to combine them.
Does anyone have any good solutions to this problem?
Edit: The goal is to create a mini-language (aka Domain Specific Language) for manipulating PDFs. But first I need to sort out the underlying representation and algorithms.
Edit 2: dmckee suggests a histogram implementation. That is what I was getting at with my list of uniform distributions. But I don't see how to combine them to create new distributions. Ultimately I need to find things like P(x < y) in cases where this may be quite small.
Edit 3: I have a bunch of histograms. They are not evenly distributed because I'm generating them from occurance data, so basically if I have 100 samples and I want ten points in the histogram then I allocate 10 samples to each bar, and make the bars variable width but constant area.
I've figured out that to add PDFs you convolve them, and I've boned up on the maths for that. When you convolve two uniform distributions you get a new distribution with three sections: the wider uniform distribution is still there, but with a triangle stuck on each side the width of the narrower one. So if I convolve each element of X and Y I'll get a bunch of these, all overlapping. Now I'm trying to figure out how to sum them all and then get a histogram that is the best approximation to it.
I'm beginning to wonder if Monte-Carlo wasn't such a bad idea after all.
Edit 4: This paper discusses convolutions of uniform distributions in some detail. In general you get a "trapezoid" distribution. Since each "column" in the histograms is a uniform distribution, I had hoped that the problem could be solved by convolving these columns and summing the results.
However the result is considerably more complex than the inputs, and also includes triangles. Edit 5: [Wrong stuff removed]. But if these trapezoids are approximated to rectangles with the same area then you get the Right Answer, and reducing the number of rectangles in the result looks pretty straightforward too. This might be the solution I've been trying to find.
Edit 6: Solved! Here is the final Haskell code for this problem:
-- | Continuous distributions of scalars are represented as a
-- | histogram where each bar has approximately constant area but
-- | variable width and height. A histogram with N bars is stored as
-- | a list of N+1 values.
data Continuous = C {
cN :: Int,
-- ^ Number of bars in the histogram.
cAreas :: [Double],
-- ^ Areas of the bars. @length cAreas == cN@
cBars :: [Double]
-- ^ Boundaries of the bars. @length cBars == cN + 1@
} deriving (Show, Read)
{- | Add distributions. If two random variables @vX@ and @vY@ are
taken from distributions @x@ and @y@ respectively then the
distribution of @(vX + vY)@ will be @(x .+. y).
This is implemented as the convolution of distributions x and y.
Each is a histogram, which is to say the sum of a collection of
uniform distributions (the "bars"). Therefore the convolution can be
computed as the sum of the convolutions of the cross product of the
components of x and y.
When you convolve two uniform distributions of unequal size you get a
trapezoidal distribution. Let p = p2-p1, q - q2-q1. Then we get:
> | |
> | ______ |
> | | | with | _____________
> | | | | | |
> +-----+----+------- +--+-----------+-
> p1 p2 q1 q2
>
> gives h|....... _______________
> | /: :\
> | / : : \ 1
> | / : : \ where h = -
> | / : : \ q
> | / : : \
> +--+-----+-------------+-----+-----
> p1+q1 p2+q1 p1+q2 p2+q2
However we cannot keep the trapezoid in the final result because our
representation is restricted to uniform distributions. So instead we
store a uniform approximation to the trapezoid with the same area:
> h|......___________________
> | | / \ |
> | |/ \|
> | | |
> | /| |\
> | / | | \
> +-----+-------------------+--------
> p1+q1+p/2 p2+q2-p/2
-}
(.+.) :: Continuous -> Continuous -> Continuous
c .+. d = C {cN = length bars - 1,
cBars = map fst bars,
cAreas = zipWith barArea bars (tail bars)}
where
-- The convolve function returns a list of two (x, deltaY) pairs.
-- These can be sorted by x and then sequentially summed to get
-- the new histogram. The "b" parameter is the product of the
-- height of the input bars, which was omitted from the diagrams
-- above.
convolve b c1 c2 d1 d2 =
if (c2-c1) < (d2-d1) then convolve1 b c1 c2 d1 d2 else convolve1 b d1
d2 c1 c2
convolve1 b p1 p2 q1 q2 =
[(p1+q1+halfP, h), (p2+q2-halfP, (-h))]
where
halfP = (p2-p1)/2
h = b / (q2-q1)
outline = map sumGroup $ groupBy ((==) `on` fst) $ sortBy (comparing fst)
$ concat
[convolve (areaC*areaD) c1 c2 d1 d2 |
(c1, c2, areaC) <- zip3 (cBars c) (tail $ cBars c) (cAreas c),
(d1, d2, areaD) <- zip3 (cBars d) (tail $ cBars d) (cAreas d)
]
sumGroup pairs = (fst $ head pairs, sum $ map snd pairs)
bars = tail $ scanl (\(_,y) (x2,dy) -> (x2, y+dy)) (0, 0) outline
barArea (x1, h) (x2, _) = (x2 - x1) * h
Other operators are left as an exercise for the reader.