views:

191

answers:

6

I have a number of data sets (between 50 and 500 points each, where every point takes a positive integer value) and need to determine which distribution best describes each one. I have done this manually for several of them, but need to automate it going forward.

Some of the sets are completely modal (every datum has the value 15), some are strongly modal or bimodal, some are bell curves (often skewed and with differing degrees of kurtosis/pointiness), some are roughly flat, and there are any number of other possible distributions (Poisson, power-law, etc.). I need a way to determine which distribution best describes the data and (ideally) also gives me a goodness-of-fit metric so that I know how confident I can be in the analysis.

Existing open-source libraries would be ideal, followed by well documented algorithms that I can implement myself.

+2  A: 

Looking for a distribution that fits is unlikely to give you good results in the absence of some a priori knowledge. You may find a distribution that happens to fit well but is unlikely to be the underlying distribution.

Do you have any metadata available that would hint at what the data means? E.g., "this is open-ended data sampled from a natural population, so it's some sort of normal distribution", vs. "this data is inherently bounded at 0 and discrete, so check for the best-fitting Poisson".

I don't know of any distribution solvers for Java off the top of my head, and I don't know of any that will guess which distribution to use. You could examine some statistical properties (skew, etc.) and make some guesses here, but you're more likely to end up with an accidentally good fit that does not adequately represent the underlying distribution. Real data is noisy, and there are just too many degrees of freedom if you don't even know what distribution it is.
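As a rough illustration of the "examine some statistical properties" idea, here is a minimal Java sketch that computes the mean, standard deviation, skewness and excess kurtosis of a sample. The class name `SampleMoments` is made up for this example, and it assumes population (divide-by-n) moments; treat it as a screening aid, not as a distribution solver.

```java
/** Summary statistics for an integer sample (hypothetical helper, not from any library). */
final class SampleMoments {
    final double mean;
    final double stdDev;
    final double skewness;
    final double excessKurtosis;

    SampleMoments(int[] data) {
        if (data.length == 0) throw new IllegalArgumentException("empty sample");
        int n = data.length;

        double sum = 0;
        for (int x : data) sum += x;
        mean = sum / n;

        // Central moments, using the population (divide-by-n) convention.
        double m2 = 0, m3 = 0, m4 = 0;
        for (int x : data) {
            double d = x - mean;
            m2 += d * d;
            m3 += d * d * d;
            m4 += d * d * d * d;
        }
        m2 /= n;
        m3 /= n;
        m4 /= n;

        stdDev = Math.sqrt(m2);
        // When every value is identical the variance is zero; report 0 instead of NaN.
        skewness = m2 == 0 ? 0 : m3 / Math.pow(m2, 1.5);
        excessKurtosis = m2 == 0 ? 0 : m4 / (m2 * m2) - 3.0;
    }

    public static void main(String[] args) {
        SampleMoments m = new SampleMoments(new int[] {12, 14, 15, 15, 15, 16, 17, 21});
        System.out.printf("mean=%.3f sd=%.3f skew=%.3f kurt=%.3f%n",
                m.mean, m.stdDev, m.skewness, m.excessKurtosis);
    }
}
```

A standard deviation of zero flags the completely modal case. Skewness and excess kurtosis near zero are consistent with a normal shape but do not prove it, and an excess kurtosis of roughly -1.2 is what a flat distribution would give.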

Alex Feinman
I have a good idea for each of these data-sets of what the distribution 'should' be, along with a few alternative distributions that it might be. For example, my most common use case will expect it to be normal, but if it isn't, then it is most likely modal or flat.
Eadwacer
For those it's pretty easy. A stdev will tell you whether it's modal or flat. Normalize the data (to unit amplitude and zero mean) and measure how well it fits a standard distribution. It gets more complicated for distributions with more than 2 parameters (e.g. Gaussian, which adds width).
Alex Feinman
+1  A: 

What you're looking for comes under the general heading of "goodness of fit." You could search on "goodness of fit test."

Donald Knuth describes a couple of popular goodness-of-fit tests in Seminumerical Algorithms: the chi-squared test and the Kolmogorov-Smirnov test. But you've got to have some idea first of what distribution you want to test. For example, if you have bell-curve data, you might try normal or Cauchy distributions.
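To make the chi-squared idea concrete for integer data, here is a minimal Java sketch that tests a sample against a Poisson distribution whose rate is the sample mean. Everything here is an assumption for illustration: the class name `PoissonFit`, the choice of Poisson as the candidate, the rule of thumb of pooling bins until each expects at least 5 points, and values small enough to use as array indexes.

```java
/** Chi-squared goodness of fit of integer data against a Poisson (hypothetical helper). */
final class PoissonFit {
    /** Rule of thumb: pool adjacent bins until each expects at least this many points. */
    static final double MIN_EXPECTED = 5.0;

    final double chiSquared;
    /** Pooled bins minus 1, minus 1 more for the estimated rate; may be <= 0 for tiny samples. */
    final int degreesOfFreedom;

    private PoissonFit(double chiSquared, int degreesOfFreedom) {
        this.chiSquared = chiSquared;
        this.degreesOfFreedom = degreesOfFreedom;
    }

    static PoissonFit of(int[] data) {
        if (data.length == 0) throw new IllegalArgumentException("empty sample");
        int n = data.length, max = 0;
        long sum = 0;
        for (int x : data) {
            if (x < 0) throw new IllegalArgumentException("negative value: " + x);
            sum += x;
            max = Math.max(max, x);
        }
        double lambda = (double) sum / n; // maximum-likelihood estimate of the rate

        // Observed counts for the values 0..max.
        double[] observed = new double[max + 1];
        for (int x : data) observed[x]++;

        // Expected counts; the last bin absorbs the whole Poisson tail P(X >= max).
        double[] expected = new double[max + 1];
        double pmf = Math.exp(-lambda), cumulative = 0;
        for (int k = 0; k < max; k++) {
            expected[k] = n * pmf;
            cumulative += pmf;
            pmf *= lambda / (k + 1);
        }
        expected[max] = n * (1 - cumulative);

        // Pool adjacent bins, then sum (O - E)^2 / E over the pooled bins.
        double chi2 = 0, obsAcc = 0, expAcc = 0, expSeen = 0;
        int groups = 0;
        for (int k = 0; k <= max; k++) {
            obsAcc += observed[k];
            expAcc += expected[k];
            expSeen += expected[k];
            boolean enoughHere = expAcc >= MIN_EXPECTED;
            boolean enoughLeft = n - expSeen >= MIN_EXPECTED;
            if (k == max || (enoughHere && enoughLeft)) {
                chi2 += (obsAcc - expAcc) * (obsAcc - expAcc) / expAcc;
                groups++;
                obsAcc = 0;
                expAcc = 0;
            }
        }
        return new PoissonFit(chi2, groups - 2);
    }

    public static void main(String[] args) {
        PoissonFit fit = PoissonFit.of(new int[] {2, 3, 3, 1, 4, 2, 5, 3, 0, 2, 3, 4, 1, 2, 6, 3});
        System.out.println("chi2=" + fit.chiSquared + " dof=" + fit.degreesOfFreedom);
    }
}
```

The statistic is then compared against the chi-squared critical value for the reported degrees of freedom (for example, about 9.49 at the 5% level with 4 degrees of freedom); a larger value suggests a poor fit. A statistics library can convert the statistic into a p-value.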

John D. Cook
+1  A: 

Look at Apache Commons Math.

bmargulies
Checking it out now. This looks very useful.
Eadwacer
+2  A: 

This may be above and beyond what you want to do, but it seems the most complete approach (and it allows access to the wealth of statistical knowledge available inside R):

  1. use JRI to communicate with the R statistical language
  2. use R, internally, as indicated in this thread
tucuxi
+1 - use the right tool for the job, always good advice.
Carl
A: 

I've heard of a package called Eureqa that might fill the bill nicely. I've only downloaded it; I haven't tried it myself yet.

duffymo
A: 

If all you really need the distribution for is to model the data you have sampled, you can make your own distribution based on the data you have:

1. Create a histogram of your sample: One method for selecting the bin size is here. There are other methods for selecting bin size, which you may prefer.

2. Derive the sample CDF: Think of the histogram as your PDF, and just compute the integral. It's probably best to scale the height of the bins so that the CDF has the right characteristics ... namely that the value of the CDF at +Infinity is 1.0.

To use the distribution for modeling purposes:

3. Draw X from your distribution: Make a draw Y from U(0,1). Use a reverse lookup on your CDF of the value Y to determine the X such that CDF(X) = Y. Where the CDF is strictly increasing, X is unique; on a flat stretch (empty bins), take the smallest X with CDF(X) >= Y.
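The three steps above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: because the data are integers, it uses one histogram bin per distinct value instead of a chosen bin size, and the class name `EmpiricalDistribution` and its methods are invented for this example.

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Empirical distribution built from an integer sample (hypothetical helper). */
final class EmpiricalDistribution {
    private final int[] values;  // distinct observed values, ascending
    private final double[] cdf;  // cdf[i] = P(X <= values[i])

    EmpiricalDistribution(int[] data) {
        if (data.length == 0) throw new IllegalArgumentException("empty sample");

        // Step 1: histogram with one bin per distinct integer value.
        TreeMap<Integer, Integer> histogram = new TreeMap<>();
        for (int x : data) histogram.merge(x, 1, Integer::sum);

        // Step 2: running sum of the bin counts, scaled so the final CDF value is 1.0.
        values = new int[histogram.size()];
        cdf = new double[histogram.size()];
        long running = 0;
        int i = 0;
        for (Map.Entry<Integer, Integer> bin : histogram.entrySet()) {
            running += bin.getValue();
            values[i] = bin.getKey();
            cdf[i] = (double) running / data.length;
            i++;
        }
    }

    /** Reverse lookup: the smallest observed x with CDF(x) >= u, for u in [0, 1]. */
    int quantile(double u) {
        int lo = 0, hi = values.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cdf[mid] >= u) hi = mid; else lo = mid + 1;
        }
        return values[lo];
    }

    /** Step 3: draw Y from U(0,1) and map it back through the CDF. */
    int sample(Random rng) {
        return quantile(rng.nextDouble());
    }

    public static void main(String[] args) {
        EmpiricalDistribution d =
                new EmpiricalDistribution(new int[] {12, 14, 15, 15, 15, 16, 17, 21});
        Random rng = new Random(42);
        for (int i = 0; i < 5; i++) System.out.println(d.sample(rng));
    }
}
```

Note that a distribution built this way can only ever return values that appeared in the original sample, which is fine for resampling but says nothing about which named distribution the data follow.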

andand