I'm using C++ to write a ROOT script for some task. At some point I have an array of doubles in which many are quite similar and one or two are different. I want to average all the numbers except those sore thumbs. How should I approach it? For an example, let's consider:

x = [2.3, 2.4, 2.11, 10.5, 1.9, 2.2, 11.2, 2.1]

I want to somehow average all the numbers except 10.5 and 11.2, the dissimilar ones. This algorithm is going to be repeated several thousand times and the array of doubles has 2000 entries, so optimization (while maintaining readability) is desired. Thanks SO!

Check out: http://tinypic.com/r/111p0ya/3 The "dissimilar" numbers are the y-values of the pulse.

The point of this is to determine the ground value for the waveform. I am comparing the most negative value to the ground, and hoped to find a better method for grounding than averaging the first N points in the sample.

A: 

Have a look here. This will give you a mathematical way of trimming off those outliers.

http://en.wikipedia.org/wiki/Normal_distribution
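
One common way to apply that idea is sigma clipping: compute the mean and standard deviation, then drop points more than a few standard deviations from the mean and re-average. A minimal sketch of that approach (the function name and the 2-sigma default are illustrative choices, not from the answer):

#include <algorithm>
#include <cmath>
#include <vector>

// Average after discarding points more than nSigma standard deviations from the mean.
double SigmaClippedAverage(const std::vector<double>& values, double nSigma = 2.0)
{
    double sum = 0.0, sumSq = 0.0;
    for (double v : values) { sum += v; sumSq += v * v; }
    const double n = static_cast<double>(values.size());
    const double mean = sum / n;
    const double sigma = std::sqrt(std::max(0.0, sumSq / n - mean * mean));

    double clippedSum = 0.0;
    int count = 0;
    for (double v : values) {
        if (std::abs(v - mean) <= nSigma * sigma) {
            clippedSum += v;
            ++count;
        }
    }
    return count > 0 ? clippedSum / count : mean;
}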

Daniel A. White
A: 

A quick way might be to take the median, and then take the average of the numbers not so far off from the median.

"Not so far off" being dependent on your project.

Junier
A: 

A good rule of thumb for determining likely outliers is to calculate the Interquartile Range (IQR); any values more than 1.5*IQR away from the nearest quartile are then outliers.

This is the basic method many statistics systems (like R) use to automatically detect outliers.
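
A sketch of that rule, assuming the simple convention of reading Q1 and Q3 straight off a sorted copy of the data (the function name is illustrative):

#include <algorithm>
#include <vector>

// Average of the values not flagged as outliers by the 1.5*IQR rule.
double IQRFilteredAverage(std::vector<double> values)
{
    std::sort(values.begin(), values.end());
    const double q1 = values[values.size() / 4];          // lower quartile (simple index convention)
    const double q3 = values[(3 * values.size()) / 4];    // upper quartile
    const double lowCut  = q1 - 1.5 * (q3 - q1);
    const double highCut = q3 + 1.5 * (q3 - q1);

    double sum = 0.0;
    int count = 0;
    for (double v : values) {
        if (v >= lowCut && v <= highCut) {
            sum += v;
            ++count;
        }
    }
    return sum / count;   // the middle half of the data always survives, so count > 0
}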

Sean Nyman
A: 

Any statistically sound method, and a good way to approach it (Dark Eru, Daniel White), will be too computationally intense to repeat that many times; I think I've found a workaround that will allow later correction (meaning, leave it un-grounded).

Thanks for the suggestions. I'll look into them if I have time and want to see if their gain is worth the slowdown.

vgm64
Mind enlightening us on the method you plan to use?
a_m0d
A: 

Here's a quick and dirty method that I've used before (it works well if there are very few outliers at the beginning, and you don't have very complicated conditions for what constitutes an outlier).

The algorithm is O(N). The only really expensive part is the division.

The real advantage here is that you can have it up and running in a couple minutes.

double TrimmedRunningAverage(const double* Array, int N)
{
   // fractional deviation from the running average accepted as "non-outlier"
   const double percentDeviation = 0.3;
   double sumX = Array[0];   // initialize running sum and average with the first point
   double avgX = Array[0];   // (assumes the first point is not an outlier and values are positive)
   int count = 1;
   for (int i = 1; i < N; ++i) {
      const double x = Array[i];
      if (x < avgX + avgX * percentDeviation &&
          x > avgX - avgX * percentDeviation) {
         ++count;
         sumX += x;
         avgX = sumX / count;
      }
   }
   return avgX;
}
Colin
+1  A: 

Given that you are using ROOT you might consider looking at the TSpectrum classes which have support for extracting backgrounds from under an unspecified number of peaks...

I have never used them with so much baseline noise, but they ought to be robust.
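
For what it's worth, a minimal sketch of how TSpectrum's background estimator could be used here, assuming the waveform samples have been filled into a histogram (the function name and iteration count are placeholders):

#include "TH1.h"
#include "TSpectrum.h"

// Estimate the ground level of a waveform stored in a histogram by letting
// TSpectrum strip the pulse and then averaging the remaining background.
double EstimateBaseline(TH1 *hWave)
{
   TSpectrum spec;
   TH1 *hBkg = spec.Background(hWave, 20);   // returns a new histogram with the estimated background
   const double baseline = hBkg->Integral() / hBkg->GetNbinsX();
   delete hBkg;
   return baseline;
}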

BTW: what is the source of this data? The peak looks like a particle detector pulse, but the high level of background jitter suggests that you could really improve things with some fairly minor adjustments in the DAQ hardware, which might be better than trying to solve a difficult software problem.

Finally, unless you are restricted to some very primitive hardware (in which case why and how are you running ROOT?), if you only have a couple thousand such spectra you can afford a pretty slow algorithm. Or is that 2000 spectra per event and a high event rate?

dmckee
I've used TSpectrum in a later portion of my analysis; I thought earlier today about exploring its use for the waveforms. Good suggestion. Signal is from a Hamamatsu PMT and kludge base. DAQ through a Tektronix scope using LabVIEW -> output waveform to text. Analysis is done with ROOT after transfer of the text files. The ROOT script creates hists of pulse area and pulse height. 10,000 waveform captures (2000 sample pts) takes about an hour. Analysis is only on the order of minutes, so I have spare time; I'm just impatient.
vgm64
Yes, I know 2.7 waveforms/sec is a slow DAQ. =) There are many flaws to my setup, but it is getting the job done.
vgm64
Not a hardware chain I have experience with. Still: if the vertical scale is in Volts, the peak is pretty small. Can you throw more gain at it (best with more HV, but even in the scope might work)?
dmckee
I'm studying the pulse spectrum; I need to get the small and big pulses, so the range is set to not exclude either... but the resolution for small ones sucks. External amplification can't be used because it's a gain study of the PMTs.
vgm64
+1  A: 

If you can, maintain a sorted list; then you can easily chop off the head and the tail of the list each time you work out the average.

This is much like removing outliers based on the median (i.e., you're going to need two passes over the data: one to find the median, which is almost as slow as sorting for floating-point data, and the other to calculate the average), but requires less overhead at the time of working out the average, at the cost of maintaining a sorted list. Which one is fastest will depend entirely on your circumstances. It may be, of course, that what you really want is the median anyway!

If you had discrete data (say, bytes = 256 possible values), you could use 256 histogram 'bins' with a single pass over your data counting the values that go in each bin; then it's really easy to find the median / approximate the mean / remove outliers, etc. This would be my preferred option, if you could afford to lose some of the precision in your data, followed by maintaining a sorted list, if that is appropriate for your data.
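
A minimal sketch of the sort-and-trim version (the function name and the 10% trim fraction are placeholders):

#include <algorithm>
#include <vector>

// Trimmed mean: sort, drop `trimFraction` of the points from each end, average the rest.
double TrimmedMean(std::vector<double> values, double trimFraction = 0.1)
{
    std::sort(values.begin(), values.end());
    const size_t drop = static_cast<size_t>(values.size() * trimFraction);

    double sum = 0.0;
    size_t count = 0;
    for (size_t i = drop; i + drop < values.size(); ++i) {
        sum += values[i];
        ++count;
    }
    return count > 0 ? sum / count : 0.0;
}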

James
Sorting isn't a bad idea, and if I histogram these values I could very easily fit the Gaussian peak, the mean of which would be the ground value I'm after, but that may be too involved.
vgm64