Maybe a good method would be to ignore any results that are more than some defined value outside the current running average?
The definition of what constitutes an abnormal value must scale to the data itself. The classic way to do this is to calculate the z score of each data point and throw out any values more than 3 z scores from the average. The z score is the difference between the data point and the average, divided by the standard deviation.
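A minimal sketch of that z-score test in Python (the function name and the 3-z cutoff are only illustrative, and at least two samples are assumed so the standard deviation is defined):

```python
from statistics import mean, stdev

def reject_outliers(samples, z_threshold=3.0):
    """Keep only the samples whose z score is within the threshold."""
    mu = mean(samples)
    sigma = stdev(samples)          # sample standard deviation; needs >= 2 points
    if sigma == 0:
        return list(samples)        # no spread, nothing to reject
    return [x for x in samples if abs(x - mu) / sigma <= z_threshold]
```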
You need some idea of the expected variation or distribution if you want to exclude certain (higher) instances of variation as erroneous. For instance, if you can approximate the distribution of the "average times" result with a normal (Gaussian) distribution, then you can do what ojblass suggested and exclude results whose deviation from the mean is greater than some multiple of the standard deviation (which can be calculated on the fly alongside your running average). If you want to exclude only values that fall outside roughly 99.7 percent of a normal distribution, exclude those that vary more than 3 standard deviations from the mean; if 95 percent certainty is enough, exclude those that vary more than 2 standard deviations, and so on.
I'm sure a little googling for "standard deviation" or "Gaussian distribution" will help you. Of course, this assumes that you expect a normal distribution of results; you might not, in which case the first step would be to guess at what distribution you do expect.
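If the results arrive as a stream, the mean and standard deviation can be maintained on the fly as suggested above; here is a small sketch using Welford's online algorithm (the class and method names are my own, not from the question):

```python
import math

class RunningStats:
    """Incrementally tracks mean and standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0                 # sum of squared deviations from the current mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_outlier(self, x, k=3.0):
        """True if x lies more than k standard deviations from the running mean."""
        sigma = self.stddev()
        return sigma > 0 and abs(x - self.mean) > k * sigma
```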
If that example graph you have is typical, then any of the criteria you list will work. Most of those statistical methods are for riding the edge of errors right at the fuzzy level of "is this really an error?" But your problem looks wildly simple: your errors are not just a couple of standard deviations from the norm, they're 20+ out. This is good news for you.
So, use the simplest heuristic. Always accept the first 5 points or so to prevent a startup spike from ruining your computation. Maintain a running mean and standard deviation. If a data point falls more than 5 standard deviations from the mean, discard it and repeat the previous data point as a filler.
If you know your typical data behavior in advance, you may not even need to compute the mean and standard deviation; you can hardwire absolute "reject" limits. This is actually better, in that an initial error won't blow up your detector.
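A rough sketch of that heuristic, assuming the samples arrive one at a time (the warm-up count, the 5-sigma cutoff, and all names are illustrative):

```python
import math

def filter_stream(samples, warmup=5, k=5.0):
    """Accept the first `warmup` points unconditionally, then reject k-sigma
    spikes, repeating the last accepted point as a filler."""
    accepted = []
    n, mean, m2 = 0, 0.0, 0.0                      # Welford accumulators
    for x in samples:
        sigma = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
        if n >= warmup and sigma > 0 and abs(x - mean) > k * sigma:
            x = accepted[-1]                       # discard the spike, repeat previous point
        accepted.append(x)
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return accepted
```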
The naive (and possibly best) answer to the bootstrapping question is "Accept the first N values without filtering." Choose N to be as large as possible while still allowing the setup time to be "short" in your application. In this case, you might consider using the window width (64 samples) for N.
Then I would go with some kind of mean-and-sigma-based filter, for example something like the sketch below.
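Something like this, where N matches the 64-sample window and the 3-sigma cutoff is an illustrative choice (the function assumes at least N samples are available):

```python
from statistics import mean, stdev

def bootstrap_filter(samples, n_bootstrap=64, k=3.0):
    """Accept the first n_bootstrap samples unfiltered, then apply a mean/sigma test."""
    kept = list(samples[:n_bootstrap])
    mu, sigma = mean(kept), stdev(kept)
    for x in samples[n_bootstrap:]:
        if sigma == 0 or abs(x - mu) <= k * sigma:
            kept.append(x)
            mu, sigma = mean(kept), stdev(kept)    # refresh the stats with accepted data only
    return kept
```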
I would compute a running median (a robust alternative to the mean) and a running MAD, the median absolute deviation (a robust alternative to the standard deviation), and remove everything that is more than 5 MADs away from the median; see http://epp.eurostat.ec.europa.eu/portal/page/portal/research_methodology/documents/S4P1_MIRROROUTLIERDETECTION_LIAPIS.pdf for background.
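For illustration, one way the median/MAD approach might look, using a sliding window as the "running" estimate (the window size and names are assumptions; the 5-MAD cutoff follows the text):

```python
from collections import deque
from statistics import median

def mad_filter(samples, window=64, k=5.0):
    """Drop samples that lie more than k MADs from the median of a recent window."""
    recent = deque(maxlen=window)
    kept = []
    for x in samples:
        if len(recent) >= 3:
            med = median(recent)
            mad = median(abs(v - med) for v in recent)
            if mad > 0 and abs(x - med) > k * mad:
                continue                          # outlier: drop it and keep it out of the window
        kept.append(x)
        recent.append(x)
    return kept
```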