I have a program that I'm porting from one language to another. I'm doing this with a translation program that I'm developing myself. The relevant consequence is that I expect there are a number of bugs in my system that I'm going to need to find and fix. Each bug is likely to manifest in many places, and fixing it will fix the bug in all the places it shows up. (I feel like I have a really big lever and I'm pushing on the short end: I push really hard, but when things move they move a lot.)
I have the ability to run execution log diffs, so I'm measuring my progress by how far through the test suite the port can run before its execution diverges from the original program's. (Thank [whatever you want] for BeyondCompare; it works reasonably well with ~1M line files :D)
The question is: What shape should I expect to see if I were to plot that run length as a function of time? (More time == more bugs removed.)
My first thought is something like a Poisson distribution. However, because fixing each bug also removes all of its other occurrences, that shouldn't be quite correct.
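Here's a rough Python sketch of the process I have in mind, just to make the model concrete. The bug count, the number of places each bug shows up, and the uniform placement are all assumptions made up for the simulation, not measurements from my real system:

```python
import random

def simulate(num_bugs=20, occurrences_per_bug=50, log_length=10000, seed=0):
    """Simulate the debug loop: each distinct bug shows up at many random
    positions in the execution log; one fix removes every occurrence of it."""
    rng = random.Random(seed)
    # Scatter each bug's occurrences uniformly over the log (assumption);
    # only the earliest occurrence of each bug matters for the run length.
    first_occurrence = {
        bug: min(rng.randrange(log_length) for _ in range(occurrences_per_bug))
        for bug in range(num_bugs)
    }

    fixed = set()
    run_lengths = []
    for _ in range(num_bugs + 1):
        remaining = [(pos, bug) for bug, pos in first_occurrence.items()
                     if bug not in fixed]
        if not remaining:
            run_lengths.append(log_length)  # clean run: no bugs left
            break
        divergence, culprit = min(remaining)
        run_lengths.append(divergence)  # how far the diff gets this time
        fixed.add(culprit)              # fixing it removes all its occurrences
    return run_lengths

print(simulate())  # run length after 0, 1, 2, ... fixes
```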
(BTW, this might have real-world implications with regard to estimating when programs will finish being debugged.)
Edit: A more abstract statement of the problem:
Given an ordered list of N integers selected from the range [0, M] (where N >> M), where each number's occurrences are distributed uniformly over the positions in the list but the numbers themselves don't necessarily appear with uniform frequency: what is the expected position of the last “new” number (the last value to make its first appearance)? What about the second-to-last? Etc.?
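And a minimal Monte Carlo sketch of the abstract version. The particular value frequencies (weight 1/(v+1) for value v) are made up just so the numbers aren't uniformly distributed; swap in whatever distribution you like:

```python
import random

def expected_new_positions(N=100_000, M=20, trials=100, seed=0):
    """Monte Carlo estimate of the expected position of the 1st, 2nd, ...,
    last 'new' (first-seen) value in a list of N draws from [0, M]."""
    rng = random.Random(seed)
    # Assumed non-uniform value frequencies: value v gets weight 1/(v+1).
    weights = [1.0 / (v + 1) for v in range(M + 1)]
    totals = [0.0] * (M + 1)
    for _ in range(trials):
        draws = rng.choices(range(M + 1), weights=weights, k=N)
        first_seen = {}
        for pos, val in enumerate(draws):
            if val not in first_seen:
                first_seen[val] = pos
        order = sorted(first_seen.values())
        order += [N] * (M + 1 - len(order))  # values that never showed up
        for k, pos in enumerate(order):
            totals[k] += pos
    return [t / trials for t in totals]

positions = expected_new_positions()
print("expected position of the last new value:", positions[-1])
print("second to last:", positions[-2])
```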