Hi All,

I am sure there are a lot of Software Testing Engineers and Algorithm Validation Engineers on Stack Overflow. Could someone please tell me how one would proceed in the following scenario?

Say we have a mammogram and 5 different algorithms which take this mammogram as input and identify whether there is cancer in the patient. If 3 out of 5 algorithms say there is cancer and 2 say there is no cancer, which algorithm should I believe? How should I proceed with testing these algorithms? Is there any statistical concept used in such scenarios?

I was asked this question in an interview for an Algorithm Validation Engineer position. I believe they were trying to see how I would think given such a scenario. How should I have answered this?

Thanks for your time

-Sashi

+7  A: 

You can't say anything with only this information. What if some of the 5 algorithms reuse parts of the others? Then they could be prone to the same defects.

Say A, B and C actually use the same sub-algorithm for preprocessing the data, and that sub-algorithm gives suboptimal results on some specific image, so the preprocessed image causes the later phases to produce wrong results - it won't matter that you have three algorithms saying the same thing.

You need more specific data on how the algorithms correlate and what statistical characteristics are known about their error rates to be able to perform any analysis.

sharptooth
But, this is an interview, so he _can_ ask that, and it shows a lot to the interviewer.
Stephen
+4  A: 

This is actually pretty tough to answer. I'm sure each algorithm is good at picking up different types of input triggers. More than likely, you'll need some statistical analysis to determine what each algorithm will usually detect as cancer. Additionally, you may go so far as to build something like a Bayesian model to describe/determine whether a patient has cancer based on the algorithmic results.

You may find that 3 algorithms consistently miss the specific type of cancer that the other two are moderately good at picking up. You may also discover recurring relationships, such as: when algorithms 2, 3, and 5 say there is no cancer, algorithm 1 says there is, and algorithm 4 is inconclusive, there are usually typically-benign spots of a certain shape and intensity that should be analyzed but probably aren't cancer.
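
As a rough sketch of the kind of analysis that could uncover such relationships - assuming you have a set of past mammograms with confirmed diagnoses and the five recorded yes/no verdicts for each (all column names and numbers here are made up):

    import pandas as pd

    # Hypothetical labeled history: one row per mammogram, with each
    # algorithm's yes/no verdict and the confirmed diagnosis.
    df = pd.DataFrame({
        "alg1":   [1, 0, 1, 1, 0, 1],
        "alg2":   [1, 0, 0, 1, 0, 1],
        "alg3":   [0, 0, 1, 1, 1, 1],
        "alg4":   [1, 1, 0, 1, 0, 0],
        "alg5":   [0, 0, 1, 1, 0, 1],
        "cancer": [1, 0, 1, 1, 0, 1],   # confirmed ground truth
    })

    # Group identical answer patterns and see how often each pattern
    # actually corresponded to cancer.
    pattern_stats = (
        df.groupby(["alg1", "alg2", "alg3", "alg4", "alg5"])["cancer"]
          .agg(["mean", "count"])
          .rename(columns={"mean": "cancer_rate", "count": "n_cases"})
    )
    print(pattern_stats)

With enough data, a pattern like "2, 3 and 5 say no but 1 says yes" gets its own empirical cancer rate, which is exactly the kind of relationship described above.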

San Jacinto
+1  A: 

I would have asked if using a computer to determine whether someone has cancer is the correct course of action, given that the use of algorithms is error prone.

But, if for some reason a set of algorithms must be used, then have a human operator (i.e., a doctor) examine the mammogram personally in cases where there is some uncertainty. The doctor can then decide if further tests are warranted, based on the disagreement of the algorithms used.

The one thing we overlook as programmers is that humans can solve some problems we cannot predict; what if the doctor notices something in the mammogram that the algorithms were not designed to detect?

Charlie Salts
And the use of human operators is not error-prone? :-)
Ken
While image processing software is getting better all the time, humans are currently superior, but slower. We have a knack for finding visual patterns that seems to be wired into our brains. That is why groups like Galaxy Zoo use human volunteers instead of computers to analyze images.
Jim C
@Jim C: Exactly. There's a trade-off. Using a camera + software to find bad widgets as they come off the assembly line is a lot different than using software to identify cancer. For medical applications, I'd rather be 100% correct and slower, than 99% correct and super fast. In cases where making a diagnosis faster than a doctor could mean life or death, I'd be willing to let the computer do the work.
Charlie Salts
Nice. I want to get me one of them 100% correct doctors. :-)
Ken
@Ken: You'll find them in the same place you'll find bug-free software, bug-free compilers, and 100% correct programmers. I'm not saying doctors are perfect - for accuracy, though, I'm still relying on a doctor to make the final decision.
Charlie Salts
+1  A: 

I think that if you had some statistical information about each algorithm's previous performance (how many times it was right/wrong over a number of statistical experiments), then you could calculate the probability of being right for each algorithm. Then you could somehow combine those probabilities to get the odds of that person having cancer. Just a speculation...
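
One hedged way to make that speculation concrete - assuming each algorithm's historical accuracy is known and that their errors are roughly independent, which is a strong assumption - is a log-odds weighted vote (the accuracies below are invented):

    import math

    # Hypothetical historical accuracies for the five algorithms.
    accuracy = {"A": 0.90, "B": 0.85, "C": 0.70, "D": 0.65, "E": 0.95}
    votes    = {"A": 1, "B": 1, "C": 1, "D": 0, "E": 0}  # 1 = "cancer"

    # An algorithm that is right 90% of the time gets a larger say
    # than one that is right only 65% of the time.
    score = sum(
        (1 if votes[name] == 1 else -1) * math.log(acc / (1 - acc))
        for name, acc in accuracy.items()
    )
    print("combined verdict:", "cancer" if score > 0 else "no cancer")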

calin014
+2  A: 

Well, clearly false negatives are far more serious here than false positives, so all things being equal, we may want to show a preference for algorithms that find more cancer.

If we feed many more mammograms to the software and find that a subset of the algorithms agree across a large sample of the mammograms, then we may want to prefer those algorithms, since their results are supported by more algorithms.

Something like that.

Ishmael
+2  A: 

Everything else being equal, you could say that the patient has a 60% chance of cancer. To give a better answer you need to know more about how the algorithms work. Some points to consider:

  • Perhaps some algorithms are newer than others, or have been proven to be less reliable. It would be good to know the accuracy of each algorithm, using historical mammogram data tagged as "Cancerous" and "Non-Cancerous" (see the sketch after this list).
  • Each person's cancer is slightly different - maybe there are characteristics that a certain algorithm is better at identifying? Is a domain expert required in order to determine which diagnosis is correct, based on the algorithm conclusions and the mammogram (image?) data?
  • As sharptooth mentioned, perhaps certain algorithms use the same techniques as other algorithms, so both are likely to have the same bias.
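
A minimal sketch of the accuracy check described in the first bullet, assuming you have historical mammograms with confirmed diagnoses and each algorithm's stored verdicts (the data and names below are hypothetical):

    def sensitivity_specificity(predictions, truth):
        # predictions, truth: parallel lists of 0/1, where 1 means "cancer"
        tp = sum(p == 1 and t == 1 for p, t in zip(predictions, truth))
        tn = sum(p == 0 and t == 0 for p, t in zip(predictions, truth))
        fn = sum(p == 0 and t == 1 for p, t in zip(predictions, truth))
        fp = sum(p == 1 and t == 0 for p, t in zip(predictions, truth))
        return tp / (tp + fn), tn / (tn + fp)

    # Toy historical data: confirmed diagnoses and one algorithm's verdicts.
    truth       = [1, 1, 0, 0, 0, 1, 0, 0]
    algorithm_1 = [1, 0, 0, 1, 0, 1, 0, 0]

    sens, spec = sensitivity_specificity(algorithm_1, truth)
    print("sensitivity = %.2f, specificity = %.2f" % (sens, spec))

Reporting sensitivity and specificity separately (rather than a single accuracy number) matters here, because missing a cancer and flagging a healthy patient are very different mistakes.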
Justin Ethier
+1  A: 

To accomplish much in such a situation, you generally want to have a "gold" standard -- e.g., an opinion from a physician about whether a set of mammograms shows cancer, or use historical information where you know that one set of mammograms show cancer and another set does not. Along with that, if possible, you'd like information about what indicators each algorithm is using in a particular case.

With a standard, you can start to get an estimate of which algorithm(s) is/are more "accurate" (i.e., agree with the expert opinion the most often). Information about indicators allows you to generate more detail about the times and circumstances under which each appears to be more or less accurate, so you can start to form a judgement about times/circumstances under which to trust one over another. With this, you can (at least hope to) combine the results from your five existing algorithms into a single overall result that's (with care and perhaps a bit of luck) more accurate than any of them individually.

Jerry Coffin
+1  A: 

Basically, if you know that the results of the algorithms are conditionally independent (i.e. independent given the true but unknown class label) then using Naive Bayes is an optimal meta-classifier.

Otherwise, this question is unanswerable without knowing the conditional dependency structure among the classifiers. For example, if classifiers A, B, C and D are weak, identical classifiers (i.e. they always give the same results) with accuracies of 0.51, while classifier E is conditionally independent of classifiers A, B, C and D and has an accuracy of 0.99, then I think it's pretty obvious that voting is a bad idea.
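
A sketch of what that Naive Bayes combination looks like in practice, assuming you have already estimated each algorithm's sensitivity and specificity from labeled data (every number below is invented for illustration):

    # Naive Bayes combination of five yes/no classifiers, assuming their
    # errors are conditionally independent given the true diagnosis.
    # (sensitivity, specificity) per algorithm -- hypothetical estimates.
    performance = {
        "A": (0.80, 0.90),
        "B": (0.75, 0.88),
        "C": (0.85, 0.80),
        "D": (0.70, 0.95),
        "E": (0.90, 0.85),
    }
    prior_cancer = 0.001          # e.g. 1 in 1000 in the screened population
    votes = {"A": 1, "B": 1, "C": 1, "D": 0, "E": 0}   # 3 "yes", 2 "no"

    p_cancer, p_healthy = prior_cancer, 1 - prior_cancer
    for name, (sens, spec) in performance.items():
        if votes[name] == 1:              # algorithm said "cancer"
            p_cancer  *= sens             # true positive likelihood
            p_healthy *= 1 - spec         # false positive likelihood
        else:                             # algorithm said "no cancer"
            p_cancer  *= 1 - sens         # false negative likelihood
            p_healthy *= spec             # true negative likelihood

    posterior = p_cancer / (p_cancer + p_healthy)
    print("P(cancer | votes) = %.4f" % posterior)

Note how the prior matters: with a 1-in-1000 base rate and these error rates, three "cancer" votes still leave the posterior well below 50%, which is exactly why blind majority voting can mislead.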

dsimcha
+2  A: 

It is not a trivial problem, and highly dependent on what risks you are willing to take.

Formalisms like decision theory and Bayesian inference are really worth considering here. They allow you to take into account the various probabilities of false positives and false negatives, and whether you want to weigh them differently.
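
To make that weighting concrete, here is a minimal expected-cost decision rule; the cost numbers are purely illustrative placeholders, not something the data can provide:

    # Decide on follow-up by comparing the expected costs of the two actions.
    COST_FALSE_NEGATIVE = 1000.0   # illustrative: missing a real cancer
    COST_FALSE_POSITIVE = 10.0     # illustrative: unnecessary follow-up

    def decide(p_cancer):
        cost_if_refer   = (1 - p_cancer) * COST_FALSE_POSITIVE
        cost_if_dismiss = p_cancer * COST_FALSE_NEGATIVE
        return "refer for follow-up" if cost_if_refer <= cost_if_dismiss else "no follow-up"

    print(decide(0.02))     # a 2% posterior already justifies follow-up at these costs
    print(decide(0.0001))   # a tiny posterior does not

With this framing, the threshold for acting is set by the cost ratio, not by a fixed 50% or a 3-out-of-5 vote.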

Alexandre C.
+1  A: 

Since the algorithms produce a "yes" or "no" answer, this is pretty easy. You need actual test data to compare your algorithms against. You should probably gather long-term data on the success rates of the various heuristics, and do some statistical analysis of which ones are more likely to be right.

Validating things like Google's search algorithm - which has no "right" answer - would be harder.

Dean J
+1  A: 

Go back and look at the trend data for each of the algorithms. In the past, how often has A been correct, B been correct, etc.? The typical route from here would be to run all of the algorithms and apply a Bayesian weighting system, but I think this approach is far too general because it depends heavily on the quality of the source data.

Since each algorithm excels with a specific type of input, a more specialized approach would be to create a filter that screens the source data for markers that cater to a particular algorithm's specialty. For example, if the source comes from an older machine, you don't want an algorithm that deals poorly with image noise analyzing it. A specialist in mammogram technology would be a great asset to help identify more specific markers. Perhaps you could apply a weighting system after this filtration process to provide a better estimate of certainty.

Peter Hanneman
+2  A: 

I don't think you should have answered in any specific way. The interviewer would probably want to analyze how you would evaluate this problem, not your final answer. In other words, they were probably interested in your own algorithm for making a decision.

In a real-life environment, I can't think of any serious way to choose between 5 algorithms to find cancer, especially when they give such different results.

Paulo Guedes
+2  A: 

This is a good opportunity for implementing what is sometimes called an "expert system". You take a large sample of your data (in your case, mammogram images and the output of the various algorithms) and run it past a series of real-life, flesh and blood experts in the field (here, oncologists or lab techs). Record the responses for each image along with the outputs of the algorithms. At the end, you should have enough data to map algorithm output to expert output. To verify that your mapping works, run a bunch of test images through your system (samples that were not part of the original data set) and ask your panel of experts to double-check the results. Ideally, the experts should agree with your system's output a very high percentage of the time.

Without knowing anything about the algorithms themselves, it's difficult to make a decision based on 3 "yes" and 2 "no" results (especially for something as important as a cancer screening). Getting as close as possible to the same results as a trained expert is your goal (at least at first), and systems like this can sometimes be made more accurate by basing decisions on the knowledge and experience of experts in the field rather than on mathematical algorithms alone.
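
One hedged way to learn such a mapping is to fit a simple model from the five algorithm outputs to the expert labels; this sketch uses scikit-learn's logistic regression purely for illustration, and all the data is invented:

    from sklearn.linear_model import LogisticRegression

    # Each row: the five algorithms' yes/no outputs for one mammogram.
    algorithm_outputs = [
        [1, 1, 0, 1, 0],
        [0, 0, 0, 1, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 1, 0, 0],
        [1, 0, 1, 1, 1],
        [0, 0, 0, 0, 0],
    ]
    # The expert panel's verdicts for the same mammograms (1 = cancer).
    expert_labels = [1, 0, 1, 0, 1, 0]

    # Learn how to weigh the five opinions so as to reproduce the experts.
    model = LogisticRegression()
    model.fit(algorithm_outputs, expert_labels)

    # New case: 3 "yes" votes and 2 "no" votes.
    print(model.predict_proba([[1, 1, 1, 0, 0]])[0][1])

As the answer says, the real check is then to run held-out images through the fitted model and measure how often the expert panel agrees with its output.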

bta
+3  A: 

Put your interviewee hat on: it's a psychological assessment. Questions such as this algorithm assessment have more than one correct answer. I learned about these questions from my wife, who worked as a recruiter for 5+ years. The interviewer wants to see how you react. It's best to just make assumptions and drive to a logical conclusion. Do not say "I don't know", become argumentative, or ask a ton of questions. You will appear difficult and argumentative (as many programmers are).

Now that you know this is not a programming question, consider asking about it on careeroverflow.com. I like these questions because they show an ability to adapt and stay non-rigid.

Why is a manhole round? <--Microsoft's version

P.Brian.Mackey
A: 

Based on the information given, you would not be able to answer. You would have to take all 5 algorithms and test them on patients diagnosed with cancer, and also on ones known to be cancer-free. This would allow you to determine which algorithm is the most accurate.

You could also make a sixth algorithm from the 5 (presuming they are all good and valid algorithms) by taking the side of whichever answer gets more votes. This may or may not be a valid sixth algorithm, depending on how good the first 5 are.
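
That voting "sixth algorithm" is a one-liner; whether it is any good still has to be checked against the same labeled test cases (a minimal sketch):

    def majority_vote(verdicts):
        # verdicts: list of 0/1 outputs from the five algorithms (1 = cancer)
        return 1 if sum(verdicts) > len(verdicts) / 2 else 0

    print(majority_vote([1, 1, 1, 0, 0]))   # -> 1 ("cancer"), by 3 votes to 2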

Matthew
+4  A: 

Choosing the best classifier for a job, or combining various classifiers, is a whole separate field. This general article about classification is as good a start as any for learning how to choose the best classifier for a job. And this article about classifier ensembles is a good place to start learning about classifier combining.

To give the basis for an answer to your (quite broad) question: The best classifier for a job depends on several factors:

  • The required quality of classification (in your case this would be high)
  • The allowed complexity of the classification (i.e. can you compute for days to reach an answer, or do you have only a few milliseconds?) (time is probably not a limitation in your case, I'd guess)
  • The cost associated with misclassification. This is a very important factor in your case. If you tell people they have cancer when they do not, you cause great stress, but (one hopes) further testing (which costs money) will eventually find out they are healthy. On the other hand, if you miss cancer in a patient, she could die. This means that the "best" classifier (the one that makes the fewest errors) may not be the best for your problem.

On that last point: say 1 in 1000 women has cancer, and I have some classifiers:

  1. Misses 20% of cancer cases, and says a healthy woman has cancer in 2% of cases. This classifier would make about 200 errors in a population of 10000 people (2 missed cancers plus roughly 200 false alarms).
  2. Just says "This person does not have cancer" in all cases. Only 10 errors in 10000 cases!
  3. Just says "This person has cancer" in all cases; it will make 9990 errors in 10000 cases.

The second classifier makes the lowest number of errors, but after a few months of using it, people who could have been saved start dying. The third classifier sends everybody on for the next test (which would have the same problem as this one), or maybe it inflicts a useless life-changing operation on 9990 healthy people. The first classifier makes a trade-off: two people might become very sick or even die, and 198 people experience painful and stressful procedures and operations for naught.

Obviously, in your case all the classifiers were like classifier 1, with slight variations in the percentages. In those cases you have to make a trade-off between missing cancer cases and inflicting the rest of the procedure (including cost!) on healthy people. The starting point for research about this trade-off is the receiver operating characteristic (ROC) curve.
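
The arithmetic behind those error counts, written out; the cost figures at the end are illustrative only, to show how a cost-weighted comparison can flip the ranking:

    population = 10000
    n_sick = int(population * (1 / 1000))     # 10 women with cancer
    n_healthy = population - n_sick           # 9990 healthy women

    # (miss rate on sick patients, false-alarm rate on healthy patients)
    classifiers = {
        "classifier 1":          (0.20, 0.02),
        "always says no cancer": (1.00, 0.00),
        "always says cancer":    (0.00, 1.00),
    }

    COST_MISS = 1000        # illustrative: a missed cancer is far worse...
    COST_FALSE_ALARM = 10   # ...than an unnecessary follow-up

    for name, (miss_rate, fa_rate) in classifiers.items():
        misses = n_sick * miss_rate
        false_alarms = n_healthy * fa_rate
        errors = misses + false_alarms
        cost = misses * COST_MISS + false_alarms * COST_FALSE_ALARM
        print("%s: %.0f errors, weighted cost %.0f" % (name, errors, cost))

Counting raw errors, "always says no cancer" wins (10 errors); once misses are weighted more heavily, classifier 1 comes out ahead. That shift is exactly the trade-off the ROC curve lets you explore systematically.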

jilles de wit