ansaurus

Question

Efficient algorithm for detecting different elements in a collection

Answer 1

A:

Your edit gives good details; thanks.

Based on that I would presume a fairly well-behaved distribution of times (normal, or possibly gamma; depends on how close to zero your times get) for typical responses. Rejecting a sample from this distribution could be as simple as computing a standard deviation and seeing which samples lie more than n stdevs from the mean, or as complex as taking subsets which exclude outliers until your data settles down into a nice heap (e.g. the mean stops moving around 'much').
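
A minimal sketch of the simple end of that range, assuming the times for one trial sit in a plain double[]. The names here (StdevOutliers, outliers) are made up for illustration, and the cutoff n is yours to choose; this is not a tuned implementation.

```java
import java.util.ArrayList;
import java.util.List;

class StdevOutliers {
    /** Arithmetic mean of the samples. */
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    /** Sample standard deviation (n - 1 in the denominator). */
    static double stdev(double[] xs) {
        double m = mean(xs);
        double ss = 0.0;
        for (double x : xs) ss += (x - m) * (x - m);
        return Math.sqrt(ss / (xs.length - 1));
    }

    /** Indices of samples lying more than n standard deviations from the mean. */
    static List<Integer> outliers(double[] xs, double n) {
        double m = mean(xs);
        double s = stdev(xs);
        List<Integer> result = new ArrayList<Integer>();
        for (int i = 0; i < xs.length; i++) {
            if (Math.abs(xs[i] - m) > n * s) result.add(i);
        }
        return result;
    }
}
```

Calling StdevOutliers.outliers(times, 2.0) returns the positions of the suspicious samples. Keep in mind that a single huge value inflates the stdev itself, so with few samples a large n may never trigger.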

Now, you have an added wrinkle if you assume that a person who monkeys with one trial will monkey with another. So you're really trying to discriminate between a person who just happens to be fast (or slow) vs. one who is 'cheating'. You could do something like compute the stdev rank of each score (I forget the proper name for this: if a value is two stdevs above the mean, the score is '2'), and use that as your statistic.

Then, given this new statistic, there are some hypotheses you'll need to test. E.g., my suspicion is that the stdev of this statistic will be higher for cheaters than for someone who is just uniformly faster than other people--but you'd need data to verify that.

Good luck with it!

Alex Feinman 2010-02-24 14:57:38

Thank you. In fact, I think that is what ANOVA (ANalysis Of VAriance) does under the hood.

Guido 2010-02-24 15:36:14

Right, that thing. Been a while since stats class. So what is your question, then? Where a good ANOVA implementation can be found?

Alex Feinman 2010-02-24 20:08:14

Not really. The real problem is that ANOVA says there are differences, and I can even tell whether an element X is different from another element Y, but I don't know which of the two is the different one.

Guido 2010-02-24 22:09:32

Your distribution is well-behaved. So you can assume the outliers lie at the max or the min. Start pulling the outliers from the dataset, one by one, and recalculate the mean, until it stops moving so much, or until the change in stdev gets small.
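
Here is a rough sketch of that trimming loop. The stopping rule (quit once a removal shifts the mean by less than some fraction of its previous value) is just one way to read "stops moving so much", and the names IterativeTrim, trim and tolerance are invented for the example. It assumes positive values such as response times, so the mean is never zero.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IterativeTrim {
    /**
     * Repeatedly removes the extreme value (minimum or maximum) farthest from
     * the current mean, and stops once a removal would shift the mean by less
     * than the given fraction of its current value. Returns the removed values.
     */
    static List<Double> trim(double[] samples, double tolerance) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int lo = 0, hi = sorted.length - 1;   // inclusive bounds of the kept range
        double sum = 0.0;
        for (double x : sorted) sum += x;

        List<Double> removed = new ArrayList<Double>();
        while (hi - lo + 1 > 2) {
            int n = hi - lo + 1;
            double mean = sum / n;
            // In a sorted array the point farthest from the mean is at one end.
            boolean dropHigh = (sorted[hi] - mean) >= (mean - sorted[lo]);
            double candidate = dropHigh ? sorted[hi] : sorted[lo];
            double newMean = (sum - candidate) / (n - 1);
            if (Math.abs(newMean - mean) < tolerance * Math.abs(mean)) {
                break;                         // the mean has settled
            }
            removed.add(candidate);
            sum -= candidate;
            if (dropHigh) hi--; else lo++;
        }
        return removed;
    }
}
```

With a tolerance of 0.05, the loop stops as soon as dropping the next extreme would move the mean by less than 5%. Watching the change in stdev instead of the mean would follow the same pattern.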

Alex Feinman 2010-02-25 14:53:59

Answer 2

A:

You would have to run the paired t-test (or whatever pairwise test you want to implement) and then increment the counts in a hash where the key is the Person and the value is the number of times that person was different.

I guess you could also have an ArrayList that contains Person objects. Each Person object could store its ID and the count of times it was different. Implement Comparable and then you could sort the ArrayList by count.
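
Something along these lines, as a sketch only. I have swapped the Person/Comparable idea for a plain map plus a Comparator to keep it short, and PairwiseTest is a placeholder interface (not a library type) where you would plug in your t-test or any other pairwise test.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DifferenceCounter {
    /** Placeholder for any pairwise test; returns true when the samples differ. */
    interface PairwiseTest {
        boolean differ(double[] a, double[] b);
    }

    /** Counts, for each person, how many other people they differ from. */
    static Map<String, Integer> countDifferences(Map<String, double[]> samples, PairwiseTest test) {
        List<String> people = new ArrayList<String>(samples.keySet());
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String p : people) counts.put(p, 0);
        for (int i = 0; i < people.size(); i++) {
            for (int j = i + 1; j < people.size(); j++) {
                String a = people.get(i), b = people.get(j);
                if (test.differ(samples.get(a), samples.get(b))) {
                    counts.put(a, counts.get(a) + 1);
                    counts.put(b, counts.get(b) + 1);
                }
            }
        }
        return counts;
    }

    /** People ordered from most to fewest differences, ties broken by name. */
    static List<String> rank(final Map<String, Integer> counts) {
        List<String> people = new ArrayList<String>(counts.keySet());
        Collections.sort(people, new Comparator<String>() {
            public int compare(String x, String y) {
                int byCount = counts.get(y).compareTo(counts.get(x));
                return byCount != 0 ? byCount : x.compareTo(y);
            }
        });
        return people;
    }
}
```

The person at the head of the ranked list is the one who differed from the most others. The pairwise loop is O(n^2) in the number of people.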

TheSteve0 2010-02-24 20:51:32

Answer 3

A:

If the items in both lists are sorted in numerical order, you can walk the two lists simultaneously, and any differences can easily be recognized as insertions or deletions. For example:

List A    List B
  1         1       // Match, increment both pointers
  3         3       // Match, increment both pointers
  5         4       // '4' missing in list A. Increment B pointer only.

List A    List B
  1         1       // Match, increment both pointers
  3         3       // Match, increment both pointers
  4         5       // '4' missing in list B (or added to A). Incr. A pointer only.
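
The walk above could be sketched like this, assuming both inputs are int arrays already sorted in ascending order. SortedDiff and the "A:"/"B:" labels are only illustrative names.

```java
import java.util.ArrayList;
import java.util.List;

class SortedDiff {
    /**
     * Walks two ascending arrays in step and reports every value present in
     * only one of them. Entries are formatted as "A:<value>" or "B:<value>",
     * naming the list that holds the value.
     */
    static List<String> diff(int[] a, int[] b) {
        List<String> out = new ArrayList<String>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {          // match, increment both pointers
                i++;
                j++;
            } else if (a[i] < b[j]) {    // value only in A, increment A pointer
                out.add("A:" + a[i]);
                i++;
            } else {                     // value only in B, increment B pointer
                out.add("B:" + b[j]);
                j++;
            }
        }
        // Whatever remains in either list has no counterpart in the other.
        while (i < a.length) out.add("A:" + a[i++]);
        while (j < b.length) out.add("B:" + b[j++]);
        return out;
    }
}
```

Each element is visited once, so the comparison is linear in the combined length of the two lists.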

Scott Smith 2010-02-24 21:19:36

Answer 4

+2 A:

Just in case anyone is interested in the final code, it uses Apache Commons Math for the statistical operations and Trove to work with collections of primitive types.

It looks for the element(s) with the highest degree (the idea is based on comments made by @Pace and @Aniko, thanks).

I think the final algorithm is O(n^2); suggestions are welcome. It should work for any problem involving one qualitative vs. one quantitative variable, assuming normality of the observations.

import gnu.trove.iterator.TIntIntIterator;
import gnu.trove.map.TIntIntMap;
import gnu.trove.map.hash.TIntIntHashMap;
import gnu.trove.procedure.TIntIntProcedure;
import gnu.trove.set.TIntSet;
import gnu.trove.set.hash.TIntHashSet;

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.math.MathException;
import org.apache.commons.math.stat.inference.OneWayAnova;
import org.apache.commons.math.stat.inference.OneWayAnovaImpl;
import org.apache.commons.math.stat.inference.TestUtils;


public class TestMath {
    private static final double SIGNIFICANCE_LEVEL = 0.001; // 99.9%

    public static void main(String[] args) throws MathException {
        double[][] observations = {
           {150.0, 200.0, 180.0, 230.0, 220.0, 250.0, 230.0, 300.0, 190.0 },
           {200.0, 240.0, 220.0, 250.0, 210.0, 190.0, 240.0, 250.0, 190.0 },
           {100.0, 130.0, 150.0, 180.0, 140.0, 200.0, 110.0, 120.0, 150.0 },
           {200.0, 230.0, 150.0, 230.0, 240.0, 200.0, 210.0, 220.0, 210.0 },
           {200.0, 230.0, 150.0, 180.0, 140.0, 200.0, 110.0, 120.0, 150.0 }
        };

        final List<double[]> classes = new ArrayList<double[]>();
        for (int i=0; i<observations.length; i++) {
            classes.add(observations[i]);
        }

        OneWayAnova anova = new OneWayAnovaImpl();
//      double fStatistic = anova.anovaFValue(classes); // F-value
//      double pValue = anova.anovaPValue(classes);     // P-value

        boolean rejectNullHypothesis = anova.anovaTest(classes, SIGNIFICANCE_LEVEL);
        System.out.println("reject null hypothesis " + (100 - SIGNIFICANCE_LEVEL * 100) + "% = " + rejectNullHypothesis);

        // differences are found, so make t-tests
        if (rejectNullHypothesis) {
            TIntSet aux = new TIntHashSet();
            TIntIntMap fraud = new TIntIntHashMap();

            // i vs j unpaired t-tests - O(n^2)
            for (int i=0; i<observations.length; i++) {
                for (int j=i+1; j<observations.length; j++) {
                    boolean different = TestUtils.tTest(observations[i], observations[j], SIGNIFICANCE_LEVEL);
                    if (different) {
                        // aux records an element's first difference;
                        // counting in fraud starts from the second one
                        if (!aux.add(i)) {
                            fraud.adjustOrPutValue(i, 1, 1);
                        }
                        if (!aux.add(j)) {
                            fraud.adjustOrPutValue(j, 1, 1);
                        }
                    }
                }
            }

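            // A hash map has no ordering, so scan the values for the highest degree
            int highest = 0;
            for (int count : fraud.values()) {
                highest = Math.max(highest, count);
            }
            final int max = highest;
            // Keep only those with the highest degree
            fraud.retainEntries(new TIntIntProcedure() {
                @Override
                public boolean execute(int a, int b) {
                    return b == max;
                }
            });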

            // If more than half of the elements are different
            // then they are not really different (?)
            if (fraud.size() > observations.length / 2) {
                fraud.clear();
            }

            // output
            TIntIntIterator it = fraud.iterator();
            while (it.hasNext()) {
                it.advance();
                System.out.println("Element " + it.key() + " has significant differences");             
            }
        }
    }
}

Guido 2010-02-24 23:01:39
