views:

551

answers:

6

Hi,

I'm trying to optimize a piece of software which is basically running millions of tests. These tests are generated in such a way that there can be some repetitions. Of course, I don't want to spend time running tests which I already ran if I can avoid it efficiently.

So, I'm thinking about using a Bloom filter to store the tests which have been already ran. However, the Bloom filter errs on the unsafe side for me. It gives false positives. That is, it may report that I've ran a test which I haven't. Although this could be acceptable in the scenario I'm working on, I was wondering if there's an equivalent to a Bloom filter, but erring on the opposite side, that is, only giving false negatives.

I've skimmed through the literature without any luck.

A: 

No and if you think about it, it wouldn't be very useful. In your case you couldn't be sure that your test run would ever stop, because if there are always 'false negatives' there will always be tests that need to be run...

I would say you just have to use a hash.

f3lix
Thanks for your reply.I think it's still useful as I can always stop after a fixed amount of time. In fact I can keep on generating tests forever. But such a data structure will help me to ensure that most tests are in fact new ones without running out of memory fast.
+3  A: 

Is it possible to store the tests that you did not run? This should inverse the filter's behavior.

Tobiesque
You can't do that because it's impossible to take elements out of a Bloom filter.
RarrRarrRarr
A counting bloom filter could work here
Andy Lynch
A: 

How about an LRUCache?

A: 

I'm sorry I'm not much help - I don't think its possible. If test execution can't be ordered maybe use a packed format (8 tests per byte!) or a good sparse array library for storing the outcomes in memory.

Einstein
A: 

I think you're leaving out part of the solution; to avoid false positives entirely you will still have to track which have run, and essentially use the bloom filter as a shortcut to determine the a test definitely has not been run.

That said, since you know the number of tests in advance, you can size the filter in such a way as to provide an acceptable error rate using some well-known formulae; for a 1% probability of returning a false positive you need ~9.5 bits/entry, so for one million entries 1.2 megabytes is sufficient. If you reduce the acceptable error rate to 0.1%, this only increases to 1.8 MB.

The Wikipedia article Bloom Filters gives a great analysis, and an interesting overview of the maths involved.

Andy Lynch
+1  A: 

Yes, a lossy hash table or a LRUCache is a data structure with fast O(1) lookup that will only give false negatives -- if you ask if "Have I run test X", it will tell you either "Yes, you definitely have", or "I can't remember".

Forgive the extremely crude pseudocode:

setup_test_table():
    create test_table( some large number of entries )
    clear each entry( test_table, NEVER )
    return test_table

has_test_been_run_before( new_test_details, test_table ):
    index = hash( test_details , test_table.length )
    old_details = test_table[index].detail
    // unconditionally overwrite old details with new details, LRU fashion.
    // perhaps some other collision resolution technique might be better.
    test_table[index].details = new_test_details
    if ( old_details === test_details ) return YES
    else if ( old_details === NEVER ) return NEVER
    else return PERHAPS    

main()
    test_table = setup_test_table();
    loop
        test_details = generate_random_test()
        status = has_test_been_run_before( test_details, test_table )
        case status of
           YES: do nothing;
           NEVER: run test (test_details);
           PERHAPS: if( rand()&1 ) run test (test_details);
    next loop
end.
David Cary