I have a set s of unique maps (currently Java HashMaps) and wish to remove from it any map that is completely contained by some other map in the set (i.e. remove m from s if m.entrySet() is a subset of n.entrySet() for some other n in s).

I have an n^2 algorithm, but it's too slow. Is there a more efficient way to do this?

Edit:

The set of possible keys is small, if that helps.

Here is an inefficient reference implementation:

public void removeSubmaps(Set<Map> s) {
    Set<Map> toRemove = new HashSet<Map>();
    for (Map a : s) {
        for (Map b : s) {
            // skip a == b, or every map would mark itself for removal
            if (a != b && a.entrySet().containsAll(b.entrySet()))
                toRemove.add(b);
        }
    }
    s.removeAll(toRemove);
}
+2  A: 

Not sure I can make this anything other than an n^2 algorithm, but I have a shortcut that might make it faster. Make a list of your maps together with the size of each, and sort it by size. A submap must be smaller than or equal in size to the map that contains it, so there's never any need to compare a map against one earlier in the sorted list.
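
For illustration, a minimal sketch of this shortcut (the method name is mine, not from the answer) - after sorting by size, the inner loop only ever looks forward, at maps of equal or greater size:

    private <K, V> void removeSubmapsSorted(Set<Map<K, V>> s) {
        // sort ascending by size: a map can only be contained by a map at the
        // same position or later, so the inner loop never needs to look back
        List<Map<K, V>> sorted = new ArrayList<Map<K, V>>(s);
        Collections.sort(sorted, new Comparator<Map<K, V>>() {
            public int compare(Map<K, V> a, Map<K, V> b) {
                return a.size() - b.size();
            }
        });
        Set<Map<K, V>> toRemove = new HashSet<Map<K, V>>();
        for (int i = 0; i < sorted.size(); i++)
            for (int j = i + 1; j < sorted.size(); j++)
                if (sorted.get(j).entrySet().containsAll(sorted.get(i).entrySet()))
                    toRemove.add(sorted.get(i));
        s.removeAll(toRemove);
    }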

Mark Ransom
Thanks - I thought of this, but it doesn't help much. A typical case I'm looking at right now has 10000 maps of size 3 and 40000 maps of size 4, so I still have to do 400m comparisons. Better than 2500m comparisons, but not better enough...
rattigan
+1  A: 

Here's another stab at it.

Decompose all your maps into a list of (key, value, map number) triples. Sort the list by key and value. Go through the list, and for each group of triples with matching key/value, generate every pair of the map numbers in the group - these are all the potential subset relationships. When you have the final list of pairs, sort it by map numbers. Go through this second list and count the occurrences of each pair - if the count matches the size of one of the two maps, you've found a subset.
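
For what it's worth, a rough sketch of this scheme (the helper name and structure are mine; it uses a count map keyed by the pair rather than sorting a pair list, which is the refinement Mark suggests in the comments below):

    private <K, V> Set<Map<K, V>> findSubmaps(List<Map<K, V>> maps) {
        // group map numbers by key/value pair (entries are copied,
        // since HashMap entries are live views of the map)
        Map<Map.Entry<K, V>, List<Integer>> groups =
                new HashMap<Map.Entry<K, V>, List<Integer>>();
        for (int i = 0; i < maps.size(); i++) {
            for (Map.Entry<K, V> e : maps.get(i).entrySet()) {
                Map.Entry<K, V> key = new AbstractMap.SimpleImmutableEntry<K, V>(e);
                List<Integer> group = groups.get(key);
                if (group == null)
                    groups.put(key, group = new ArrayList<Integer>());
                group.add(i);
            }
        }
        // count how many entries each ordered pair of maps shares
        Map<List<Integer>, Integer> shared = new HashMap<List<Integer>, Integer>();
        for (List<Integer> group : groups.values())
            for (int a : group)
                for (int b : group)
                    if (a != b) {
                        List<Integer> pair = Arrays.asList(a, b);
                        Integer count = shared.get(pair);
                        shared.put(pair, count == null ? 1 : count + 1);
                    }
        // if pair (a, b) shares every entry of map a, then a is a submap of b
        Set<Map<K, V>> submaps = new HashSet<Map<K, V>>();
        for (Map.Entry<List<Integer>, Integer> e : shared.entrySet()) {
            int a = e.getKey().get(0);
            if (e.getValue() == maps.get(a).size())
                submaps.add(maps.get(a));
        }
        return submaps;
    }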

Mark Ransom
This looks like it should do the trick. I'm going to code it up and see. Nice idea, Mark!
rattigan
I would have coded it up before accepting the answer, but thanks! I should have pointed out that this will have very bad worst case performance. I thought of one more optimization: instead of a list of map number pairs, generate a map using the pairs as the key. Each time you generate a pair, increment the value at that key. No need to sort the result when you're done.
Mark Ransom
Ah, right you are - I've unaccepted for now. I think the problem you're referring to is when the key/value groups are large - this results in up to n^2 pairs being generated per key/value in the worst case.
rattigan
I've added an answer of my own which seems to work well in practice. I'm not sure what the complexity is, but it takes very little time compared to my naive solution. Thanks for pointing me in the direction of indexing by value.
rattigan
A: 

Edit: My original interpretation of the problem was incorrect; here is a new answer based on my re-reading of the question.

You can create a custom hash function for HashMap that returns the product of the hash values of all its entries. Sort the list of hash values and, starting from the biggest value, find all of its divisors among the smaller hash values; these are the possible subsets of that hashmap. Use set.containsAll() to confirm before marking them for removal.

This effectively transforms the problem into the mathematical problem of finding the possible divisors in a collection, and you can apply all the common divisor-search optimizations.

Complexity is O(n^2), but if many hashmaps are subsets of others the actual time spent can be a lot better, approaching O(n) in the best-case scenario (if all hashmaps are subsets of a single one). And even in the worst-case scenario, a division is a lot faster than a call to set.containsAll(), which is itself O(n) in the number of items in a hashmap.

You might also want to create a simple hash function for hashmap entry objects that returns smaller numbers, to increase multiply/division performance.
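
A sketch of the idea, under one extra assumption that isn't in the answer: each distinct key/value pair is assigned its own prime, so that one map's product divides another's exactly when its entries form a subset. BigInteger sidesteps the overflow issue raised in the comments below, at some cost in speed:

    // assumes one unique prime per distinct key/value pair; then
    // signature(a) divides signature(b) iff a's entries are a subset of b's
    private <K, V> Map<Map<K, V>, BigInteger> signatures(Set<Map<K, V>> maps) {
        Map<Map.Entry<K, V>, BigInteger> primes =
                new HashMap<Map.Entry<K, V>, BigInteger>();
        BigInteger nextPrime = BigInteger.valueOf(2);
        Map<Map<K, V>, BigInteger> result = new HashMap<Map<K, V>, BigInteger>();
        for (Map<K, V> m : maps) {
            BigInteger product = BigInteger.ONE;
            for (Map.Entry<K, V> e : m.entrySet()) {
                Map.Entry<K, V> key = new AbstractMap.SimpleImmutableEntry<K, V>(e);
                BigInteger p = primes.get(key);
                if (p == null) {
                    primes.put(key, p = nextPrime);
                    nextPrime = nextPrime.nextProbablePrime();
                }
                product = product.multiply(p);
            }
            result.put(m, product);
        }
        return result;
    }

Sorting the signatures and testing big.mod(small).signum() == 0 then yields the candidate submaps, which containsAll() confirms, as the answer describes.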

Bill Yang
It seems like overflow would be a problem here: both the division and the sort would be broken by overflows. It also seems like a Bloom filter could be used to achieve something similar.
rattigan
Overflow can be avoided by creating a custom hash function for hashmap entries that returns relatively small numbers. The Bloom filter does seem interesting - a slightly modified version might work even better. Good point!
Bill Yang
A: 

This is what I ended up doing. It works well in my situation, as there is usually some value that is only shared by a small number of maps. Kudos to Mark Ransom for pushing me in this direction.

In prose: Index the maps by key/value pair, so that each key/value pair is associated with the set of maps containing it. Then, for each map, find the smallest set associated with one of its key/value pairs; this set is typically small for my data. Each of the maps in this set is a potential 'supermap'; no other map could be a supermap, as it would not contain this key/value pair. Search this set for a supermap. Finally, remove all the identified submaps from the original set.

private <K, V> void removeSubmaps(Set<Map<K, V>> maps) {
    // index the maps by key/value pair
    // (toList copies the set into a list, giving each map a stable index;
    // LazyMap lazily creates any missing nested map/list on first access)
    List<Map<K, V>> mapList = toList(maps);
    Map<K, Map<V, List<Integer>>> values = LazyMap.create(HashMap.class, ArrayList.class);
    for (int i = 0, uniqueRowsSize = mapList.size(); i < uniqueRowsSize; i++) {
        Map<K, V> row = mapList.get(i);
        Integer idx = i;
        for (Map.Entry<K, V> entry : row.entrySet()) 
            values.get(entry.getKey()).get(entry.getValue()).add(idx);
    }

    // find submaps
    Set<Map<K, V>> toRemove = Sets.newHashSet();
    for (Map<K, V> submap : mapList) {
        // find the smallest set of maps with a matching key/value
        List<Integer> smallestList = null;
        for (Map.Entry<K, V> entry : submap.entrySet()) {
            List<Integer> list = values.get(entry.getKey()).get(entry.getValue());
            if (smallestList == null || list.size() < smallestList.size())
                smallestList = list;
        }

        // compare with each of the maps in that set
        for (int i : smallestList) {
            Map<K, V> map = mapList.get(i);
            if (isSubmap(submap, map))
                toRemove.add(submap);
        }
    }

    maps.removeAll(toRemove);
}

private <K,V> boolean isSubmap(Map<K, V> submap, Map<K,V> map){
    if (submap.size() >= map.size())
        return false; // only a strictly smaller map can be a proper submap
    for (Map.Entry<K,V> entry : submap.entrySet()) {
        V other = map.get(entry.getKey());
        if (other == null)
            return false;
        if (!other.equals(entry.getValue()))
            return false;
    }
    return true;
}
rattigan
Hmm, there might be a bug in this if two of your maps are identical: I think both will be removed. Exercise for the reader...
rattigan
A: 

Here's a subquadratic (O(N^2 / log N)) algorithm for finding maximal sets from a set of sets: An Old Sub-Quadratic Algorithm for Finding Extremal Sets.

But if you know your data distribution, you can do much better in the average case.

Ants Aasma
Thanks for finding this, it looks to be relevant. At least I know what the algorithm is called now... Unfortunately it's an expensive algorithm. The solution I've added seems to achieve good results for my data.
rattigan