ansaurus

Question

slow implementation of Union and Intersection

Answer 1

A:

This is what I came up with G-C:

import com.google.common.collect.Multiset;
import com.google.common.collect.Multisets;
import com.google.common.collect.Multiset.Entry;
public class MultisetOp {
    public static void main(String[] args) {
        Multiset<Integer> ms1 = Multisets.newHashMultiset(1, 1, 2, 3, 4, 4, 4);
        Multiset<Integer> ms2 = Multisets.newHashMultiset(1, 2, 3, 3,
            4, 5, 5, 5);
        Multiset<Integer> mu = Multisets.newHashMultiset();
        Multiset<Integer> mi = Multisets.newHashMultiset();
        // -------- UNION START -----------
        for (Entry<Integer> e : ms1.entrySet()) {
            int j = ms2.count(e.getElement());
            mu.add(e.getElement(), Math.max(e.getCount(), j));
        }
        for (Entry<Integer> e : ms2.entrySet()) {
            int j = ms1.count(e.getElement());
            if (j == 0) {
                mu.add(e.getElement(), e.getCount());
            }
        }
        // -------- UNION END -----------

        // -------- INTERSECT START -----------
        for (Entry<Integer> e : ms1.entrySet()) {
            int j = ms2.count(e.getElement());
            if (j > 0) {
                mi.add(e.getElement(), Math.min(e.getCount(), j));
            }
        }
        // -------- INTERSECT END -----------

        System.out.printf("Union: %s%n", mu);
        System.out.printf("Intersection: %s%n", mi);
        System.out.printf("Cardinality: %d%n", mu.size() - mi.size());

    }
}

Result:

[1 x 2, 2, 3 x 2, 4 x 3, 5 x 3]
[1, 2, 3, 4]

Not benchmarked. It seems, your cardinality could be computed with two traversals instead of three.

kd304 2009-07-15 15:00:27

Answer 2

A:

I don't understand the purpose of the setView method...seems like you're just returning a copy of yourself but with the multiplicity set to 1 for each key.

For union try this maybe (might not compile):

public Multiset<E> union(Multiset<E> second) {
    Multiset<E> union = new Multiset<E>();
    union.addAll(this);
    union.addAll(second);

    for(E o : union){
        int multiplicity = Math.max (this.multiplicity(o), second.multiplicity(o));
        union.multiplicities.put (o, multiplicity);
    }

    return union;
}

2009-07-15 15:01:31

Answer 3

A:

I think the problem is that you're invoking second.setView() - recreating that set - for every element in this.setView(). Try this instead:

/**
 * provides an intersection of two multisets whereby 
 * the multiplicity of each element is the smaller of the two
 * @param second
 * @return The multiset containing the intersection of two multisets
 */
public Multiset<E> intersect(Multiset<E> second){    

    Multiset<E> intersection = new Multiset<E>();
    Set<E> other = second.setView();
    for(E o : this.setView()){
        if (other.contains(o)) {
            int i = this.multiplicity(o); 
            int j = second.multiplicity(o);
            i = i < j ? i : j;
            for(int c = 0; c < i; c++){
                intersection.add(o);
            }
        }
    }

    return intersection;        
}

Carl Manaster 2009-07-15 15:21:26

surely the optimizer would catch that at compile time?

Robert 2009-07-15 15:34:28

Surely not. It can't know that the second time through the loop, the second multiset will produce the identical set. You're asking the computer to regenerate the set, so it does.

Carl Manaster 2009-07-15 15:38:00

Answer 4

+1 A:

Here's a refactoring of the class. Not necessarily faster - except for not re-running setView() inside the loops - but maybe cleaner in some ways. FWIW.

import java.util.HashMap;
import java.util.HashSet;

public class Multiset<E> extends HashSet<E> {
    private static final long   serialVersionUID = -9013417064272046980L;
    private final HashMap<E, Integer> multiplicities  = new HashMap<E, Integer>();

    public boolean add(E element) {
     return add(element, 1);
    }

    private boolean add(E element, int copies) {
     if (!contains(element))
      multiplicities.put(element, 0);
     int n = multiplicities.get(element);
     multiplicities.put(element, n + copies);
     return super.add(element);
    }

    /**
     * Adds all of the elements of another multiset to this one. This method allows the preservation of multiplicities which would not occur
     * using the superclass's addAll().
     * 
     * @param that
     * @return true if all elements were successfully added
     */
    public boolean addAll(Multiset<E> that) {
     boolean flag = false;
     for (E element : that)
      flag = add(element, that.multiplicity(element));
     return flag;
    }

    /**
     * The set-view of a multiset is the ordinary set of all elements with multiplicity >= 1.
     * 
     * @return all elements that have multiplicity >= 1
     */
    public Multiset<E> setView() {
     Multiset<E> set = new Multiset<E>();
     for (E o : multiplicities.keySet())
      set.add(o);
     return set;
    }

    /**
     * provides a union of two multisets whereby the multiplicity of each element is the larger of the two
     * 
     * @param that
     * @return
     */
    public Multiset<E> union(Multiset<E> that) {
     HashSet<E> both = new HashSet<E>();
     both.addAll(this);
     both.addAll(that);
     Multiset<E> union = new Multiset<E>();
     for (E element : both)
      union.add(element, Math.max(this.multiplicity(element), that.multiplicity(element)));
     return union;
    }

    /**
     * provides an intersection of two multisets whereby the multiplicity of each element is the smaller of the two
     * 
     * @param that
     * @return The multiset containing the intersection of two multisets
     */
    public Multiset<E> intersect(Multiset<E> that) {
     Multiset<E> intersection = new Multiset<E>();
     final Multiset<E> other = that.setView();
     for (E element : this.setView())
      if (other.contains(element))
       intersection.add(element, Math.min(this.multiplicity(element), that.multiplicity(element)));
     return intersection;
    }

    /**
     * The Multiplicity is the number of occurrences of an object in the multiset
     * 
     * @param element
     * @return number of occurrences of o
     */
    public int multiplicity(E element) {
     return contains(element) ? multiplicities.get(element) : 0;
    }

    public int cardinality() {
     int card = 0;
     for (Integer n : multiplicities.values())
      card += n;
     return card;
    }

    /**
     * Measures the similarity between two multisets
     * 
     * @param that
     * @return the cardinality of the difference of A and B
     */
    public int similarityOfMultisets(Multiset<E> that) {
     return union(that).cardinality() - intersect(that).cardinality();
    }
}

Carl Manaster 2009-07-15 16:25:47

Can we performance-test our solutions somehow? We need test data from somewhere.

kd304 2009-07-15 16:47:46

edited original post with better similarity method

Robert 2009-07-15 17:12:18

Answer 5

+2 A:

Performance test result for the first varians of our algorithms:

Robert-Union: 2263374 us
Robert-Intersection: 603134 us
Robert-Similarity: 2926389 us
Carl-Union: 3372 us
Carl-Intersection: 5097 us
Carl-Similarity: 6913 us
David-Union: 5182 us
David-Intersection: 2527 us
David-Similarity: 5270 us

Carl's union beats my union.

Test code here. I did not verify the correctness of the computation output though.

Test code 2 for various set sizes and variances here (JDK 7b59). Results xslx / ods.

kd304 2009-07-15 17:16:14

Can't explain the slowness in my union. Maybe because I use a beta G-C library? I tried to change my union to Carl's way but it even gets slower.

kd304 2009-07-15 17:41:19

+1 for actual profiling - it's the only way to have anything smart to say about performance.

Carl Manaster 2009-07-15 17:41:25

+1 I concur with carl

Robert 2009-07-15 18:30:40

how does my new similarity method compare?

Robert 2009-07-15 18:41:10

I did not test it, but apply your changes to the code I linked and run it. If it is the same as Carl's solution, I would expect similar values.

kd304 2009-07-15 18:48:42

sorry, i'm not just being lazy - I'm now not at work and don't have access to a jdk on this computer

Robert 2009-07-15 18:50:12

I tested your new similarity code, but without a more performant union, its still slow. Before: 2.870.877 us - After: 2.226.680 us, about 500 ms improvement.

kd304 2009-07-15 19:07:00

Answer 6

A:

Consider extending LinkedHashSet which will give you faster iteration through your collection. This will speed up things like union and intersection operations.

z5h 2009-10-30 01:51:24

ansaurus

tags:

views:

answers:

slow implementation of Union and Intersection

related questions