views:

692

answers:

6

I'm trying to find a better way to implement these methods, as over very large sets they take a very long time, any ideas?

import java.util.HashMap;
import java.util.HashSet;

public class Multiset<E> extends HashSet<E> {

    private static final long serialVersionUID = -9013417064272046980L;
    private HashMap<E, Integer> multiplicities = new HashMap<E, Integer>();

    @Override
    public boolean add(E element){
        if(multiplicities.containsKey(element)){
            int x = (int) multiplicities.get(element);
            multiplicities.put(element, ++x);
        }else{
            multiplicities.put(element, 1);
        }
        return super.add(element);    
    }

/**
 * Adds all of the elements of another multiset to this one. 
 * This method allows the preservation of multiplicities
 * which would not occur using the superclass's addAll().
 * @param elements
 * @return true if all elements were successfully added
 */
public boolean addAll(Multiset<E> elements) {
    boolean flag = false;
    for(E element : elements){
        for(int i = 0; i < elements.multiplicity(element); i++)
            flag = add(element);
    }
    return flag;
}

/**
 * The set-view of a multiset is the ordinary set of all 
 * elements with multiplicity >= 1.
 * @return all elements that have multiplicity >= 1
 */
public Multiset<E> setView(){
    Multiset<E> set = new Multiset<E>();
    for(E o : multiplicities.keySet()){
        set.add(o);
    }
    return set;
}

/**
 * provides a union of two multisets whereby the multiplicity of each
 * element is the larger of the two
 * @param second
 * @return
 */
public Multiset<E> union(Multiset<E> second){
    Multiset<E> union = new Multiset<E>();
    Multiset<E> join = new Multiset<E>();
    join.addAll(this);
    join.addAll(second);

    for(E o : join){
        int i = this.multiplicity(o); 
        int j = second.multiplicity(o);
        i = i > j ? i : j;
        for(int c = 0; c < i; c++){
            union.add(o);
        }
    }

    return union;
}

/**
 * provides an intersection of two multisets whereby 
 * the multiplicity of each element is the smaller of the two
 * @param second
 * @return The multiset containing the intersection of two multisets
 */
public Multiset<E> intersect(Multiset<E> second){    

    Multiset<E> intersection = new Multiset<E>();
    for(E o : this.setView()){
        if (second.setView().contains(o)) {
            int i = this.multiplicity(o); 
            int j = second.multiplicity(o);
            i = i < j ? i : j;
            for(int c = 0; c < i; c++){
                intersection.add(o);
            }
        }
    }

    return intersection;        
}

/**
 * The Multiplicity is the number of occurrences of an object 
 * in the multiset
 * @param o
 * @return number of occurrences of o
 */
public int multiplicity(E o){

    return (multiplicities.containsKey(o)) ? multiplicities.get(o) : 0;
}

public int cardinality(){
    int card = 0;
    for(Integer i : multiplicities.values()){
        card += i;
    }

    return card;    
 }

/**
 * Measures the similarity between two multisets
 * @param A
 * @param B
 * @return the cardinality of the difference of A and B 
 */
public int similarityOfMultisets(Multiset<E> second){

    Multiset<E> union, intersection; 
    int difference;

    union = union(second);
    intersection = intersect(second);
    difference = union.cardinality() - intersection.cardinality();

    return difference;

}
}

EDIT:

I believe I have found a faster way to calculate the similarityOfMultisets method:

public int similarityOfMultisets(Multiset<E> second){
    int c = 0;
    for(E elem: this.setView()){
        c += Math.min(this.multiplicity(elem), second.multiplicity(elem));
    }   
    Multiset<E> union = this.union(second);
    return union.cardinality() - c;  
}
A: 

This is what I came up with G-C:

import com.google.common.collect.Multiset;
import com.google.common.collect.Multisets;
import com.google.common.collect.Multiset.Entry;
public class MultisetOp {
    public static void main(String[] args) {
        Multiset<Integer> ms1 = Multisets.newHashMultiset(1, 1, 2, 3, 4, 4, 4);
        Multiset<Integer> ms2 = Multisets.newHashMultiset(1, 2, 3, 3,
            4, 5, 5, 5);
        Multiset<Integer> mu = Multisets.newHashMultiset();
        Multiset<Integer> mi = Multisets.newHashMultiset();
        // -------- UNION START -----------
        for (Entry<Integer> e : ms1.entrySet()) {
            int j = ms2.count(e.getElement());
            mu.add(e.getElement(), Math.max(e.getCount(), j));
        }
        for (Entry<Integer> e : ms2.entrySet()) {
            int j = ms1.count(e.getElement());
            if (j == 0) {
                mu.add(e.getElement(), e.getCount());
            }
        }
        // -------- UNION END -----------

        // -------- INTERSECT START -----------
        for (Entry<Integer> e : ms1.entrySet()) {
            int j = ms2.count(e.getElement());
            if (j > 0) {
                mi.add(e.getElement(), Math.min(e.getCount(), j));
            }
        }
        // -------- INTERSECT END -----------

        System.out.printf("Union: %s%n", mu);
        System.out.printf("Intersection: %s%n", mi);
        System.out.printf("Cardinality: %d%n", mu.size() - mi.size());

    }
}

Result:

[1 x 2, 2, 3 x 2, 4 x 3, 5 x 3]
[1, 2, 3, 4]

Not benchmarked. It seems, your cardinality could be computed with two traversals instead of three.

kd304
A: 

I don't understand the purpose of the setView method...seems like you're just returning a copy of yourself but with the multiplicity set to 1 for each key.

For union try this maybe (might not compile):

public Multiset<E> union(Multiset<E> second) {
    Multiset<E> union = new Multiset<E>();
    union.addAll(this);
    union.addAll(second);

    for(E o : union){
        int multiplicity = Math.max (this.multiplicity(o), second.multiplicity(o));
        union.multiplicities.put (o, multiplicity);
    }

    return union;
}
A: 

I think the problem is that you're invoking second.setView() - recreating that set - for every element in this.setView(). Try this instead:

/**
 * provides an intersection of two multisets whereby 
 * the multiplicity of each element is the smaller of the two
 * @param second
 * @return The multiset containing the intersection of two multisets
 */
public Multiset<E> intersect(Multiset<E> second){    

    Multiset<E> intersection = new Multiset<E>();
    Set<E> other = second.setView();
    for(E o : this.setView()){
        if (other.contains(o)) {
            int i = this.multiplicity(o); 
            int j = second.multiplicity(o);
            i = i < j ? i : j;
            for(int c = 0; c < i; c++){
                intersection.add(o);
            }
        }
    }

    return intersection;        
}
Carl Manaster
surely the optimizer would catch that at compile time?
Robert
Surely not. It can't know that the second time through the loop, the second multiset will produce the identical set. You're asking the computer to regenerate the set, so it does.
Carl Manaster
+1  A: 

Here's a refactoring of the class. Not necessarily faster - except for not re-running setView() inside the loops - but maybe cleaner in some ways. FWIW.

import java.util.HashMap;
import java.util.HashSet;

public class Multiset<E> extends HashSet<E> {
    private static final long   serialVersionUID = -9013417064272046980L;
    private final HashMap<E, Integer> multiplicities  = new HashMap<E, Integer>();

    public boolean add(E element) {
     return add(element, 1);
    }

    private boolean add(E element, int copies) {
     if (!contains(element))
      multiplicities.put(element, 0);
     int n = multiplicities.get(element);
     multiplicities.put(element, n + copies);
     return super.add(element);
    }

    /**
     * Adds all of the elements of another multiset to this one. This method allows the preservation of multiplicities which would not occur
     * using the superclass's addAll().
     * 
     * @param that
     * @return true if all elements were successfully added
     */
    public boolean addAll(Multiset<E> that) {
     boolean flag = false;
     for (E element : that)
      flag = add(element, that.multiplicity(element));
     return flag;
    }

    /**
     * The set-view of a multiset is the ordinary set of all elements with multiplicity >= 1.
     * 
     * @return all elements that have multiplicity >= 1
     */
    public Multiset<E> setView() {
     Multiset<E> set = new Multiset<E>();
     for (E o : multiplicities.keySet())
      set.add(o);
     return set;
    }

    /**
     * provides a union of two multisets whereby the multiplicity of each element is the larger of the two
     * 
     * @param that
     * @return
     */
    public Multiset<E> union(Multiset<E> that) {
     HashSet<E> both = new HashSet<E>();
     both.addAll(this);
     both.addAll(that);
     Multiset<E> union = new Multiset<E>();
     for (E element : both)
      union.add(element, Math.max(this.multiplicity(element), that.multiplicity(element)));
     return union;
    }

    /**
     * provides an intersection of two multisets whereby the multiplicity of each element is the smaller of the two
     * 
     * @param that
     * @return The multiset containing the intersection of two multisets
     */
    public Multiset<E> intersect(Multiset<E> that) {
     Multiset<E> intersection = new Multiset<E>();
     final Multiset<E> other = that.setView();
     for (E element : this.setView())
      if (other.contains(element))
       intersection.add(element, Math.min(this.multiplicity(element), that.multiplicity(element)));
     return intersection;
    }

    /**
     * The Multiplicity is the number of occurrences of an object in the multiset
     * 
     * @param element
     * @return number of occurrences of o
     */
    public int multiplicity(E element) {
     return contains(element) ? multiplicities.get(element) : 0;
    }

    public int cardinality() {
     int card = 0;
     for (Integer n : multiplicities.values())
      card += n;
     return card;
    }

    /**
     * Measures the similarity between two multisets
     * 
     * @param that
     * @return the cardinality of the difference of A and B
     */
    public int similarityOfMultisets(Multiset<E> that) {
     return union(that).cardinality() - intersect(that).cardinality();
    }
}
Carl Manaster
Can we performance-test our solutions somehow? We need test data from somewhere.
kd304
edited original post with better similarity method
Robert
+2  A: 

Performance test result for the first varians of our algorithms:

Robert-Union: 2263374 us
Robert-Intersection: 603134 us
Robert-Similarity: 2926389 us
Carl-Union: 3372 us
Carl-Intersection: 5097 us
Carl-Similarity: 6913 us
David-Union: 5182 us
David-Intersection: 2527 us
David-Similarity: 5270 us

Carl's union beats my union.

Test code here. I did not verify the correctness of the computation output though.

Test code 2 for various set sizes and variances here (JDK 7b59). Results xslx / ods.

kd304
Can't explain the slowness in my union. Maybe because I use a beta G-C library? I tried to change my union to Carl's way but it even gets slower.
kd304
+1 for actual profiling - it's the only way to have anything smart to say about performance.
Carl Manaster
+1 I concur with carl
Robert
how does my new similarity method compare?
Robert
I did not test it, but apply your changes to the code I linked and run it. If it is the same as Carl's solution, I would expect similar values.
kd304
sorry, i'm not just being lazy - I'm now not at work and don't have access to a jdk on this computer
Robert
I tested your new similarity code, but without a more performant union, its still slow. Before: 2.870.877 us - After: 2.226.680 us, about 500 ms improvement.
kd304
A: 

Consider extending LinkedHashSet which will give you faster iteration through your collection. This will speed up things like union and intersection operations.

z5h