views:

208

answers:

2

A Google Collections Multiset is a set of elements each of which has a count (i.e. may be present multiple times).

I can't tell you how many times I want to do the following

  1. Make a histogram (exactly Multiset)
  2. Get the top N elements by count from the histogram

Examples: top 10 URLs (by # times mentioned), top 10 tags (by # times applied), ...

What is the canonical way to do #2 given a Google Collections Multiset?

Here is a blog post about it, but that code is not quite what I want. First, it returns everything, not just top N. Second, it copies (is it possible to avoid a copy?). Third, I usually want a deterministic sort, i.e. tiebreak if counts are equal. Other nits: it's not static, etc.

+2  A: 

To give another perspective for people to comment on, I'll post a slightly modified version of the blog post I referenced:

package com.blueshiftlab.twitterstream.summarytools;

import com.google.common.collect.ImmutableList;
import com.google.common.collect.Multiset;
import com.google.common.collect.Ordering;
import com.google.common.collect.Multiset.Entry;

public class Multisets {
    // Don't construct one
    private Multisets() {
    }

    public static <T> ImmutableList<Entry<T>> sortedByCount(Multiset<T> multiset) {
        Ordering<Multiset.Entry<T>> countComp = new Ordering<Multiset.Entry<T>>() {
            public int compare(Multiset.Entry<T> e1, Multiset.Entry<T> e2) {
                return e2.getCount() - e1.getCount();
            }
        };
        return countComp.immutableSortedCopy(multiset.entrySet());
    }

    public static <T> ImmutableList<Entry<T>> topByCount(Multiset<T> multiset,
            int max) {
        ImmutableList<Entry<T>> sortedByCount = sortedByCount(multiset);
        if (sortedByCount.size() > max) {
            sortedByCount = sortedByCount.subList(0, max);
        }

        return sortedByCount;
    }
}
dfrankow
If I understand correctly, this solution will copy and sort the entire collection every time you want to retrieve the top N elements. I'm not sure what your requirements are, but the heap-sort-ish solution beats this in both time and space so I'm not sure what the benefit is.
danben
You are optimizing for speed, I am looking for fewest # of lines of code written by me.
dfrankow
I see - that was not clear from your post, especially since you asked about avoiding making a copy.
danben
Sorry about that.
dfrankow
careful, your comparator is sorting by count descending
nimcap
Good point. That is by design, but not called out explicitly. "top N" usually means descending sort.
dfrankow
@dfrankow: I know, but if I called `sortedByCount()` I would expect it to be the other way around :)
nimcap
+1  A: 

I wrote methods with the basic functionality you're asking for, except that they perform copies and lack deterministic tie-breaking logic. They're currently internal to Google, but we may open-source them at some point. This Guava issue has the method signatures.

Their algorithm is similar to the blog post: sorting a list of entries. It would be faster, but more complicated, to use a better selection algorithm.

Jared Levy