For a map where the key represents a number in a sequence and the value represents how often that number appeared in the sequence, what would a Java implementation of an algorithm to calculate the median look like?

For example:

1,1,2,2,2,2,3,3,3,4,5,6,6,6,7,7

in a map:

Map<Integer, Integer> map = ...
map.put(1, 2);
map.put(2, 4);
map.put(3, 3);
map.put(4, 1);
map.put(5, 1);
map.put(6, 3);
map.put(7, 2);

double median = calculateMedian(map);
print(median);

would result in:

> print(median);
3
>

So what I am looking for is a Java implementation of calculateMedian.

+2  A: 
  • Use a SortedMap, i.e. a TreeMap
  • Iterate through the map once to calculate the total number of elements, i.e. the sum of all occurrences
  • Iterate again and add up occurrences until you've reached half of the total. The number that caused the sum to exceed half of the total is the median
  • Test extensively for off-by-one errors
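
A minimal sketch of this recipe, assuming a SortedMap<Integer, Integer> as suggested above and averaging the two middlemost values when the total is even:

import java.util.Map;
import java.util.SortedMap;

static double calculateMedian(SortedMap<Integer, Integer> map) {
  // pass 1: total number of elements, i.e. the sum of all occurrences
  int total = 0;
  for (int count : map.values()) {
    total += count;
  }
  // 0-based indices of the one or two middlemost values
  int lowerIndex = (total - 1) / 2;
  int upperIndex = total / 2;
  // pass 2: walk the keys in ascending order until the running
  // sum of occurrences covers both middle positions
  Integer lowerValue = null;
  int seen = 0;
  for (Map.Entry<Integer, Integer> e : map.entrySet()) {
    seen += e.getValue();
    if (lowerValue == null && seen > lowerIndex) {
      lowerValue = e.getKey();
    }
    if (seen > upperIndex) {
      return (lowerValue + e.getKey()) / 2.0;
    }
  }
  throw new IllegalArgumentException("empty map");
}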
Michael Borgwardt
half the total? Half the total is going to drop you near the element that is almost, but not quite the mean, if you are lucky. If you have 'n' elements in your SortedMap, the median is going to be the element at 'n/2'.
McBeth
Nice approach, but it needs a bit more refining... if you have a list 1,2,2,4,4,5 your algorithm would return 2 or 4 depending on insertion order, when the correct median value would be 3.
kasperjj
@McBeth: the mean is not what is wanted. And the median is *not* necessarily the element at n/2 due to the occurrences count
Michael Borgwardt
@kasperjj: Yes, that's a rare special case that would need extra code if you really need to support it.
Michael Borgwardt
+1 as it seems that basically I've written the same thing (I did mention testing for odd/even); also, since the OP talks about sequences and counts, would you really need a SortedMap?
Unreason
@Michael, it is not the mean that kasperjj shows - it is the median: for an even number of elements (6 in this case) the median is the 3rd element (2) plus the 4th element (4), divided by two = 3.
Unreason
@Unreason: SortedMap allows inserting additional elements efficiently. Otherwise, a list of (number, occurrences) tuples sorted by number would be just as good.
Michael Borgwardt
@Michael, as I said - the OP talks about *sequences*, which imho means no sorting is needed, as we know the biggest and smallest key in advance; updating the counts (equivalent to inserting into the expanded data) would be done by simply adding to/updating the value associated with a certain key.
Unreason
@Unreason: ah, but what if you need to insert a number that was previously not present at all.
Michael Borgwardt
@Michael, as I said, the OP talks about *sequences*, which means that you will always have numbers from min..max. So, unless the OP edits the question, that's what it is (actually, the data structure would not break if a few numbers from the sequence were missing - you could have a count of zero and our algorithms would still work; expanding or reducing the sequence would work, too). Furthermore, you seem to be assuming that updating the data is important (the question does not ask about that); if so, state that in the answer - as written, it sounds as if it were somehow relevant to the original question.
Unreason
Let's wait a few months, maybe someone will finally show a working example. Thanks for the idea Michael, this one I had myself.
Chris
+1  A: 

For an easy but maybe not-so-efficient algorithm I'd do it like this:

1. Expand the map to a list.

Practically speaking: iterate through the map and add each key 'value' times to the new list. Finally, sort the list.

//...
List<Integer> field = new ArrayList<Integer>();
for (Integer key : map.keySet()) {
  // add each key 'count' times to the list
  for (int i = 0; i < map.get(key); i++) {
    field.add(key);
  }
}
Collections.sort(field);

2. Calculate the median

Now you have to implement a method double calculateMedian(List<Integer> sorted). This depends on the kind of median you need. If it's just the sample median, the result is either the middlemost value (for lists with an odd number of elements) or the average of the two middlemost values (for lists of even length). Note that the list needs to be sorted!

(Ref: Sample Median / wikipedia)
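
A minimal sketch of such a method, assuming the list is already sorted:

static double calculateMedian(List<Integer> sorted) {
  int n = sorted.size();
  if (n % 2 == 1) {
    return sorted.get(n / 2);  // middlemost value
  }
  // average of the two middlemost values
  return (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
}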


OK, OK, even though Chris didn't mention efficiency, here's an idea of how to calculate the sample median (!) without expanding the map...

Set<Integer> sortedKeys = new TreeSet<Integer>(map.keySet()); // just to be sure ;)
Double median = null;  // using a wrapper type to have an 'invalid/not found/etc' state
int total = 0;
for (Integer key : sortedKeys) {
  total += map.get(key);
}
if (total % 2 == 1) {          // odd - I don't have to implement isOddNumber, do I?
  int counter = total / 2;     // 0-based index of the middlemost value
  for (Integer key : sortedKeys) {
    counter -= map.get(key);
    if (counter < 0) {
      // the sample median falls into this bin
      median = (double) key;
      break;
    }
  }
} else {
  int lower = total / 2 - 1;   // 0-based indices of the two middlemost values
  int upper = total / 2;
  Integer lowerValue = null;
  for (Integer key : sortedKeys) {
    lower -= map.get(key);
    upper -= map.get(key);
    if (lowerValue == null && lower < 0) {
      // the lower middlemost value is in this bin
      lowerValue = key;
    }
    if (upper < 0) {
      // both middlemost values found - average them
      // (they may well be in the same bin)
      median = (lowerValue + key) / 2.0;
      break;
    }
  }
}

(I have no compiler at hand - if it has too many syntax errors, treat it as pseudo code, please ;) )

Andreas_D
@Andreas: +1 That is how I would do it...
Martijn Courteaux
-1: the point is, I think, that Chris does *not* want to expand the list, since it could be very inefficient indeed.
Michael Borgwardt
I agree with Michael; though the answer is very clear, it unnecessarily expands the list, wasting a lot of memory, while the algorithms that solve this without expanding the list are quite simple (therefore I simply can't see a justification for this).
Unreason
Simple? Maybe in terms of lines of code, but an algorithm that is based on expanding into a list is much easier to read and understand. AND - now we have a very specialized algorithm for a special map. Reusability of the median calculation part is close to zero.
Andreas_D
Re complexity, I *do* agree your approach is *simpler*, however the second implementation is *not* complicated: two loops and some counting. The simplicity of the first approach comes with memory requirements that are x times higher (x is the average of all the counts; for the sample data in the question this is ~2.3, in a real case it can be any integer, let us assume 100). Also, when expanding, the complexity of the sort grows by log x, and the loop times grow by x. So the algorithm uses more memory and is slower.
Unreason
Re reusability, given the fact that x can be any integer, let's assume some bigger value such as 100,000; at the same time imagine a few thousand keys. This especially becomes important if you want to reuse the code, as it has to apply to the typical domains of the data structures for which you want to reuse it (and in this case I would say that the typical scenario for using the map from the OP is exactly when you want to save the space it would take if stored as a list, so relatively big counts are not a rare case, but the typical case).
Unreason
+3  A: 

Linear time

If you know the total count of the numbers (in your case it is 16) you can go from the beginning or the end of the map and sum up the counts until you get to the round(n/2)-th element, or, in case the total is even, to the average of the (n/2)-th and (n/2+1)-th elements; that average is the median.

If you don't know the total count you will have to go through all of them at least once.

Sublinear time

If you can decide on the data structure and can do pre-processing, see the Wikipedia article on selection algorithms - you might even get a sublinear algorithm. You can also get sublinear time if you know something about the distribution of the data.

EDIT: So, under the assumption that we have a sequence with counts, what we can do is:

  • while inserting the key -> count pairs, maintain another map: key -> running_total
  • this way you will have a structure in which you will be able to get total_count by looking at the last key's running_total
  • and you will be able to do a binary search to locate the element whose running_total first reaches total_count/2

This will double the memory usage, but will give O(log n) performance for median and O(1) for total_count.
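
A sketch of this idea; the class name MedianMap is illustrative, and for brevity the running totals are rebuilt on each query rather than maintained on insert as described above:

import java.util.Map;
import java.util.TreeMap;

class MedianMap {
  private final TreeMap<Integer, Integer> counts = new TreeMap<>(); // key -> count
  private int total = 0; // total_count, updated on insert, O(1) to read

  void add(int key, int count) {
    counts.merge(key, count, Integer::sum);
    total += count;
  }

  double median() {
    // key -> running_total, stored inverted as running_total -> key so
    // that TreeMap.ceilingEntry() performs the binary search for us
    TreeMap<Integer, Integer> cumulative = new TreeMap<>();
    int running = 0;
    for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
      running += e.getValue();
      cumulative.put(running, e.getKey());
    }
    // look up the 1-based positions of the middlemost element(s)
    int upperValue = cumulative.ceilingEntry(total / 2 + 1).getValue();
    if (total % 2 == 1) {
      return upperValue;
    }
    int lowerValue = cumulative.ceilingEntry(total / 2).getValue();
    return (lowerValue + upperValue) / 2.0;
  }
}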

Unreason
+1 I actually use this approach sometimes to calculate the median, as no additional sorting is required. If you work on bounded, discrete values (with a low upper bound), you can even bucket-sort (like, create a histogram).
zerm
@Rafał, actually this assumes that access to a key is O(1) and not much else (the OP specified key values to be in a certain range and I assume no holes => sorted); also, it is the `running_total` that is important here, I simply kept the data structure the same as in the OP.
Unreason
+3  A: 

Using Guava:

Multiset<Integer> values = TreeMultiset.create();
Collections.addAll(values, 1,1,2,2,2,2,3,3,3,4,5,6,6,6,7,7);

Now the answer to your question is:

return Iterables.get(values, (values.size() - 1) / 2);

Really. That's it. (Or check if size is even and average the two central values, to be precise about it.)

If the counts are particularly large, it would be faster to use the multiset's entrySet and keep a running sum, but the simplest way is usually fine.
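
A sketch of that entrySet variant; the method name is mine, and it handles only the odd-size case for brevity:

import com.google.common.collect.Multiset;

// walk the multiset's entries (element plus count) with a running sum
// instead of stepping over every individual occurrence
static int medianFromEntries(Multiset<Integer> values) {
  int middle = (values.size() - 1) / 2; // 0-based index of the median
  int seen = 0;
  for (Multiset.Entry<Integer> entry : values.entrySet()) { // ascending order for a TreeMultiset
    seen += entry.getCount();
    if (seen > middle) {
      return entry.getElement();
    }
  }
  throw new IllegalArgumentException("empty multiset");
}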

Kevin Bourrillion
Of course, in this particular toy example you're probably better off creating and then sorting an `ArrayList` instead of using a `TreeMultiset`, but in real life this might not be memory-friendly.
Kevin Bourrillion
That's pretty slick!
BalusC
And how would an example for a map look? Sorry, but I know how to calculate the median of a simple sequence; for that I don't need a framework.
Chris
Why would you want to use a map? The `TreeMultiset` implementation internally uses a map, but presents an API to you that's more appropriate for the kind of thing you're doing. It is not a "simple sequence", it just can appear like one if you want it to.
Kevin Bourrillion