I have a file containing many random integers (around a million), each separated by whitespace. I need to find the 10 most frequently occurring numbers in that file. What is the most efficient way of doing this in Java? I can think of two approaches:

1. Create a hash map where the key is the integer from the file and the value is its count. For every number in the file, check whether that key already exists in the hash map; if yes, increment the value, else make a new entry in the map.
2. Build a BST where each node holds an integer from the file. For every integer from the file, check whether there is already a node for it in the BST; if yes, increment a count stored in the node.

I feel a hash map is the better option if I can come up with a good hashing function. Can someone please suggest the best way of doing this? Is there any other efficient algorithm that I can use?

+3  A: 

Java handles hashing. You don't need to write a hash function. Just start pushing stuff in the hash map.
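
For what it's worth, a minimal sketch of that idea (Map.merge is only available from Java 8 on, and `numbers` stands in for the values parsed from your file):

import java.util.HashMap;
import java.util.Map;

public class HashMapCount {
    public static void main(String[] args) {
        int[] numbers = {3, 1, 3, 7, 1, 3};    // stand-in for the file's contents
        Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
        for (int n : numbers) {
            counts.merge(n, 1, Integer::sum);  // insert 1, or add 1 to the existing count
        }
        System.out.println(counts);            // e.g. {1=2, 3=3, 7=1}
    }
}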

Also, if this is something that only needs to run once (or only occasionally), then don't bother optimizing. It will be fast enough. Only bother if it's something that's going to run repeatedly within an application.

Daniel Straight
I need to make it as efficient as possible. And it will be running as part of a bigger application.
+6  A: 

Edit #2:

Okay, I screwed up my own first rule--never optimize prematurely. The worst case for this is probably using a stock HashMap with a wide range--so I just did that. It still runs in like a second, so forget everything else here and just do that.

And I'll make ANOTHER note to myself to ALWAYS test speed before worrying about tricky implementations.

(Below is the older, now-obsolete post, which could still be valid if someone had MANY more points than a million.)

A HashMap would work, but if your integers have a reasonable range (say, 1-1000), it would be more efficient to create an array of 1000 integers and, for each of your million integers, increment that element of the array. (Pretty much the same idea as a HashMap, but optimizing out a few of the unknowns that a hash has to make allowances for should make it a few times faster.)
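
A minimal sketch of that array-counting idea, assuming every value falls in a small known range (0-999 here):

public class ArrayCount {
    public static void main(String[] args) {
        int[] numbers = {5, 42, 5, 7, 42, 5};  // stand-in for the file's contents
        int[] counts = new int[1000];          // assumes all values are in [0, 1000)
        for (int n : numbers) {
            counts[n]++;                       // one plain array increment per value
        }
        System.out.println("5 occurs " + counts[5] + " times");
    }
}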

You could also create a tree. Each node in the tree would contain (value, count) and the tree would be organized by value (lower values on the left, higher on the right). Traverse to your node, if it doesn't exist--insert it--if it does, then just increment the count.
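
A sketch of such a node and the insert-or-increment walk (hand-rolled and unbalanced, unlike the red-black tree behind the TreeMap mentioned in the comment below):

class Node {
    final int value;
    int count = 1;           // this node's occurrence count
    Node left, right;
    Node(int value) { this.value = value; }
}

class CountTree {
    Node root;

    // Walk down by value; increment the count if found, insert a new node otherwise.
    void add(int value) {
        if (root == null) { root = new Node(value); return; }
        Node cur = root;
        while (true) {
            if (value == cur.value) { cur.count++; return; }
            Node next = (value < cur.value) ? cur.left : cur.right;
            if (next == null) {
                if (value < cur.value) cur.left = new Node(value);
                else cur.right = new Node(value);
                return;
            }
            cur = next;
        }
    }
}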

The range and distribution of your values would determine which of these two (or a regular hash) would perform better. I think a regular hash wouldn't have many "winning" cases, though (it would have to be a wide range and "grouped" data, and even then the tree might win).

Since this is pretty trivial, I recommend you implement more than one solution and test speeds against the actual data set.

Edit: RE the comment

TreeMap would work, but would still add a layer of indirection (and it's so amazingly easy and fun to implement yourself). If you use the stock implementation, you have to use Integers and convert constantly to and from int for every increase. There is the indirection of the pointer to the Integer, and the fact that you are storing at least 2x as many objects. This doesn't even count any overhead for the method calls since they should be inlined with any luck.

Normally this would be an optimization (evil), but when you start to get near hundreds of thousands of nodes, you occasionally have to ensure efficiency, so the built-in TreeMap is going to be inefficient for the same reasons the built-in HashMap is.

Bill K
There is no need to implement a tree from scratch, because Java already has java.util.TreeMap, which uses red-black trees.
maykeye
+2  A: 

Why use a hashtable? Just use an array that is the same size as the range of your numbers. Then you don't waste time executing the hashing function. Then sort the values after you're done. O(N log N)

gshauger
Excessively large numbers might make this inefficient
Joe Philllips
A: 

This is the source for java.lang.Integer.hashCode(), which is the hashing function that will be used if you store your entries as a HashMap<Integer, Integer>:

public int hashCode() {
    return value;
}

So in other words, the (default) hash value of a java.lang.Integer is the integer itself.

What is more efficient than that?

matt b
A: 
  1. Allocate an array / vector of the same size as the number of input items you have
  2. Fill the array from your file with numbers, one number per element
  3. Put the list in order
  4. Iterate through the list and keep track of the top 10 runs of numbers that you have encountered (a basic version is sketched just after this list).
  5. Output the top ten runs at the end.
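
A minimal sketch of steps 1-5, tracking only the single longest run for brevity (a real version would keep the top 10, and the sampling refinement below is omitted):

import java.util.Arrays;

public class SortAndScan {
    public static void main(String[] args) {
        int[] numbers = {7, 3, 7, 1, 3, 7};  // stand-in for the file's contents
        Arrays.sort(numbers);                // equal values now sit in adjacent runs
        int bestValue = numbers[0], bestRun = 0;
        int i = 0;
        while (i < numbers.length) {
            int j = i;
            while (j < numbers.length && numbers[j] == numbers[i]) j++;  // find end of this run
            if (j - i > bestRun) { bestRun = j - i; bestValue = numbers[i]; }
            i = j;
        }
        System.out.println(bestValue + " occurs " + bestRun + " times");
    }
}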

As a refinement on step 4, you only need to step forward through the array in steps equivalent to your 10th-longest run. Any run longer than that will overlap with your sampling. If the tenth-longest run is 100 elements long, you only need to sample elements 100, 200, 300, and so on, and at each point count the run of the integer you find there (both forwards and backwards). Any run longer than your 10th-longest is sure to overlap with your sampling.

You should apply this optimisation only once your 10th-longest run is very long compared to other runs in the array.

A map is overkill for this question unless you have very few unique numbers each with a large number of repeats.

NB: Similar to gshauger's answer but fleshed out

Tom Leys
A: 

If you have to make it as efficient as possible, use an array of ints, with the position representing the value and the content representing the count. That way you avoid autoboxing and unboxing, the most likely killer of a standard Java collection.

If the range of numbers is too large then take a look at PJC and its IntKeyIntMap implementations. It will avoid the autoboxing as well. I don't know if it will be fast enough for you, though.

Yishai
A: 

The correct way to do it is with a linked list. When you insert an element, you go down the linked list; if it's there, you increment the node's count, otherwise you create a new node with a count of 1. After you have inserted every element, you would have a sorted list of elements in O(n*log(n)).

For your methods, you are doing n inserts and then sorting in O(n*log(n)), so your coefficient on the complexity is higher.

twolfe18
You'd have to traverse potentially the entire list each time you looked up a value, unless you knew the input was sorted.
Shin
You are suggesting what is essentially an insertion sort which is O(n^2). I don't know where you get the log from, but you usually need a 'divide and conquer' approach to have a logarithmic execution time.
Dolphin
Well, I put the `log(n)` in there because I assumed that the distribution of numbers is pretty skewed, but you are correct: in the worst case it is `O(n^2)`. If the distribution of numbers is REALLY skewed, you can even do better than `O(n*log(n))`.
twolfe18
Using a linked list to store the counts has a worst-case complexity of roughly O(n^2). Best case would be O(n). This is strictly worse than either the hash table or tree suggestions.
Mark Bessey
You just said it best case O(n) (which is correct). O(n) is better than O(n*log(n)) (complexity of other solutions), so it should be considered and is not strictly worse.
twolfe18
A: 

If the range of numbers is small (e.g. 0-1000), use an array. Otherwise, use a HashMap<Integer, int[]>, where the values are all length-1 arrays. It should be much faster to increment a value in an array of primitives than to create a new Integer each time you want to increment a value. You're still creating Integer objects for the keys, but that's hard to avoid. It's not feasible to create an array of 2^31-1 ints, after all.
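
A sketch of that length-1-array trick; the array is effectively a mutable int box, so each increment mutates in place instead of boxing a fresh Integer:

import java.util.HashMap;
import java.util.Map;

public class MutableCount {
    public static void main(String[] args) {
        int[] numbers = {9, 4, 9, 9};          // stand-in for the file's contents
        Map<Integer, int[]> counts = new HashMap<Integer, int[]>();
        for (int n : numbers) {
            int[] box = counts.get(n);
            if (box == null) {
                counts.put(n, new int[] {1});  // first sighting of this value
            } else {
                box[0]++;                      // in-place increment, no new objects
            }
        }
        System.out.println("9 occurs " + counts.get(9)[0] + " times");
    }
}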

If all of the input is normalized so you don't have values like 01 instead of 1, use Strings as keys in the map so you don't have to create Integer keys.

Adam Crume
+3  A: 

HashMap

A million integers is not really a lot, even for interpreted languages, let alone for a speedy language like Java. You'll probably barely even notice the execution time. I'd try this first and move to something more complicated only if you find it too slow.

It will probably take longer to do string splitting and parsing to convert to integers than even the simplest algorithm to find frequencies using a HashMap.

Shin
A: 

Use a HashMap to build your dataset (value-count pairs) in memory as you traverse the file. The HashMap should give you close to O(1) access to the elements while you create the dataset (technically, in the worst case HashMap is O(n)). Once you are done searching the file, use Collections.sort() on the value Collection returned by HashMap.values() to create a sorted list of value-count pairs. Collections.sort() is guaranteed O(n log n). For example:

// Needed imports (both members below live in some enclosing class):
import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Scanner;

public static class Count implements Comparable<Count> {
    int value;
    int count;

    public Count(int value) {
        this.value = value;
        this.count = 1;
    }

    public void increment() {
        count++;
    }

    public int compareTo(Count other) {
        return other.count - count; // descending by count
    }
}

public static void main(String[] args) throws Exception {
    Scanner input = new Scanner(new FileInputStream(new File("...")));
    HashMap<Integer, Count> dataset = new HashMap<Integer, Count>();
    while (input.hasNextInt()) {
        int tempInt = input.nextInt();
        Count tempCount = dataset.get(tempInt);
        if (tempCount != null) {
            tempCount.increment();
        } else {
            dataset.put(tempInt, new Count(tempInt));
        }
    }

    List<Count> counts = new ArrayList<Count>(dataset.values());
    Collections.sort(counts);
    // Print the 10 most frequent values (the sort above is descending by count).
    for (int i = 0; i < 10 && i < counts.size(); i++) {
        Count c = counts.get(i);
        System.out.println(c.value + " occurs " + c.count + " times");
    }
}

Tim Bender
A: 

Actually, there is an O(n) algorithm for doing exactly what you want to do. Your use case is similar to an LFU cache, where an element's access count determines whether it stays in the cache or is evicted from it.

http://dhruvbird.blogspot.com/2009/11/o1-approach-to-lfu-page-replacement.html
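
For just the top 10 (as opposed to a full LFU structure), one common O(n) route is a count map plus a bounded min-heap. A sketch of that alternative, not the approach from the linked post (Map.merge needs Java 8+):

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class TopTen {
    public static void main(String[] args) {
        int[] numbers = {2, 9, 2, 4, 9, 2};  // stand-in for the file's contents
        Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
        for (int n : numbers) {
            counts.merge(n, 1, Integer::sum);
        }
        // Min-heap on count, capped at 10 entries, so each offer/poll is O(log 10).
        PriorityQueue<Map.Entry<Integer, Integer>> heap =
                new PriorityQueue<>((a, b) -> a.getValue() - b.getValue());
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > 10) heap.poll();  // drop the least frequent of the 11
        }
        while (!heap.isEmpty()) {               // prints the 10th most frequent first
            Map.Entry<Integer, Integer> e = heap.poll();
            System.out.println(e.getKey() + " occurs " + e.getValue() + " times");
        }
    }
}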

dhruvbird