tags:

views:

518

answers:

8

I'm writing a Java program that parses all the words from a text file and then adds them to a HashMap. I need to count how many distinct words are contained in the file. I also need to figure out the highest counted words. The HashMap is comprised of each word mapped to an integer which represents how many times the word occurs.

Is there something like HashMap that will help me sort this?

+1  A: 

It looks like the TreeBag class from the commons collections library might do what you want. It keeps track of how many copies of an object are added to the bag, and sorts them in ascending order of count. To get the highest count item just call the last() method. One thing to be aware of is that the commons collections stuff hasn't been updated to using generics yet, so you might get a ton of compiler warnings using it.

Orclev
Or you can search for some special map in Google Collections, which use generics.
True Soft
Re-read the documentation. I believe the Bag would still be sorted by the "key" or words in this case. Not the count. Can you quote the documentation that suggests otherwise?
z5h
You might be right, I interpreted the description of the last method to mean it returned the item with the largest count, but given the context of the selectable comparator it might simply mean the one with the largest natural order.
Orclev
+4  A: 

The Manual way to do it is as follows:

  • Create a composite WordCount class with word and count fields.
  • Create a Comparator for that class that sorts by count.
  • When you're done filling your HashMap, create a new List of WordCount objects created from values in the HashMap.
  • Sort the List using your comparator.
z5h
Exactly what I had in mind too.
Esko
+3  A: 

You could use a HashMultiset from google-collections:

import com.google.common.collect.*;
import com.google.common.collect.Multiset.Entry;

...

  final Multiset<String> words = HashMultiset.create();
  words.addAll(...);

  Ordering<Entry<String>> byIncreasingCount = new Ordering<Entry<String>>() {
    @Override public int compare(Entry<String> a, Entry<String> b) {
      // safe because count is never negative
      return left.getCount() - right.getCount();
    }
  });

  Entry<String> maxEntry = byIncreasingCount.max(words.entrySet())
  return maxEntry.getElement();

EDIT: oops, I thought you wanted only the single most common word. But it sounds like you want the several most common -- so, you could replace max with sortedCopy and now you have a list of all the entries in order.

To find the number of distinct words: words.elementSet().size()

Kevin Bourrillion
+1: For Google Collections!
Jim Ferrans
A: 
  • YourBean implements Comparable<YourBean>
  • method compareTo : order by number of word
  • TreeMap instead of hashmap
Maxime ARNSTAMM
this answer is very incomplete...
Kevin Bourrillion
A: 

For the count, stuff the words in a Set and count the size when done.

For the highest, iterate over all Entries and hold on to the key with the highest value.

Thorbjørn Ravn Andersen
+2  A: 

If you want to sort the Map by word then TreeMap is the Java built-in answer. You can either make sure your Word objects are Comparable or supply a custom Comparator.

SortedMap<Word,Integer> map = new TreeMap<Word,Integer>();
...
for all words {
    Integer count = map.get(word);
    if (count == null ) count = 0;
    map.put(word, count+1);
}

If you want to sort by frequency then you will be better off doing this after all of the words have been counted. Sorted collections don't take kindly to having their ordering messed up through external changes. Sorting by frequency requires a composite word + count object as others have posted.

PSpeed
A: 

Have you checked out java.util.PriorityQueue? A PriorityQueue is basically a list with a priority mapped to each element (implemented by a non-synchronized priority heap). Every time you read in a new string you could add it in or increase it's priority by 1 if it is already present (logarithmic time). The is present check is in linear time, and in the end this will be really easy to use. To get the numbers that appear with the most frequency just poll() for each one when you are done!

edit The standard PriorityQueue doesn't allow you to edit priority directly since it requires a comparator. You'd be better off with a simple Hash implementation or something like this

dhackner
A: 

Here is a Groovy version of the most popular answer to this question:

List leastCommon(Multiset myMultiset, Integer quantity)
{

    Ordering<Multiset.Entry<String>> byIncreasingCount = new Ordering<Multiset.Entry<String>>() {
      @Override public int compare(Multiset.Entry<String> a, Multiset.Entry<String> b) {
          return a.getCount() - b.getCount() }
    }

    maxIndex = Math.min(quantity, myMultiset.entrySet().size() - 1)
    return byIncreasingCount.sortedCopy(myMultiset.entrySet()).subList(0, maxIndex)

}

List mostCommon(Multiset myMultiset, Integer quantity)
{

    Ordering<Multiset.Entry<String>> byDecreasingCount = new Ordering<Multiset.Entry<String>>() {
      @Override public int compare(Multiset.Entry<String> a, Multiset.Entry<String> b) {
          return b.getCount() - a.getCount() }
    }

    maxIndex = Math.min(quantity, myMultiset.entrySet().size() - 1)
    return byDecreasingCount.sortedCopy(myMultiset.entrySet()).subList(0, maxIndex)

}
Rick Otten