views:

11848

answers:

9

Description | A Java program to read a text file and print each of the unique words in alphabetical order together with the number of times the word occurs in the text.

The program should declare a variable of type Map<String, Integer> to store the words and corresponding frequency of occurrence. Which concrete type, though? TreeMap<String, Number> or HashMap<String, Number> ?

The input should be converted to lower case.

A word does not contain any of these characters: \t\t\n]f.,!?:;\"()'

Example output |

 Word            Frequency
  a                 1
  and               5
  appearances       1
  as                1
         .
         .
         .

Remark | I know, I've seen elegant solutions to this in Perl with roughly two lines of code. However, I want to see it in Java.

Edit: Oh yeah, it be helpful to show an implementation using one of these structures (in Java).

+18  A: 

TreeMap seems a no-brainer to me - simply because of the "in alphabetical order" requirement. HashMap has no ordering when you iterate through it; TreeMap iterates in the natural key order.

EDIT: I think Konrad's comment may have been suggesting "use HashMap, then sort." This is good because although we'll have N iterations initially, we'll have K <= N keys by the end due to duplicates. We might as well save the expensive bit (sorting) until the end when we've got fewer keys than take the small-but-non-constant hit of keeping it sorted as we go.

Having said that, I'm sticking to my answer for the moment: because it's the simplest way of achieving the goal. We don't really know that the OP is particularly worried about performance, but the question implies that he's concerned about the elegance and brevity. Using a TreeMap makes this incredibly brief, which appeals to me. I suspect that if performance is really an issue, there may be a better way of attacking it than either TreeMap or HashMap :)

Jon Skeet
Jon, don't overestimate sorting. In my experience, sorting even gigantic data sets can be neglected. O(1) insertion beats O(logn) hands down, given enough items. But as always, this is pure speculation without profiling.
Konrad Rudolph
If you can achieve O(1) insertion, then you've got O(n) total insertion time. If you're also suggesting there's no more work involved before you iterate through, you've somehow achieved O(n) sorting of n (relatively arbitrary) items. How does that work?
Jon Skeet
Konrad: I think I now see your point. Editing...
Jon Skeet
+13  A: 

TreeMap beats HashMap because TreeMap is already sorted for you.

However, you might want to consider using a more appropriate data structure, a bag. See Commons Collections - and the TreeBag class:

This has a nice optimised internal structure and API:

bag.add("big")
bag.add("small")
bag.add("big")
int count = bag.getCount("big")

EDIT: The question of HashMap vs TreeMap performance was answered by Jon - HashMap and sort may be quicker (try it!), but TreeBag is easier. The same is true for bags. There is a HashBag as well as a TreeBag. Based on the implementation (uses a mutable integer) a bag should outperform the equivalent plain map of Integer. The only way to know for sure is to test, as with any performance question.

JodaStephen
Yes, but would it be faster to sort once at the end instead of incurring the cost of keeping things sorted the whole time?
Herms
A: 

Depending on what the speed requirements are, you could also use a Trie. But there's no point in implementing one of those if a TreeMap is quick enough.

CAdaker
A: 

Why not use TreeSet?

Same ordering concept as a TreeMap, except it's a Set - which, by definition, is "A collection that contains no duplicate elements".

From your problem description, it sounds as if you need a Set, I don't see what keys and values you are mapping together.

This class implements the Set interface, backed by a TreeMap instance. This class guarantees that the sorted set will be in ascending element order, sorted according to the natural order of the elements (see Comparable), or by the comparator provided at set creation time, depending on which constructor is used.

matt b
The word maps to the count.
erickson
+1  A: 

You can't assign a TreeMap<String,Number> to a variable with the type Map<String,Integer>. Double, Long, etc. can be "put" into a TreeMap<String,Number>. When I "get" a value from a Map<String,Integer>, it must be an Integer.

Completely ignoring any i18n issues, memory constraints, and error handling, here goes:

class Counter {

  public static void main(String... argv)
    throws Exception
  {
    FileChannel fc = new FileInputStream(argv[0]).getChannel();
    ByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
    CharBuffer cb = Charset.defaultCharset().decode(bb);
    Pattern p = Pattern.compile("[^ \t\r\n\f.,!?:;\"()']+");
    Map<String, Integer> counts = new TreeMap<String, Integer>();
    Matcher m = p.matcher(cb);
    while (m.find()) {
      String word = m.group();
      Integer count = counts.get(word);
      count = (count == null) ? 1 : count + 1;
      counts.put(word, count);
    }
    fc.close();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      System.out.printf("%s: %d%n", e.getKey(), e.getValue());
    }
  }

}
erickson
Yes, you're right--I'll go back and edit that. What I mean to say is declare it as type Map<String,Integer> instead of Map<String,Number>, then TreeMap<String,Integer> is a subtype of Map<String,Integer>.
jJack
+1  A: 

Hash map should be much faster. You should not choose a container based on how you want the items to be arranged eventually; Just sort the list of (word, frequency)-pairs at the end. There will usually be less such pairs to be sorted than words in the files, so asymptotic (and real) performance with a hash map will be better.

Not necessarily. In the worst case, hash tables have O(n) access time. Whereas self-balancing binary search trees are always O(log n) even in worst case.
newacct
+1  A: 

I would definitely choose a TreeMap:

  • TreeMap automatically sorts new keys on insertion, no sorting afterwards is needed.
  • When a key already exists it has the same performance as a HashMap.

A TreeSet internally uses a TreeMap so why not use TreeMap directly.

A: 

"When a key already exists it has the same performance as a HashMap." - That is just plain wrong. HashMap has O(1) insertion and TreeMap O(n log n). It'll take at least n log n checks to find out if it's in the table!

coderz
A: 

consider the frequency of addition or deletion to the data structure. TreeMap would not be ideal if it is high. Apart from the search for existing entry nLn it also undergoes frequent rebalancing.

on the other hand Hash structures are bit flamboyant on memory (over allocates). If you can bite that bullet then go for hash structure and sort when required.

G Kumar