views:

5746

answers:

17

I hope this question is not considered too basic for this forum, but we'll see. I'm wondering how to refactor some code for better performance that is getting run a bunch of times.

Say I'm creating a word frequency list, using a Map (probably a HashMap), where each key is a String with the word that's being counted and the value is an Integer that's incremented each time a token of the word is found.

In Perl, incrementing such a value would be trivially easy:

$map{$word}++;

But in Java, it's much more complicated. Here the way I'm currently doing it:

int count = map.containsKey(word) ? map.get(word) : 0;
map.put(word, count + 1);

Which of course relies on the autoboxing feature in the newer Java versions. I wonder if you can suggest a more efficient way of incrementing such a value. Are there even good performance reasons for eschewing the Collections framework and using a something else instead?

Update: I've done a test of several of the answers. See below.

+1  A: 

I think your solution would be the standard way, but - as you noted yourself - it is probably not the fastest way possible.

You may look at GNU Trove. That is a library which contains all sorts of fast primitive Collections. Your example would use a TObjectIntHashMap which has a method adjustOrPutValue which does exactly what you want.

jrudolph
A: 

The various primitive wrappers, e.g., Integer are immutable so there's really not a more concise way to do what you're asking unless you can do it with something like AtomicLong. I can give that a go in a minute and update. BTW, Hashtable is a part of the Collections Framework.

Hank Gay
A: 

Hi,

There are a couple of approaches:

  1. Use a Bag alorithm like the sets contained in Google Collections.

  2. Create mutable container which you can use in the Map:


    class My{
        String word;
        int count;
    }

And use put("word", new My("Word") ); Then you can check if it exists and increment when adding.

Avoid rolling your own solution using lists, because if you get innerloop searching and sorting, your performance will stink. The first HashMap solution is actually quite fast, but a proper like that found in Google Collections is probably better.

Counting words using Google Collections, looks something like this:



    HashMultiset s = new HashMultiset();
    s.add("word");
    s.add("word");
    System.out.println(""+s.count("word") );


Using the HashMultiset is quite elegent, because a bag-algorithm is just what you need when counting words.

tovare
A: 

jrudolph, thanks for the pointer to GNU Trove. I'll take a look at it.

Hank, my bad; I recognize that Hashtable was retrofitted to be part of the Collections Framework. I suppose the correct question is which implementation of Map to use. Since I don't need synchronization, maybe HashMap is the best choice.

gregory
A: 

The ObjectKeyIntMap From pcj might be what you're after.

If the map has no value for a key, the default is used, so you could say

map.put( key, map.get(key) + 1 );

I think Trove is a better bet though.

daveb
+9  A: 

@Hank Gay

As a follow-up to my own (rather useless) comment: Trove looks like the way to go. If, for whatever reason, you wanted to stick with the standard JDK, ConcurrentMap and AtomicLong can make the code a tiny bit nicer, though YMMV.

    final ConcurrentMap<String, AtomicLong> map = new ConcurrentHashMap<String, AtomicLong>();
    map.putIfAbsent("foo", new AtomicLong(0));
    map.get("foo").incrementAndGet();

will leave 1 as the value in the map for foo. Realistically, increased friendliness to threading is all that this approach has to recommend it.

Hank Gay
+7  A: 

Another way would be creating a mutable integer:

class MutableInt {
  int value = 0;
  public void inc () { ++value; }
  public int get () { return value; }
}
...
Map<String,MutableInt> map = new HashMap<String,MutableInt> ();
MutableInt value = map.get (key);
if (value == null) {
  value = new MutableInt ();
  map.put (key, value);
} else {
  value.inc ();
}

of course this implies creating an additional object but the overhead in comparison to creating an Integer (even with Integer.valueOf) should not be so much.

You don't want to start the MutableInt off at 1 the first time you put it in the map?
Tom Hawtin - tackline
+1  A: 

Instead of calling containsKey() it is faster just to call map.get and check if the returned value is null or not.

 Integer count = map.get(word);
 if(count == null){
  count = 0;
 }
 map.put(word, count + 1);
Glever
+7  A: 
Aleksandar Dimitrov
A: 

I'd use Apache Collections Lazy Map (to initialize values to 0) and use MutableIntegers from Apache Lang as values in that map.

Biggest cost is having to serach the map twice in your method. In mine you have to do it just once. Just get the value (it will get initialized if absent) and increment it.

jb
+1  A: 

Are you sure that this is a bottleneck? Have you done any performance analysis?

Try using the NetBeans profiler (its free and built into NB 6.1) to look at hotspots.

Finally, a JVM upgrade (say from 1.5->1.6) is often a cheap performance booster. Even an upgrade in build number can provide good performance boosts. If you are running on Windows and this is a server class application, use -server on the command line to use the Server Hotspot JVM. On Linux and Solaris machines this is autodetected.

A: 

Use AtomicLong and LazyMap (from Jakarta Commons Collections) to automatically create instances of it:

    Map map = LazyMap.decorate(new HashMap(), new Factory() {
        public Object create() {
            return new AtomicLong();
        }
    });

    AtomicLong counter = (AtomicLong) map.get("myword");
    System.out.println(counter.incrementAndGet());
Vilmantas Baranauskas
Seconding Alex Miller's comment about thread-safety. If thread-safety were important, you'd be far better off using the concurrency built-ins provided by the JDK, e.g., AtomicLong.
Hank Gay
A: 

@Vilmantas Baranauskas: Regarding this answer, I would comment if I had the rep points, but I don't. I wanted to note that the Counter class defined there is NOT thread-safe as it is not sufficient to just synchronize inc() without synchronizing value(). Other threads calling value() are not guaranteed to see the the value unless a happens-before relationship has been established with the update.

Alex Miller
If you want to reference someone's answer, use @[Username] at the top, e.g., @Vilmantas Baranauskas<Content goes here>
Hank Gay
I made that modification to clean it up.
Alex Miller
+1  A: 

Memory rotation may be an issue here, since every boxing of an int larger than or equal to 128 causes an object allocation (see Integer.valueOf(int)). Although the garbage collector very efficiently deals with short-lived objects, performance will suffer to some degree.

If you know that the number of increments made will largely outnumber the number of keys (=words in this case), consider using an int holder instead. Phax already presented code for this. Here it is again, with two changes (holder class made static and initial value set to 1):

static class MutableInt {
  int value = 1;
  void inc() { ++value; }
  int get() { return value; }
}
...
Map<String,MutableInt> map = new HashMap<String,MutableInt>();
MutableInt value = map.get(key);
if (value == null) {
  value = new MutableInt();
  map.put(key, value);
} else {
  value.inc();
}

If you need extreme performance, look for a Map implementation which is directly tailored towards primitive value types. jrudolph mentioned GNU Trove.

By the way, a good search term for this subject is "histogram".

volley
+4  A: 

It's always a good idea to look at the Google Collections Library for this kind of thing. In this case a Multiset will do the trick:

Multiset bag = Multisets.newHashMultiset();
String word = "foo";
bag.add(word);
bag.add(word);
System.out.println(bag.count(word)); // Prints 2

There are Map-like methods for iterating over keys/entries, etc. Internally the implementation currently uses a HashMap<E, AtomicInteger>, so you will not incur boxing costs.

Chris Nokleberg
If only Sets had a get method....sigh.
Michael Deardeuff
+22  A: 
gregory
Great work, well done. A minor comment - the putIfAbsent() call in the AtomicLong code will instantiate a new AtomicLong(0) even if one is already in the map. If you tweak this to use if (map.get(key) == null) instead, you will probably get an improvement in those test results.
Leigh Caldwell
Thanks for pointing that out, Leigh!
gregory
I did the same thing recently with an approach similar to MutableInt. I'm glad to hear it was the optimal solution (I just assumed it was, without doing any tests).
Kip
Nice to hear that you're faster than me, Kip. ;-) Let me know if you discover any drawbacks to that approach.
gregory
I wanted to add that the increment function in Trove is really just a convenient function. You should look into the map.apply(procedure) methods in trove too. The basic problem is autoboxing and the double lookup required, where the apply method comes from a functional programming (e.g. lisp) design
Josh
Although the MutableInt increment method is very fast, it it not thread-safe in general since the ++ operator is not atomic.
Apocalisp
That's a very good point, Apocalisp. Thanks.
gregory
A: 

The Functional Java library's TreeMap datastructure has an update method in the latest trunk head:

public TreeMap<K, V> update(final K k, final F<V, V> f)

Example usage:

import static fj.data.TreeMap.empty;
import static fj.function.Integers.add;
import static fj.pre.Ord.stringOrd;
import fj.data.TreeMap;

public class TreeMap_Update
  {public static void main(String[] a)
    {TreeMap<String, Integer> map = empty(stringOrd);
     map = map.set("foo", 1);
     map = map.update("foo", add.f(1));
     System.out.println(map.get("foo").some());}}

This program prints "2".

Apocalisp