views: 350
answers: 9

Hello,

While profiling a Java application that computes hierarchical clustering of thousands of elements, I realized that ArrayList.get accounts for about half of the CPU time spent in the clustering part of the execution.

The algorithm searches for the two most similar elements (so it is O(n*(n+1)/2) ); here's the pseudocode:

float currentMax = 0.0f
for (int i = 0 to n)
  for (int j = i to n)
    get contents i and j
    if their similarity > currentMax
      update currentMax

merge the two clusters

So effectively there are a lot of ArrayList.get calls involved.

Is there a faster way? I thought that since ArrayList is a linear array of references it should be the quickest way, and maybe I can't do anything since there are simply too many gets.. but maybe I'm wrong. I don't think a HashMap would work, since I need to get them all on every iteration and map.values() should be backed by an ArrayList anyway..

Otherwise, should I try other collection libraries that are more optimized, like Google's or Apache's?

EDIT:

you are somewhat confirming my doubts :(

Would I obtain a performance boost by trying to parallelize things? Maybe using a pool of executors that calculate the similarity of more than one pair at a time.. but I don't know whether synchronization and locks on the data structures would end up slowing it down.

Similarity is calculated using the dot product of the tag maps of the two contents (the maps are two HashMap&lt;Tag, Float&gt;). In addition, I already cache similarities in a TLongFloatHashMap (from Trove collections) to avoid recalculating them in later iterations. The long key is computed from the hashcodes of both contents and is unique for the pair, so that hash(c1, c2) == hash(c2, c1). Everything else is already tuned up enough.

EDIT2:

I'll post a little bit of the code to let you understand better.. This is used to compute the key under which the similarity between two elements is cached:

private long computeKey(int h1, int h2) {
    if (h1 < h2) {       // order the hashes so the key is symmetric:
        int swap = h1;   // computeKey(a, b) == computeKey(b, a)
        h1 = h2;
        h2 = swap;
    }
    // Mask h2 to avoid sign extension: a negative h2 would otherwise
    // smear 1-bits over the upper 32 bits and collide with other keys.
    return ((long) h1) << 32 | (h2 & 0xFFFFFFFFL);
}

This is how correlation is calculated:

float correlation(Map<Tag, Float> map1, Map<Tag, Float> map2,
                  HierarchNode n1, HierarchNode n2) {
    long key = computeKey(n1.hashCode, n2.hashCode);

    if (cache.contains(key)) {
        ++hitCounter;
        return cache.get(key);
    }
    else {
        float corr = 0.0f;

        // Iterate over the smaller map and probe the larger one.
        Set<Map.Entry<Tag, Float>> entries;
        Map<Tag, Float> curMap;

        if (map1.size() < map2.size()) {
            entries = map1.entrySet();
            curMap = map2;
        }
        else {
            entries = map2.entrySet();
            curMap = map1;
        }

        // Sparse dot product: only tags present in both maps contribute.
        for (Map.Entry<Tag, Float> ee : entries) {
            Float f2 = curMap.get(ee.getKey());

            if (f2 != null)
                corr += ee.getValue() * f2;
        }

        cache.put(key, corr);
        return corr;
    }
}

And this is how the algorithm scans the content:

for (int j = 0; j < clusters.size(); ++j) {
    skip = false;

    for (int k = j + 1; k < clusters.size(); ++k) {
        float r = correlation(clusters.get(k).tags, clusters.get(j).tags,
                              clusters.get(k), clusters.get(j));

        if (r > max) {
            max = r;
            i1 = j;
            i2 = k;
        }

        if (max == 1.0f) {
            skip = true;
            break;
        }
    }

    if (skip)
        break;
}

I would have used just a matrix to store all the values, but on every iteration the two most similar items are removed from the list and a new item is added (which has a new tag map derived from the two chosen ones).

+2  A: 

The algorithm you have is O(n²). Unless you have a way to make your algorithm do something significantly better than doing pairwise comparisons, performance is unlikely to appreciably improve. :-(

Chris Jester-Young
+2  A: 

At the risk of stating the obvious, you might get some speed up by using this pseudo-code:

float currentMax = 0.0f
for (int i = 0 to n)
  get content i
  for (int j = i to n)
    get content j
    if their similarity > currentMax
      update currentMax

merge the two clusters

It's still O(n²), though. If you need to compare every element to every other element to find out which pair is closest, you can't beat O(n²).

That said, if you're calling this multiple times, then there's optimization to be found in caching these results in a sortable map.
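
For illustration, here's a minimal sketch of that caching idea (the Pair and BestPairQueue names are mine, not from the question): compute each similarity once, keep all pairs in a max-heap, and lazily discard entries whose clusters have since been merged away.

import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch: repeatedly extract the most similar still-live pair
// without rescanning all O(n^2) combinations on every merge.
class Pair {
    final int i, j;      // indices of the two clusters
    final float sim;     // their cached similarity

    Pair(int i, int j, float sim) { this.i = i; this.j = j; this.sim = sim; }
}

class BestPairQueue {
    // Max-heap: highest similarity first.
    private final PriorityQueue<Pair> heap = new PriorityQueue<Pair>(16,
            new Comparator<Pair>() {
                public int compare(Pair a, Pair b) { return Float.compare(b.sim, a.sim); }
            });
    private final boolean[] alive; // alive[i] == false once cluster i is merged away

    BestPairQueue(int n) {
        alive = new boolean[n];
        Arrays.fill(alive, true);
    }

    void add(int i, int j, float sim) { heap.add(new Pair(i, j, sim)); }

    void kill(int i) { alive[i] = false; } // call when cluster i disappears in a merge

    // Pop until we find a pair whose clusters are both still live.
    Pair pollBest() {
        Pair p;
        while ((p = heap.poll()) != null)
            if (alive[p.i] && alive[p.j]) return p;
        return null;
    }
}

Each merge then costs one round of new insertions (the merged cluster against the survivors) plus some heap pops, instead of a full O(n²) rescan.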

EDIT: If the similarity is something that is rather simple (e.g., a 1-dimensional value such as height), you might be able to sort the items in the array first, such that element[0] is most similar to element[1] which is most similar to element[0] or element[2]. In that case, you can get a speed up to O(n lg n).
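
To make that 1-D case concrete, here's a hedged sketch (assuming "similarity" simply means closeness of a single numeric value): after sorting, the most similar pair must be adjacent, so one linear scan finds it.

import java.util.Arrays;

class OneDimClosest {
    // Sketch: closest pair among 1-D values in O(n log n).
    // After sorting, the most similar (closest) pair must be adjacent.
    static int[] closestPair(float[] values) {
        float[] sorted = values.clone();
        Arrays.sort(sorted);                          // O(n log n)

        int best = 0;
        float bestGap = Float.MAX_VALUE;
        for (int i = 0; i + 1 < sorted.length; i++) { // one O(n) pass
            float gap = sorted[i + 1] - sorted[i];
            if (gap < bestGap) { bestGap = gap; best = i; }
        }
        return new int[] { best, best + 1 };          // indices into sorted[]
    }
}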

EDIT2: Given your correlation code, your benchmark results are very suspect. I cannot imagine a situation in which those two gets take more time than the call to the correlation code (even assuming the cache is hit the vast majority of the time), which is also called O(n²) times. Also, spong makes a very good point about converting these to arrays first if get() is the bottleneck.

Ben Hocking
If the JIT is halfway decent (something I take for granted when I code in Java), it'd already have done this optimisation for you. Still, good to mention. :-) +1
Chris Jester-Young
You _would_ think it would, but while running benchmarks, I always like to test my assumptions!
Ben Hocking
Totally agree with testing any performance factors empirically. :-)
Chris Jester-Young
I like your edit. Too bad I can't upvote you twice. :-P
Chris Jester-Young
This is good, assuming the pseudocode matches the actual code. Note that `someArrayList.get(index)` is quite a lot slower than using `someArray[index]`, so it may be worth considering a `toArray(...)` before looping.
spong
Thanks, Chris. I totally agree with your JIT comments in general, by the way. Until you've benchmarked, it's fine to assume that the JIT will do most optimizing for you.
Ben Hocking
I eat my words about `get()` being as fast as array access (which is different from saying that lists are slower than arrays: they're not, just that `get()` is slow for some reason; iterator access is fast).
Chris Jester-Young
A: 

There are not many complex operations in your code above: mainly simple number reads/checks/writes, which are amazingly fast.

The issue is that .get() is a function call - it will be much slower than a simple +, = or <= operation. If it's too slow for you, you should start using real arrays or (as the others have said) optimize your algorithm first.

Marcel J.
No, `get()` is a really simple function that gets inlined by any halfway-decent JIT compiler, and doesn't perform differently to using an array directly.
Chris Jester-Young
I agree that the JIT compiler should be able to solve this. But if it did in this case, the profiler (which might itself be to blame) shouldn't report that `.get()` (a simple array read) is the hotspot.
Marcel J.
I don't think that the profiler considered `get` the bottleneck per se, just that most of the runtime went there because it gets called so many times. Death by a thousand cuts and all that. :-)
Chris Jester-Young
To add another chapter to this story, spong posted a very convincing benchmark that shows that `ArrayList.get` is actually slower than direct array access (but, somehow, using `ArrayList` iterator is just as fast as direct array access). So, I retract my comments on this matter. :-/
Chris Jester-Young
+2  A: 

ArrayList.get is an if statement followed by an array access; there's not much to optimize there. ArrayList.get occupies half the execution time because you're not doing anything else. The important factor in the time taken is the number of iterations, not what's inside the for-loop.
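
For reference, this is roughly what ArrayList.get does internally - a paraphrase of the OpenJDK implementation, not the exact source:

// Paraphrase (not the exact OpenJDK source): a single bounds check
// followed by a plain array read.
class SimpleArrayList<E> {
    private Object[] elementData = new Object[10];
    private int size;

    @SuppressWarnings("unchecked")
    public E get(int index) {
        if (index >= size)               // the "if statement"
            throw new IndexOutOfBoundsException("Index: " + index + ", Size: " + size);
        return (E) elementData[index];   // the array access
    }
}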

fgb
+2  A: 

There's no such thing as O(n*(n+1)/2). Your algorithm is O(n²). See Plain English explanation of Big O for a more detailed explanation.

Ben is correct: you can reduce the get() calls by getting the i-th element outside the inner loop.

What you're really looking for is something that improves on O(n²), and this requires being able to make additional assumptions about the elements. That depends on what you mean by "similarity".

Two common approaches:

  • Sort the lists and merge. Overall this is O(n log n);
  • Put one list into a Map of some kind that has (near-) constant lookup. This can reduce the algorithm to anywhere between O(n) and O(n log n), depending on the type of Map and the nature of the traversal (sketched below).

But all this depends on what you mean by "similarity".
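
To illustrate the second approach under a strong assumption - namely that "similar" reduces to sharing a discrete key, such as exact equality - here is a sketch where one O(n) pass with a HashMap replaces the O(n²) pairwise scan:

import java.util.HashMap;
import java.util.Map;

class MapLookupSketch {
    // Returns the indices of the first two elements sharing a key, or null.
    static int[] findMatchingPair(int[] keys) {
        Map<Integer, Integer> seen = new HashMap<Integer, Integer>(); // key -> first index seen
        for (int i = 0; i < keys.length; i++) {
            Integer prev = seen.get(keys[i]);
            if (prev != null) return new int[] { prev, i }; // a "most similar" pair
            seen.put(keys[i], i);
        }
        return null; // no two elements share a key
    }
}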

cletus
Don't be so pedantic. Of course it's O(n²), but since the values are symmetric I don't calculate all the similarities: I always skip the similarity of _j_ and _k_ with _j >= k_ and compute only the other direction. So of course the complexity is quadratic, but it does _n*(n+1)/2_ iterations. It was just to point out that I had already optimized that part :) I added some details about similarity, by the way..
Jack
A: 

If you're iterating this process, finding the next-most-similar pair each time, you might do well to create a map from (i, j) pairs to similarity measures - depending on how processor-intensive calculating the similarity is, how many items you've got, and how much memory you've got.

Carl Manaster
+1  A: 

Algorithmic efficiency aside, you are calling get too many times. Currently get is called (on the order of) 2*size*size times, when it needs to be called only size + size*size/2 times. This only changes constants, but it looks to me like you only need to call get about one quarter as often as you currently do.

Try:

for (int j = 0; j < clusters.size(); ++j) {
  skip = false;
  HierarchNode jnode = clusters.get(j);

  for (int k = j+1; k < clusters.size(); ++k) {
    HierarchNode knode = clusters.get(k);
    float r = correlation(knode.tags, jnode.tags, knode, jnode);

    ... etc ...

Depending on the magnitude of clusters.size(), you might be able to further reduce constants by doing:

HierarchNode[] clusterArr = clusters.toArray(new HierarchNode[clusters.size()]);

and then using clusterArr[j] and clusterArr[k] instead of clusters.get(k) etc.

(names mangled slightly to avoid line wrap)

spong
Copying to a naked array is a complete waste of time. Don't do it unless profiling suggests you'll get any speed improvements from it. Here's my simple benchmark to substantiate my statement: http://codepad.org/LZUfDcTL
Chris Jester-Young
Not sure we're looking at the same thing. Your benchmark code uses iterators. The code is also (mostly) timing the performance of `StringBuffer` which swamps the element access. Have a look at this http://codepad.org/5TQnV4Xo
spong
I should also add a warning about the value of microbenchmarks as well I guess :) Agree that it needs to be tested in the actual working code.
spong
@spong: I apologise, you're right about the `StringBuilder` dominating the access time (especially because I mistakenly used `append(byte)` instead of `append(char)`, which caused reallocations to occur to the `StringBuilder` in question. Changing the code to use `append(char)` changed the profile a bit: `List<Integer>` and `Integer[]` took about the same amount of time, but `int[]` was significantly faster. (This could indeed be one argument for arrays: you can have arrays of primitives, but you can't have lists of primitives.) Anyway, looking at your code now. :-)
Chris Jester-Young
@spong: Wow. I totally take back my comment about `get()` being fast. (I both ran your program as well as modified mine to use `get()` in place of the iterator.) I inspected the source code to the `ArrayList` class, looking at both the iterator and the `get()` implementation, and am still scratching my head as to why there's such a huge difference in performance. Wow. You win. +1
Chris Jester-Young
A: 

Local optimization won't count for much compared to changing the algorithm. We are not sure what you wanted to do in the first place, and thus we can't give you the best (or even a good) answer.

From what I see, it seems you have quite a lot of elements, and each one contains a list of (Tag, Weight) pairs. So, a few things are unclear here:

  • Is "Weight" calculated from another place or it is static for a corresponding Tag?
  • Can we use Integer instead of Float so numeric computing would be faster? BTW, there's bug in your program that you are comparing equally a Float number (I mean max == 1.0f). You should, however, use > or < with a precision range in this case.

If the answers are "yes", we have some local optimizations. If not, please consider the following techniques (which depend on your real data and real problem as well):

  • Will sorting all elements help? It would increase the chance of cutting off calculations early in some cases...
  • Use a dynamic programming technique, e.g. incrementally update the correlations. You would have to iterate through all elements once per update, but it saves lots of effort later.
  • Parallelize your code (run it on several threads) so it takes advantage of multicore environments; a minimal sketch follows this list. If you are worried about locking, try using a lock-free Map/List instead.
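
To sketch the parallelization point (the names are illustrative, and a precomputed similarity matrix stands in for the real correlation call just to keep the example short): each task scans its own stripe of rows and returns a local best, so the hot loop needs no locks at all - only the final reduction touches shared results.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelBestPair {
    static final class Best { int i, j; float sim = -1f; }

    static Best search(final float[][] sim, int threads) throws Exception {
        final int n = sim.length;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Best>> futures = new ArrayList<Future<Best>>();

        for (int t = 0; t < threads; t++) {
            final int stripe = t, step = threads;
            futures.add(pool.submit(new Callable<Best>() {
                public Best call() {
                    Best local = new Best(); // task-private, so no locking needed
                    for (int i = stripe; i < n; i += step)   // interleaved rows
                        for (int j = i + 1; j < n; j++)
                            if (sim[i][j] > local.sim) {
                                local.sim = sim[i][j];
                                local.i = i;
                                local.j = j;
                            }
                    return local;
                }
            }));
        }

        Best best = new Best();
        for (Future<Best> f : futures) {   // reduce the per-task results
            Best b = f.get();
            if (b.sim > best.sim) best = b;
        }
        pool.shutdown();
        return best;
    }
}

If the shared similarity cache must stay, it would have to be a concurrent (or lock-free) map, as mentioned above.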
instcode
+1  A: 

I got the following idea after reading chapter 6 from http://nlp.stanford.edu/IR-book/information-retrieval-book.html

public class WHN implements Comparable<WHN> {
    private HierarchNode node;
    private float weight;

    public WHN(HierarchNode node, float weight) {
        this.node = node;
        this.weight = weight;
    }

    public HierarchNode getNode() { return node; }
    public float getWeight() { return weight; }

    public int compareTo(WHN o) { return Float.compare(this.weight, o.weight); }
}

// Build, for each tag, the list of nodes carrying it, sorted by weight:
Map<Tag, List<WHN>> map = new HashMap<Tag, List<WHN>>();
for (HierarchNode n : cluster) {
    for (Map.Entry<Tag, Float> tw : n.tags.entrySet()) {
        Tag tag = tw.getKey();
        Float weight = tw.getValue();
        if (!map.containsKey(tag)) {
            map.put(tag, new ArrayList<WHN>());
        }
        map.get(tag).add(new WHN(n, weight));
    }
}
for (List<WHN> l : map.values()) {
    Collections.sort(l);
}

Then, for each node, you could limit the search to the union of the elements with the N highest weights for each tag (called champion lists),

or you could keep a running dot product for each node and update it tag by tag, looping only through the nodes whose weight is higher than some fraction of the original node's weight (you can find the starting point using Collections.binarySearch).
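
A rough sketch of the champion-list construction, reusing the types above (topN is an illustrative tuning parameter, not something from the book):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ChampionLists {
    // For each tag, keep only the topN nodes by weight; candidate pairs for
    // "most similar" are then drawn from these short lists instead of the
    // full cluster set.
    static Map<Tag, List<WHN>> build(List<HierarchNode> cluster, int topN) {
        Map<Tag, List<WHN>> map = new HashMap<Tag, List<WHN>>();
        for (HierarchNode n : cluster) {
            for (Map.Entry<Tag, Float> tw : n.tags.entrySet()) {
                List<WHN> l = map.get(tw.getKey());
                if (l == null) {
                    l = new ArrayList<WHN>();
                    map.put(tw.getKey(), l);
                }
                l.add(new WHN(n, tw.getValue()));
            }
        }
        for (List<WHN> l : map.values()) {
            Collections.sort(l, Collections.<WHN>reverseOrder()); // heaviest first
            if (l.size() > topN)
                l.subList(topN, l.size()).clear(); // drop everything below the top N
        }
        return map;
    }
}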

I suggest you read the rest of the book, as it may contain a better algorithm.

ggf31416