ansaurus

Question

Answer 1

+2 A:

The algorithm you have is O(n²). Unless you have a way to make your algorithm do something significantly better than doing pairwise comparisons, performance is unlikely to appreciably improve. :-(

Chris Jester-Young 2010-03-24 00:14:45

Answer 2

+2 A:

At the risk of stating the obvious, you might get some speed up by using this pseudo-code:

int currentMax = 0.0f
for (int i = 0 to n)
  get content i-th
  for (int j = i to n)
    get content j-th
      if their similarity > currentMax
        update currentMax

merge the two clusters

It's still O(n²), though. If you need to compare every element to every other element to find out which pair is closest, you can't beat O(n²).

That said, if you're calling this multiple times, then there's optimization to be found in caching these results in a sortable map.

EDIT: If the similarity is something that is rather simple (e.g., a 1-dimensional value such as height), you might be able to sort the items in the array first, such that element[0] is most similar to element[1] which is most similar to element[0] or element[2]. In that case, you can get a speed up to O(n lg n).

EDIT2: Given your correlation code, your benchmark results are very suspect. I cannot imagine the situation in which those two gets take more time than the call to the correlation code (even assuming the cache is hit the vast majority of the time), which is also called O(n²) times. Also spong makes a very good point about converting these to arrays first if get() is the bottleneck.

Ben Hocking 2010-03-24 00:18:24

If the JIT is halfway decent (something I take for granted when I code in Java), it'd already have done this optimisation for you. Still, good to mention. :-) +1

Chris Jester-Young 2010-03-24 00:22:17

You _would_ think it would, but while running benchmarks, I always like to test my assumptions!

Ben Hocking 2010-03-24 00:24:01

Totally agree with testing any performance factors empirically. :-)

Chris Jester-Young 2010-03-24 00:24:37

I like your edit. Too bad I can't upvote you twice. :-P

Chris Jester-Young 2010-03-24 00:31:10

This is good, assuming the pseudocode matches the actual code. Note that `someArrayList.get(index)` is quite a lot slower than using `someArray[index]`, so it may be worth considering a `toArray(...)` before looping.

spong 2010-03-24 00:41:24

Thanks, Chris. I totally agree with your JIT comments in general, by the way. Until you've benchmarked, it's fine to assume that the JIT will do most optimizing for you.

Ben Hocking 2010-03-24 01:05:35

I eat my words about `get()` being as fast as array access (which is different from saying that lists are slower than arrays: they're not, just that `get()` is slow for some reason; iterator access is fast).

Chris Jester-Young 2010-03-24 05:01:17

Answer 3

A:

There are not many complex operations in your code above. Mainly simple number reads/checks/writes. They are amazingly fast.

The issue is that .get() is a function call - it will be much slower compared to a simple +, = or <= operation. If it's too slow for you, you should start to use real arrays or (as stated by the others) optimize your algorithm first.

Marcel J. 2010-03-24 00:20:34

No, `get()` is a really simple function that gets inlined by any halfway-decent JIT compiler, and doesn't perform differently to using an array directly.

Chris Jester-Young 2010-03-24 00:21:19

I agree that the JIT compiler should be able to solve this. But if it did in this case the profiler (which might be to blame) shouldn't indicate that `.get()` (a simple array-read) is to blame.

Marcel J. 2010-03-24 00:28:50

I don't think that the profiler considered `get` the bottleneck per se, just that most of the runtime went there because it gets called so many times. Death by a thousand cuts and all that. :-)

Chris Jester-Young 2010-03-24 00:29:52

To add another chapter to this story, spong posted a very convincing benchmark that shows that `ArrayList.get` is actually slower than direct array access (but, somehow, using `ArrayList` iterator is just as fast as direct array access). So, I retract my comments on this matter. :-/

Chris Jester-Young 2010-03-24 05:02:35

Answer 4

+2 A:

ArrayList.get is an if statement followed by an array access. There's not much to optimize there. ArrayList.get occupies half the execution time because you're not doing anything else. The important factor in the time taken is the number of iterations not what's inside the for-loop.

fgb 2010-03-24 00:22:24

Answer 5

+2 A:

There's no such thing as O(n*(n+1)/2). Your algorithm is O(n²). See Plain english explanation of Big O for a more detailed explanation.

Ben is correct: you can reduce the get() calls by getting the i-th element outside the inner loop.

What you're really looking for is something that improves on O(n²) and this requires being able to make additional assumptions about the elements. That depends on what you mean by "similarity".

Two common approaches:

Sort the lists and merge. Overall this is O(n log n);
Put one list into a Map of some kind that has (near-) constant lookup. This can reduce the algorithm to anywhere between O(n) and O(n log n) depending on the type of Map and the nature of the traversal.

But all this depends on what you mean by "similarity".

cletus 2010-03-24 00:34:21

Don't be so much pedantic. Of course it's O(n²) but since I don't calculate all the similarities because values are simmetric I actually always skip calculating similarities of _j_ and _k_ with _j >= k_ but only vice-versa. So of course it's a quadratic complexity but it does _n*(n+1)/2_ iterations. It was just to warn that I had already optimized that thing :) I added some details about similarity by the way..

Jack 2010-03-24 00:39:11

Answer 6

A:

If you're iterating this process, finding each time the next-most-similar pair, you might do well to create a map from i, j pairs to similarity measures - depending on how processor-intensive calculating the similarity is, and how many items you've got, and how much memory you've got.

Carl Manaster 2010-03-24 00:36:53

Answer 7

+1 A:

Algorithmic efficiency aside, you are calling get too many times. Currently get is called (in the order of) 2*size*size times. It should be called size+size*size/2 times. This only changes constants, but it looks to me like you only need to call get about one quarter as much as you are currently.

Try:

for (int j = 0; j < clusters.size(); ++j) {
  skip = false;
  HierarchNode jnode = clusters.get(j);

  for (int k = j+1; k < clusters.size(); ++k) {
    HierarchNode knode = clusters.get(k);
    float r = correlation(knode.tags, jnode.tags, knode, jnode);

    ... etc ...

Depending on the magnitude of clusters.size(), you might be able to further reduce constants by doing:

HierarchNode[] clusterArr = clusters.toArray(new HierarchNode[clusters.size()]);

and then using clusterArr[j] and clusterArr[k] instead of clusters.get(k) etc.

(names mangled slightly to avoid line wrap)

spong 2010-03-24 01:35:03

Copying to a naked array is a complete waste of time. Don't do it unless profiling suggests you'll get any speed improvements from it. Here's my simple benchmark to substantiate my statement: http://codepad.org/LZUfDcTL

Chris Jester-Young 2010-03-24 01:48:46

Not sure we're looking at the same thing. Your benchmark code uses iterators. The code is also (mostly) timing the performance of `StringBuffer` which swamps the element access. Have a look at this http://codepad.org/5TQnV4Xo

spong 2010-03-24 04:30:32

I should also add a warning about the value of microbenchmarks as well I guess :) Agree that it needs to be tested in the actual working code.

spong 2010-03-24 04:32:41

@spong: I apologise, you're right about the `StringBuilder` dominating the access time (especially because I mistakenly used `append(byte)` instead of `append(char)`, which caused reallocations to occur to the `StringBuilder` in question. Changing the code to use `append(char)` changed the profile a bit: `List<Integer>` and `Integer[]` took about the same amount of time, but `int[]` was significantly faster. (This could indeed be one argument for arrays: you can have arrays of primitives, but you can't have lists of primitives.) Anyway, looking at your code now. :-)

Chris Jester-Young 2010-03-24 04:47:42

@spong: Wow. I totally take back my comment about `get()` being fast. (I both ran your program as well as modified mine to use `get()` in place of the iterator.) I inspected the source code to the `ArrayList` class, looking at both the iterator and the `get()` implementation, and am still scratching my head as to why there's such a huge difference in performance. Wow. You win. +1

Chris Jester-Young 2010-03-24 04:59:30

Answer 8

A:

Local optimizing won't count much compared to the changing of algorithm. We are not sure what you wanted to do at first place and thus we can't give you the best/good answer.

From what I see, it seems you are having a quite lots of elements and each one contains a list of (Tag, Weight)s. So, there are unclear things here:

Is "Weight" calculated from another place or it is static for a corresponding Tag?
Can we use Integer instead of Float so numeric computing would be faster? BTW, there's bug in your program that you are comparing equally a Float number (I mean max == 1.0f). You should, however, use > or < with a precision range in this case.

If there are "yes", we have some local optimization. But no, please consider the following techniques (which depend on your real data and real problem as well):

Will sorting all elements help? This will increase the chance to cut-off calculation in some cases...
Use dynamic programming technique, e.g. incrementally update the correlations. You would have to iterate through all elements one every update but save lots of effort later.
Parallel your code (runs on several threads) so it takes advantage of multicores environment. If you are worrying about locking, try using lock-free Map/List instead.

instcode 2010-03-24 10:59:48

Answer 9

+1 A:

I got the following idea after reading chapter 6 from http://nlp.stanford.edu/IR-book/information-retrieval-book.html

    public class WHN implements Comparable<WHN>{
        private HierarchNode node;
        private float weight;

        public HierarchNode getNode() {return node;}
        public float getWeight() {return weight;}

        public WHN(HierarchNode node, float weight) {this.node = node;this.weight = weight;}

        public int compareTo(WHN o) {return Float.compare(this.weight, o.weight); }
    }

    Map<Tag,<SortedMap<Float,HierarchNode>> map = new HashMap<Tag,List<WHN>> 
    for (HierarchNode n : cluster){
    for (Map.Entry tw : n.tags.entrySet()){
        Tag tag = tw.getKey();
        Float weight = tw.getValue();
        if (!map.ContainsKey(tag)){
            map.put(tag,new ArrayList<WHN>();
        }
        map.get(tag).add(new WHN(n,weight));
    }
    for(List<WHN> l: map.values()){
        Collections.Sort(l);
    }
}

Then for each node: you could limit the search to the union of the elements with the N highest weights for each tag (called champion lists)

or you could keep a temporal dot product for each node and update the dot product for each tag, but only looping throught the nodes with weight higher than some fraction of the original node weight (you can find the start using Collection.binarySearch)

I suggest you read the rest of the book, as it may contain a better algorithm.

ggf31416 2010-03-25 00:26:24

ansaurus

tags:

views:

answers:

Java: ArrayList bottleneck

related questions