All,

I have been going through a lot of sites that post about the performance of various Collection classes for various actions, e.g. adding an element, searching, and deleting. But I also notice that all of them describe different environments in which the tests were conducted, e.g. OS, memory, threads running, etc.

My question is: is there any site/material that provides the same performance information measured in a common, well-controlled test environment? That is, the configuration should not be an issue or a catalyst for the poor performance of any specific data structure.

[Updated]: For example, HashSet and LinkedHashSet both have O(1) complexity for inserting an element. However, Bruce Eckel's test claims that insertion is going to take more time for LinkedHashSet than for HashSet [http://www.artima.com/weblogs/viewpost.jsp?thread=122295]. So should I still go by the Big-O notation?

+5  A: 

In my opinion, all you need to know about a data structure is the Big-O of the operations on it, not subjective measures from different architectures. Different collections serve different purposes.

Maps are dictionaries
Sets assert uniqueness
Lists provide grouping and preserve iteration order
Trees provide cheap ordering and quick searches on dynamically changing contents that require constant ordering
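
As a rough illustration of these roles (the class and variable names below are made up purely for the example):

    import java.util.*;

    public class CollectionRoles {
        public static void main(String[] args) {
            // Map: a dictionary of key -> value pairs
            Map<String, Integer> ages = new HashMap<>();
            ages.put("alice", 30);

            // Set: asserts uniqueness; adding a duplicate is a no-op
            Set<String> users = new HashSet<>();
            users.add("alice");
            users.add("alice");
            System.out.println(users.size());        // 1

            // List: grouping with a stable iteration order
            List<String> events = new ArrayList<>(Arrays.asList("start", "stop"));
            System.out.println(events);              // [start, stop]

            // Tree: keeps contents sorted as they change
            NavigableMap<String, Integer> scores = new TreeMap<>(ages);
            scores.put("bob", 25);
            System.out.println(scores.firstKey());   // alice
        }
    }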

Edited to include bwawok's statement on the use case of tree structures

Update
From the javadoc on LinkedHashSet

Hash table and linked list implementation of the Set interface, with predictable iteration order.

...

Performance is likely to be just slightly below that of HashSet, due to the added expense of maintaining the linked list, with one exception: Iteration over a LinkedHashSet requires time proportional to the size of the set, regardless of its capacity. Iteration over a HashSet is likely to be more expensive, requiring time proportional to its capacity.

Now we have moved from the very general case of choosing an appropriate data-structure interface to the more specific case of which implementation to use. However, we still ultimately arrive at the conclusion that specific implementations are well suited for specific applications based on the unique, subtle invariant offered by each implementation.
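
For instance, here is a quick sketch of the predictable iteration order the javadoc describes (the exact order a HashSet produces depends on hashing and capacity, so treat it as unspecified):

    import java.util.*;

    public class IterationOrderDemo {
        public static void main(String[] args) {
            Set<Integer> hash = new HashSet<>();
            Set<Integer> linked = new LinkedHashSet<>();
            for (int i : new int[] {42, 7, 19, 3}) {
                hash.add(i);
                linked.add(i);
            }
            // LinkedHashSet iterates in insertion order; HashSet's order
            // depends on hashing and capacity and should not be relied on.
            System.out.println("HashSet:       " + hash);
            System.out.println("LinkedHashSet: " + linked);   // [42, 7, 19, 3]
        }
    }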

Tim Bender
Overall very true and what I thought as well. My minor comment is that trees (TreeMap and TreeSet, I assume) are not that cheap for ordering. If you are just going to build a list of 1000000 items and then look at them sorted, you will be better off with an ArrayList that you sort at the end. The actual use cases of TreeMap/TreeSet are pretty rare: it has to be something you add to a lot and need sorted at any given point.
bwawok
@bwawok, you are quite right. I have updated my answer to hopefully better reflect your very valid point.
Tim Bender
@Tim: updated my query.
darkie15
Agreed that measurements from different architectures are unlikely to be widely useful, but I would amend "all you need to know is the big-O" to "all you need is to know big-O and to have some idea of the constant factor". Constant factors can be very significant, and there are plenty of cases where simple O(n) algorithms outperform complicated O(1) algorithms for common values of n.
Porculus
@Porculus average run time is important, but the only time you normally run into performance issues is when N is large. So if you focus on big-O runtime, you will do fine. The little time you may lose for small N doesn't matter, as small N should run fast anyway.
bwawok
@Porculus, @bwawok is right for the most part. One exception of course is that Java's array sort implementation does switch from mergesort to insertion sort for small values of N (call it Z), and that is significant only because mergesort splits N into X sub-problems of size Z, allowing the benefits of insertion sort's average runtime compared to mergesort to be gained X times... if that makes sense. So average time matters, but only if you know N to be small and the function to be performed a significant number of times.
Tim Bender
+5  A: 

What do you need to know about them, and why? The reason that benchmarks report a given JDK and hardware setup is so that they could (in theory) be reproduced. What you should get from benchmarks is an idea of how things will work. For an ABSOLUTE number, you will need to benchmark your own code doing your own thing.

The most important thing to know is the Big O runtime of various collections. Knowing that getting an element out of an unsorted ArrayList is O(n), but getting it out of a HashMap is O(1) is HUGE.
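
For example, a small sketch of that difference in lookup style (the names and data here are invented; the point is the access pattern, not the values):

    import java.util.*;

    public class LookupStyles {
        public static void main(String[] args) {
            List<String> names = Arrays.asList("alice", "bob", "carol");

            // O(n): contains() walks the list element by element
            boolean inList = names.contains("carol");

            // O(1) on average: build an index once, then look up by key
            Map<String, Integer> index = new HashMap<>();
            for (int i = 0; i < names.size(); i++) {
                index.put(names.get(i), i);
            }
            Integer position = index.get("carol");

            System.out.println(inList + " at position " + position);
        }
    }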

If you are already using the correct collection for a given job, you are 90% of the way there. The times when you need to worry about how fast you can, say, get items out of a HashMap should be pretty darn rare.

Once you leave single-threaded land and move into multi-threaded land, you will need to start worrying about things like ConcurrentHashMap vs Collections.synchronizedMap. Until you are multi-threaded, you can just not worry about this kind of stuff and focus on which collection to use for which purpose.
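
A minimal sketch of those two thread-safe options (counter example only; ConcurrentHashMap generally scales better under contention because it avoids one map-wide lock, but both are correct here):

    import java.util.*;
    import java.util.concurrent.*;

    public class SharedMaps {
        public static void main(String[] args) throws Exception {
            // One lock around the whole map: every call contends on it
            Map<String, Integer> syncMap =
                    Collections.synchronizedMap(new HashMap<>());

            // Internally partitioned: concurrent readers/writers scale better
            Map<String, Integer> concurrentMap = new ConcurrentHashMap<>();

            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (int t = 0; t < 4; t++) {
                pool.submit(() -> {
                    for (int i = 0; i < 10_000; i++) {
                        syncMap.merge("hits", 1, Integer::sum);
                        concurrentMap.merge("hits", 1, Integer::sum);
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            // Both end up at 40000; the difference is contention, not correctness
            System.out.println(syncMap.get("hits") + " / " + concurrentMap.get("hits"));
        }
    }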

Update to HashSet vs LinkedHashSet

I haven't ever found a use case where I needed a LinkedHashSet (because if I care about order I tend to have a List, and if I care about O(1) gets, I tend to use a HashSet). Realistically, most code will use ArrayList, HashMap, or HashSet. If you need anything else, you are in an "edge" case.

bwawok
@bwawok: updated my query.
darkie15
@bwawok: LinkedHashSet is for when you want to be able to iterate over the hash set in the order elements were added.
Jason S
@Jason S: Okay I will update to clarify. I have never run across a need for it in my code... if I care about order I tend to use ArrayList. So I guess you will need to care about order AND O(1) gets to want a LinkedHashSet.
bwawok
+3  A: 

Here are my recommendations:

  1. First of all, don't optimize :) Not that I am telling you to design crap software, but just to focus on design and code quality more than premature optimization. Assuming you've done that, and now you really need to worry about which collection is best beyond purely conceptual reasons, let's move on to point 2
  2. Really, don't optimize yet (roughly stolen from M. A. Jackson)
  3. Fine. So your problem is that even though you have theoretical time complexity formulas for best cases, worst cases and average cases, you've noticed that people say different things and that practical settings are a very different thing from theory. So run your own benchmarks! You can only read so much, and while you do that your code doesn't write itself. Once you're done with the theory, write your own benchmark - for your real-life application, not some irrelevant mini-application for testing purposes - and see what actually happens to your software and why. Then pick the best algorithm. It's empirical, it could be regarded as a waste of time, but it's the only way that actually works flawlessly (until you reach the next point).
  4. Now that you've done that, you have the fastest app ever. Until the next update of the JVM. Or of some underlying component of the operating system your particular performance bottleneck depends on. Guess what? Maybe your clients have different ones. Here comes the fun part: you need to be sure that your benchmark is valid for others, or in most cases (or have fun writing code for different cases). You need to collect data from users. LOTS. And then you need to do that over and over again to see what happens and whether it still holds true. And then re-write your code accordingly, over and over again. (The now-terminated Engineering Windows 7 blog is actually a nice example of how user data collection helps to make educated decisions to improve user experience.)

Or you can... you know... NOT optimize. Platforms and compilers will change, but a good design should - on average - perform well enough.

Other things you can also do:

  • Have a look at the JVM's source code. It's very educational, and you'll discover a host of hidden things (I'm not saying that you have to use them...)
  • See that other thing on your TODO list that you need to work on? Yes, the one near the top that you always skip because it's too hard or not fun enough. That one right there. Well, get to it and leave the optimization thingy alone: it's the evil child of a Pandora's box and a Moebius band. You'll never get out of it, and you'll deeply regret you tried to have your way with it.

That being said, I don't know why you need the performance boost so maybe you have a very valid reason.

And I am not saying that picking the right collection doesn't matter. Just that once you know which one to pick for a particular problem, and you've looked at the alternatives, you've done your job without having to feel guilty. The collections usually have a semantic meaning, and as long as you respect it you'll be fine.

haylem
Makes sense. Thank you!
darkie15
@darkie15: you're welcome.
haylem
A: 

If I had to sort millions of rows I'd try to find a different way. Maybe I could improve my SQL, improve my algorithm, or perhaps write the elements to disk and use the operating system's sort command.

I've never had a case where collections were the cause of my performance issues.

Tony Ennis
Boy, I have: http://stackoverflow.com/questions/926266/performance-optimization-strategies-of-last-resort/927773#927773
Mike Dunlavey
@Tony: I am sorry, but I am not sure what you mean here. I never meant to talk about persistence.
darkie15
+4  A: 

The different collection classes have different big-O performance, but all that tells you is how they scale as they get large. If your set is big enough, the one with O(1) will outperform the one with O(N) or O(log N), but there's no way to tell what value of N is the break-even point, except by experiment.

Generally, I just use the simplest possible thing, and then if it becomes a "bottleneck", as indicated by operations on that data structure taking a significant percentage of the time, I will switch to something with a better big-O rating. Quite often, either the number of items in the collection never comes near the break-even point, or there's another simple way to resolve the performance problem.
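
One rough way to look for that break-even point (a single-shot timing sketch only; a serious measurement would use a harness like JMH, warm up the JIT, and repeat many times):

    import java.util.*;

    public class BreakEven {
        public static void main(String[] args) {
            for (int n : new int[] {10, 100, 1_000, 10_000, 100_000}) {
                List<Integer> list = new ArrayList<>();
                Set<Integer> set = new HashSet<>();
                for (int i = 0; i < n; i++) {
                    list.add(i);
                    set.add(i);
                }

                long t0 = System.nanoTime();
                list.contains(n - 1);                 // O(n) scan
                long listNs = System.nanoTime() - t0;

                t0 = System.nanoTime();
                set.contains(n - 1);                  // O(1) on average
                long setNs = System.nanoTime() - t0;

                System.out.printf("n=%7d  list=%8d ns  set=%8d ns%n",
                        n, listNs, setNs);
            }
        }
    }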

Mike Dunlavey
+1  A: 

Both HashSet and LinkedHashSet have O(1) performance. Same with HashMap and LinkedHashMap (actually the former are implemented based on the latter). This only tells you how these algorithms scale, not how they actually perform. In this case, LinkedHashSet does all the same work as HashSet but also always has to update a previous and next pointer to maintain the order. This means that the constant (an important value too when talking about actual algorithm performance) for HashSet is lower than for LinkedHashSet.

Thus, since these two have the same Big-O, they scale the same essentially - that is, as n changes, both have the same performance change and with O(1) the performance, on average, does not change.

So now your choice is based on functionality and your requirements (which really should be what you consider first anyway). If you only need fast add and get operations, you should always pick HashSet. If you also need consistent ordering - such as last accessed or insertion order - then you must also use the Linked... version of the class.

I have used the "linked" class in production applications, LinkedHashMap specifically. I used it in one case for a symbol-table-like structure, so I wanted quick access to the symbols and related information. But I also wanted to output the information, in at least one context, in the order that the user defined those symbols (insertion order). This makes the output friendlier for the user, since they can find things in the same order they were defined.
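
Something along these lines (the symbol names are invented; the point is fast lookup by key plus reporting in definition order):

    import java.util.*;

    public class SymbolTable {
        public static void main(String[] args) {
            // Keyed by symbol name; iteration order is the order of definition
            Map<String, String> symbols = new LinkedHashMap<>();
            symbols.put("MAX_RETRIES", "int = 3");
            symbols.put("TIMEOUT_MS", "long = 5000");
            symbols.put("APP_NAME", "String = \"demo\"");

            // Fast lookup by name, just like a HashMap
            System.out.println(symbols.get("TIMEOUT_MS"));

            // Report in the order the user defined the symbols
            for (Map.Entry<String, String> e : symbols.entrySet()) {
                System.out.println(e.getKey() + " : " + e.getValue());
            }
        }
    }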

Kevin Brock
Got it. Thank you
darkie15