Imagine that I need to create a Collection of elements, where order may or may not matter. Effectively, all I plan on doing is using the iterator. I notice most of my colleagues use an ArrayList rather than a LinkedHashSet/HashSet. My question is: if I know that these elements should be unique, should I be using a Set or a List? In practice it doesn't really make a difference, but doesn't a Set more effectively convey that the elements are unique?

I find this to be an interesting question for large enterprise applications for a few reasons: 1) If you can't guarantee the quality of the code overall, using a Set can be dangerous. Why? Because equals() and hashCode() might be incorrectly overridden, and a Set could then cause some really nasty issues. 2) Using a List is more resilient to future changes: if duplicates become possible for whatever reason, there is no need for concern.

Essentially it boils down to: If I know I should expect unique elements, should I favor Set over List in all cases?

Edit: I suppose I'm also asking: Should Set be used to ensure that no duplicates are added, or can it also be used for the sole purpose of illustrating that no duplicates exist for ease of understanding?
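
To make point 1) concrete, here is a rough sketch of the failure I'm worried about (the Order class and its fields are invented just for illustration): a broken equals()/hashCode() pair makes a HashSet silently collapse elements that are actually distinct, while an ArrayList keeps them all.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class whose equals()/hashCode() are overridden incorrectly:
// both ignore orderId, so different orders of the same customer compare equal.
class Order {
    private final int customerId;
    private final int orderId;

    Order(int customerId, int orderId) {
        this.customerId = customerId;
        this.orderId = orderId;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Order && ((Order) o).customerId == customerId; // BUG: orderId ignored
    }

    @Override
    public int hashCode() {
        return customerId;
    }
}

public class BrokenEqualsDemo {
    public static void main(String[] args) {
        Set<Order> set = new HashSet<Order>();
        List<Order> list = new ArrayList<Order>();
        for (int orderId = 1; orderId <= 3; orderId++) {
            Order o = new Order(42, orderId); // same customer, three distinct orders
            set.add(o);   // quietly treated as a duplicate after the first add
            list.add(o);  // all three kept
        }
        System.out.println(set.size());  // prints 1 -- data silently lost
        System.out.println(list.size()); // prints 3
    }
}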

A: 

Set is preferable, as it will enforce uniqueness and show you where you are wrong.

You may have some issues when methods are incorrectly overridden, but the right choice is not to pray and avoid calling them. Detect bugs and fix them!

Edit: And yes, when you see a Set it's clearer that unique values are needed and, even better, that unique values are enforced. Never guess or trust how your code will be used ;)
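
For example, a minimal sketch (the class and method names are purely illustrative) of letting the Set itself report a violated uniqueness assumption through the boolean that add() returns:

import java.util.HashSet;
import java.util.Set;

public class EnforceUniqueness {
    private final Set<String> userIds = new HashSet<String>();

    // add() returns false when the element is already present, so a broken
    // uniqueness assumption surfaces immediately instead of being hidden.
    public void register(String userId) {
        if (!userIds.add(userId)) {
            throw new IllegalStateException("Duplicate user id: " + userId);
        }
    }
}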

François
Doesn't this suggest that the majority of ArrayLists seen in the wild should really be a Set of some sort? I can't think of many legitimate, generic, real-life use cases in which you need a List with possible duplicates.
BryB
Sometimes I use a Set and a List together. The Set ensures that the elements are unique and the List allows me to access the elements via index.
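
Roughly like this simplified sketch (names invented):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class IndexedUniqueList<E> {
    private final Set<E> unique = new HashSet<E>();
    private final List<E> ordered = new ArrayList<E>();

    // The Set guards uniqueness, the List preserves positions.
    public boolean add(E element) {
        if (unique.add(element)) {
            ordered.add(element);
            return true;
        }
        return false;
    }

    public E get(int index) {
        return ordered.get(index);
    }
}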
kd304
Use a list when you don't care about duplicates or if you ensure uniqueness yourself (private field, list populated only by your methods with proper checks)
François
+5  A: 

1) is totally bogus. Don't work around bugs, fix them. Therefore, use any Set implementation if order doesn't matter, or a SortedSet if order does matter. If elements don't have to be unique (and you should determine that now, as it usually should never change), feel free to use a List.
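
For instance, a small illustrative sketch (not taken from the question):

import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

public class SetChoices {
    public static void main(String[] args) {
        // Order irrelevant: any Set will do.
        Set<String> tags = new HashSet<String>();
        tags.add("b");
        tags.add("a");

        // Order matters: a SortedSet keeps elements sorted
        // while still enforcing uniqueness.
        SortedSet<String> sortedTags = new TreeSet<String>();
        sortedTags.add("b");
        sortedTags.add("a");
        System.out.println(sortedTags); // [a, b]
    }
}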

phihag
+2  A: 

If you need to think in terms of unique elements, use a Set. But if you don't trust your users to implement equals/hashCode properly, then I suggest you document that whenever something looks wrong with the iteration, they should check their equals/hashCode! It really depends on the use case of the data model.

kd304
A: 

Consider readability of code as well.

If you expect and want unique elements, then use a Set data structure; things will be much clearer in the long run, and this will also promote better coding.

Sev
A: 

Using a Set implementation instead of a List implementation could degrade performance. When an element is inserted into a Set, the Set has to check that it isn't already present. If you are planning just to use the iterator, use the simplest implementation possible (ArrayList).

I don't think that is a good idea to use a Set just to convey information. If you are adding the items yourself and you can guarantee that no duplicates will be added, it is pointless to use a Set. Use a proper name to convey information about the collection. Also, it is a good idea to expose it through the Collection interface, especially if the callers of your class just need to iterate through the collection.
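
For example, something along these lines (the names are invented for illustration):

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

public class CountryCodeRepository {
    // Populated from a source that already guarantees uniqueness; the name
    // documents that, so a Set would add cost without adding information.
    private final List<String> uniqueCountryCodes = new ArrayList<String>();

    public void loadFrom(Collection<String> codesFromDatabase) {
        uniqueCountryCodes.addAll(codesFromDatabase);
    }

    // Callers only need to iterate, so only the Collection interface is exposed.
    public Collection<String> getUniqueCountryCodes() {
        return Collections.unmodifiableCollection(uniqueCountryCodes);
    }
}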

kgiannakakis
Can someone explain why this was downmodded? This actually makes a lot of sense.
BryB
Why should using a Set degrade performance? E.g. HashSet offers constant-time performance on add, remove, contains and size. And how does he guarantee that there are no duplicates in an ArrayList? He must call contains (which is the same as checking indexOf >= 0), which has to run through the whole list.
jitter
The context of the answer states that you know no duplicates will be added, in which case the built-in logic to check for duplicates in Set implementations will be slower than simply adding an element to the end of a list (no need to manually call contains()). Yes, both implementations are O(1) but the Set is necessarily slower than the List since it needs to do more.
Andrzej Doyle
I haven't done any tests, but I believe that a HashSet will be slower in adding elements than ArrayList. The guarantee of no duplicates in the ArrayList comes from the way you add the items. If for example you create an ArrayList to hold objects from a database, you know that the elements are unique. If you create a list with PS or country codes, you know that the elements are unique. Will you use a Set implementation just to convey this information?
kgiannakakis
A: 

I don't think either choice should be considered to convey intention - your method ought to be declared to return simply a Collection with an appropriate generic parameter, both for flexibility and because as you've said, consumers of it should be able to just iterate over it without worrying what type it is. This gives the added advantage that if requirements change later, or it turns out that for whatever reason your initial choice was wrong, you need to change the code in just a single place (the initial constructor call).

The intention should rather be specified in the documentation of the method, which should detail whether the collection's iterator will return elements in any particular order, and whether duplicate elements will appear.
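
For example, a hypothetical sketch (class and method names are made up):

import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;

public class OrderService {

    /**
     * Returns the ids of all open orders. The returned collection contains
     * no duplicates, and its iterator returns ids in creation order.
     */
    public Collection<Long> getOpenOrderIds() {
        // The concrete type is an implementation detail; it can be changed
        // here without touching any caller.
        return new LinkedHashSet<Long>(loadOpenOrderIdsFromStore());
    }

    private Collection<Long> loadOpenOrderIdsFromStore() {
        return Arrays.asList(101L, 102L, 103L); // stand-in for real data access
    }
}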

And I also agree with the above posts that say your reasoning around point 1) is off - if there are classes with incorrect implementations of equals and/or hashCode that you want to put into a set, you fix them and then use a Set!

Andrzej Doyle
A: 

Someone said that HashSet offers constant time performance on add, remove, contains and size.

The actual statement in the JavaDocs is "This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets."

This means you might get slow addition times when adding something to the set if it has a poorly implemented hashCode method.

The following code demonstrates what can happen dependent upon your hashCode implementation.

public void testHashSetAddition() {
    for (int mod = 10; mod <= 100; mod = mod + 10) {
        Set<Foo> s = new HashSet<Foo>();
        long start = new Date().getTime();
        for (int i = 0; i < 100000; i++) {
            // Only "mod" distinct hash codes exist, so small values of mod
            // produce long hash buckets that every add() must scan.
            s.add(new Foo(i % mod));
        }
        long end = new Date().getTime();
        System.out.println("Mod: " + mod + " - " + (end - start) + "ms");
    }
}

class Foo {
    private final int hc;

    public Foo(int i) {
        this.hc = i;
    }

    // equals() is deliberately not overridden, so all 100000 objects end up
    // in the set; only their hash codes collide.
    @Override
    public int hashCode() {
        return hc;
    }
}

The timing results were:

Mod: 10 - 22683ms
Mod: 20 - 14200ms
Mod: 30 - 10486ms
Mod: 40 - 8562ms
Mod: 50 - 7761ms
Mod: 60 - 6740ms
Mod: 70 - 5778ms
Mod: 80 - 5268ms
Mod: 90 - 4716ms
Mod: 100 - 3966ms

Then, doing exactly the same test for an ArrayList:

public void testAddingToArrayList() {
    for (int mod = 100; mod >= 10; mod = mod - 10) {
        List<Foo> l = new ArrayList<Foo>();
        long start = new Date().getTime();
        for (int i = 0; i < 100000; i++) {
            // The list never consults hashCode(), so mod makes no difference.
            l.add(new Foo(i % mod));
        }
        long end = new Date().getTime();
        System.out.println("Mod: " + mod + " - " + (end - start) + "ms");
    }
}

Gives:

Mod: 100 - 50ms
Mod: 90 - 30ms
Mod: 80 - 40ms
Mod: 70 - 30ms
Mod: 60 - 30ms
Mod: 50 - 40ms
Mod: 40 - 20ms
Mod: 30 - 30ms
Mod: 20 - 30ms
Mod: 10 - 30ms
A_M
You are right, but the hash set is optimized for read access, not for writes! contains() will be much faster for the hash set.
Arne Burmeister
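
For completeness, a rough sketch in the spirit of the benchmark above that measures the read-side difference (exact numbers will vary by machine and are not shown here):

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ContainsComparison {
    public static void main(String[] args) {
        int n = 100000;
        List<Integer> list = new ArrayList<Integer>();
        Set<Integer> set = new HashSet<Integer>();
        for (int i = 0; i < n; i++) {
            list.add(i);
            set.add(i);
        }
        System.out.println("list: " + timeContains(list, n) + "ms");
        System.out.println("set:  " + timeContains(set, n) + "ms");
    }

    // contains() on the ArrayList is a linear scan; on the HashSet it is a hash lookup.
    private static long timeContains(Collection<Integer> c, int n) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            c.contains(i);
        }
        return System.currentTimeMillis() - start;
    }
}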