views:

1679

answers:

9

I am exploring the HashSet<T> type, but I don't understand where it stands in collections.

Can one use it to replace a List<T>? I imagine the performance of a HashSet<T> to be better, but I couldn't see individual access to its elements.

Is it only for enumeration?

+5  A: 

HashSet is a set implemented by hashing. A set is a collection of values containing no duplicate elements. The values in a set are also typically unorderd. So no, a set can not be used to replace a list (unless you should've use a list in the first place).

If you're wondering what a set might be good for: anywhere you want to get rid of duplicates, obviously. As a slightly contrived example, let's say you have a list of 10.000 revisions of a software projects, and you want to find out how many people contributed to that project. You could use a Set<string> and iterate over the list of revisions and add each revision's author to the set. Once you're done iterating, the size of the set is the answer you were looking for.

earl
But Set doesn't allow retrieval of single elements? Like set[45]?
Joan Venge
For that, you'd iterate over the the members the set. Other typical operations are checking if the set contains an element or getting the size of the set.
earl
+5  A: 

Performance would be a bad reason to choose HashSet over List. Instead, what better captures your intent? If order is important, then Set (or HashSet) is out. If duplicates are permitted, likewise. But there are plenty of circumstances when we don't care about order, and we'd rather not have duplicates - and that's when you want a Set.

Carl Manaster
+2  A: 

HashSet<T> is a data strucutre in the .NET framework that is a capable of representing a mathematical set as an object. In this case, it uses hash codes (the GetHashCode result of each item) to compare equality of set elements.

A set differs from a list in that it only allows one occurrence of the same element contained within it. HashSet<T> will just return false if you try to add a second identical element. Indeed, lookup of elements is very quick (O(1) time), since the internal data structure is simply a hashtable.

If you're wondering which to use, note that using a List<T> where HashSet<T> is appropiate is not the biggest mistake, though it may potentially allow problems where you have undesirable duplicate items in your collection. What is more, lookup (item retrieval) is vastly more efficient - ideally O(1) (for perfect bucketing) instead of O(n) time - which is quite important in many scenarios.

Noldorin
Adding an existing item to a set will not throw an exception. Add will simply return false. Also: technically hash lookup is O(n), not O(1), unless you have a perfect hashing function. Of course in practice you'll get away with assuming it's O(1) unless the hashing function is really bad.
sepp2k
@sepp2k: Yeah, so it returns a boolean... The point is, it notifies you. And hash look up is *worst case* O(n) if you're bucketing is terrible - it's much closer to O(1) in general.
Noldorin
+2  A: 

List<T> is used to store ordered sets of information. If you know the relative order of the elements of the list, you can access them in constant time. However, to determine where an element lies in the list or to check if it exists in the list, the lookup time is linear. On the other hand, HashedSet<T> makes no guarantees of the order of the stored data and consequently provides constant access time for its elements.

As the name implies, HashedSet<T> is a data structure that implements set semantics. The data structure is optimized to implement set operations (i.e. Union, Difference, Intersect), which can not be done as efficiently with the traditional List implementation.

So, to choose which data type to use really depends on what your are attempting to do with your application. If you don't care about how your elements are ordered in a collection, and only want to enumarate or check for existence, use HashSet<T>. Otherwise, consider using List<T> or another suitable data structure.

Steve Guidi
Another caveat: sets generally allow only one occurrence of an element.
Steve Guidi
+2  A: 

Probably the most common use for hashsets is to see whether they contain a certain element, which is close to an O(1) operation for them (assuming a sufficiently strong hashing function), as opposed to lists for which check for inclusion is O(n) (and sorted sets for which it is O(log n)). So if you do a lot of checks, whether an item is contained in some list, hahssets might be a performance improvement. If you only ever iterate over them, there won't be much difference (iterating over the whole set is O(n), same as with lists and hashsets have somewhat more overhead when adding items).

And no, you can't index a set, which would not make sense anyway, because sets aren't ordered. If you add some items, the set won't remember which one was first, and which second etc.

sepp2k
If you only iterate over them then the HashSet method adds quite a bit of memory usage compared to the List.
highphilosopher
+1  A: 

In short - anytime you are tempted to use a Dictionary (or a Dictionary where S is a property of T) then you should consider a HashSet (or HashSet + implementing IEquatable on T which equates on S)

Addys
Unless you care about the key, then you should use the dictionary.
Hardwareguy
+5  A: 

The important thing about a HashSet<T> is right there in the name: it's a set. The only things you can do with a single set are to establish what its members are, and to check to see if an item is a member.

Asking if you can retrieve a single element (e.g. set[45]) is misunderstanding the concept of the set. There's no such thing as the 45th element of a set. Items in a set have no ordering. The sets {1, 2, 3} and {2, 3, 1} are identical in every respect because they have the same membership, and membership is all that matters.

It's somewhat dangerous to iterate over a HashSet<T> because doing so imposes an order on the items in the set. That order is not really a property of the set. You should not rely on it. If ordering of the items in a collection is important to you, that collection isn't a set.

Sets are really limited. On the other hand, they're really fast.

Robert Rossney
+5  A: 

Here's a real example of where I use a HashSet<string>:

Part of my syntax highlighter for UnrealScript files is a new feature that highlights Doxygen-style comments. I need to be able to tell if a @ or \ command is valid to determine whether to show it in gray (valid) or red (invalid). I have a HashSet<string> of all the valid commands, so whenever I hit a @xxx token in the lexer, I use validCommands.Contains(tokenText) as my O(1) validity check. I really don't care about anything except existence of the command in the set of valid commands. Lets look at the alternatives I faced:

  • Dictionary<string, ?>: What type do I use for the value? The value is meaningless since I'm just going to use ContainsKey. Note: Before .NET 3.0 this was the only choice for O(1) lookups - HashSet<T> was added for 3.0 and extended to implement ISet<T> for 4.0.
  • List<string>: If I keep the list sorted, I can use BinarySearch, which is O(log n) (didn't see this fact mentioned above). However, since my list of valid commands, this will never be more appropriate than simply:
  • string[]: Again, Array.BinarySearch gives O(log n) performance. If the list is short, this could be the best performing option. It always has less space overhead than HashSet, Dictionary, or List. Even with BinarySearch, it's not faster for large sets, but for small sets it'd be worth experimenting. Mine has several hundred items though, so I passed on this.
280Z28
+3  A: 

A HashSet<T> implements the ICollection<T> interface:

public interface ICollection<T> : IEnumerable<T>, IEnumerable
{
    // Methods
    void Add(T item);
    void Clear();
    bool Contains(T item);
    void CopyTo(T[] array, int arrayIndex);
    bool Remove(T item);

    // Properties
   int Count { get; }
   bool IsReadOnly { get; }
}

A List<T> implements IList<T>, which extends the ICollection<T>

public interface IList<T> : ICollection<T>
{
    // Methods
    int IndexOf(T item);
    void Insert(int index, T item);
    void RemoveAt(int index);

    // Properties
    T this[int index] { get; set; }
}

A HashSet has set semantics, implemented via a hashtable internally:

A set is a collection that contains no duplicate elements, and whose elements are in no particular order.

What does the HashSet gain, if it loses index/position/list behavior?

Adding and retrieving items from the HashSet is always by the object itself, not via an indexer, and close to an O(1) operation (List is O(1) add, O(1) retrieve by index, O(n) find/remove).

A HashSet's behavior could be compared to using a Dictionary<TKey,TValue> by only adding/removing keys as values, and ignoring dictionary values themselves. You would expect keys in a dictionary not to have duplicate values, and that's the point of the "Set" part.

kek444