tags:

views:

194

answers:

5

In an application I will have between about 3000 and 30000 strings. After creation (read from files unordered) there will not be many strings that will be added often (but there WILL be sometimes!). Deletion of strings will also not happen often. Comparing a string with the ones stored will occur frequently.

What kind of structure can I use best, a hashtable, a tree (Red-Black, Splay,....) or just on ordered list (maybe a StringArray?) ?

(Additional remark : a link to a good C# implementation would be appreciated as well)

+4  A: 

HashSet is very good for fast insertion and search speeds. Add, Remove and Contains are O(1).

Edit- Add assumes the array does not need to be resized. If that's the case as Noldorin has stated it is O(n).

I used HashSet on a recent VB 6 (I didn't write it) to .NET 3.5 upgrade project where I was iterating round a collection that had child items and each child item could appear in more than one parent item. The application processed a list of items I wanted to send to an API that charges a lot of money per call.

I basically used the HashSet to keep track items I'd already sent to prevent us incurring an unnecessary charge. As the process was invoked several times (it is basically a batch job with multiple commands), I serialized the HashSet between invocations. This worked very well- I had a requirement to reuse as much as the existing code as possible as this had been thoroughly tested. The HashSet certainly performed very fast.

RichardOD
+7  A: 

It sounds like you simply need a hashtable. The HashSet<T> would thus seem to be the ideal choice. (You don't seem to require keys, but Dictionary<T> would be the right option if you did, of course.)

Here's a summary of the time complexities of the different operations on a HashSet<T> of size n. They're partially based off the fact that the type uses an array as the backing data structure.

  • Insertion: Typically O(1), but potentially O(n) if the array needs to be resized.
  • Deletion: O(1)
  • Exists (Contains): O(1) (given ideal hashtable buckets)

Someone correct me if any of these are wrong please. They are just my best guesses from what I know of the implementation/hashtables in general.

Noldorin
Thanks (both RichardOD and Noldorin)
SoftwareTester
+1  A: 

The answers recommending HashSet<T> are spot on if your comparisons are just "is this string present in the set or not". You could even use different IEqualityComparer<string> implementations (probably choosing from the ones in StringComparer) for case-sensitivity etc.

Is this the only type of comparison you need, or do you need things like "where would this string appear in the set if it were actually an ordered list?" If you need that sort of check, then you'll probably want to do a binary search. (List<T> provides a BinarySearch method; I don't know why SortedList and SortedDictionary don't, as both would be able to search pretty easily. Admittedly a SortedDictionary search wouldn't be quite the same as a normal binary search, but it would still usually have similar characteristics I believe.)

As I say, if you only want "in the set or not" checking, the HashSet<T> is your friend. I just thought I'd bring up the rest in case :)

Jon Skeet
SortedList and SortedDictionary use hashtables internally, so why would you want a binary search? Hashtable lookup is (ideally) O(1), as opposed to the O(log n) a binary search offers. Maybe I've misunderstood you slightly?
Noldorin
SortedList and SortedDictionary *don't* use hashtables. (I mean the generic types, of course. I don't know about the non-generic SortedList, but I'd expect it to be the same.) SortedList is just an array of keys and an array of values, and it makes sure they stay in order. SortedDictionary is a binary search tree. Don't get confused by the fact that they implement IDictionary<TKey, TValue>. The point about hashtable lookup is that it *only* gives you present/absent information. A binary search offers the "would-be" position of an absent element. See List<T>.BinarySearch's return value.
Jon Skeet
Yeah, you're right, of course. I was getting muddled with Dictionary<T> for some reason. By the sound of the question, however, the OP isn't looking for the index of any items, though this is still presuming.
Noldorin
On this subject, I'm thinking it really would have made more sense to name Dictionary<T> as Hashtable<T> instead. I am aware that there are some (relatively) subtle difference between the current non-generic Hashtable and generic Dictionary, but the naming convention would nonetheless seem better.
Noldorin
I'm not particularly bothered by Dictionary, but "SortedList" annoys me because it sounds like a list when it's really a map (dictionary) in terms of interface. The problem is that you want to convey two separate ideas: what you can do with it, and how it's implemented. In collections the implementation can make a big difference, so it's worth including in the name - but its overall interface should be important too!
Jon Skeet
I think MS has simply used a naming convention whereby the average developer will recognise the function of the type pretty much by the name. This would seem to be the case with both Dictionary and SortedList. Neither of those names especially bother me, though I'm not overly keen on the inconsistency between Dictionary<T> and Hashtable. (It's certainly cause some confusion in many people.) This is essentially akin to the debate over the value of implementation relative to interface.
Noldorin
How would anyone recognise that a SortedList is actually a map? To me, it sounds like a list of values which makes sure it stays sorted.
Jon Skeet
@Jon: But it can effectively be used for precisely that purpose.
Noldorin
@Noldorin: You have to explicitly not care about the values though. You can't just call Add(foo) (without making foo a Key/Value pair.
Jon Skeet
+1  A: 

If you need to know "where would this string appear in the set if it were actually an ordered list" (as in Jon Skeet's answer), you could consider a trie. This solution can only be used for certain types of "string-like" data, and if the "alphabet" is large compared to the number of strings it can quickly lose its advantages. Cache locality could also be a problem.

This could be over-engineered for a set of only N = 30,000 things that is largely precomputed, however. You might even do better just allocating an array of k * N Optional and filling it by skipping k spaces between each actual thing (thus reducing the probability that your rare insertions will require reallocation, still leaving you with a variant of binary search, and keeping your items in sorted order. If you need precise "where would this string appear in the set", though, this wouldn't work because you would need O(n) time to examine each space before the item checking if it was blank or O(n) time on insert to update a "how many items are really before me" counter in each slot. It could provide you with very fast imprecise indexes, though, and those indexes would be stable between insertions/deletions.

Doug McClean
+2  A: 

If you're looking for real-time performance or optimal memory efficiency I'd recommend a radix tree or explicit suffix or prefix tree. Otherwise I'd probably use a hash.

Trees have the advantage of having fixed bounds on worst case lookup, insertion and deletion times (based on the length of the pattern you're looking up). Hash based solutions have the advantage of being a whole lot easier to code (you get these out of the box in C#), cheaper to construct initially and if properly configured have similar average-case performance. However, they do tend to use more memory and have non-deterministic time lookups, insertions (and depending on the implementation possibly deletions).

patros