I have a list of files (stored as c style strings) that I will be performing a search on and I will remove those files that do not match my parameters. What is the best container to use for this purpose? I'm thinking Set as of now. Note the list of files will never be larger than when it is initialized. I'll only be deleting from the container.
I will start by tossing out vector since it is a sequential container. Set, I believe is close to being sequential or hashed. I would avoid that. A doublely-linked list, the stl list is one of these, has two pointers and the node. Basically, to remove an item, it breaks the chain then rejoins the two parts with the pointers.
Elements in a std::set must be unique, so unless the filenames are globally unique this won't suit your needs.
I would probably recommend a std::list.
From SGI:
A
vector
is a Sequence that supports random access to elements, constant time insertion and removal of elements at the end, and linear time insertion and removal of elements at the beginning or in the middle.A
list
is a doubly linked list. That is, it is a Sequence that supports both forward and backward traversal, and (amortized) constant time insertion and removal of elements at the beginning or the end, or in the middle.An
slist
is a singly linked list: a list where each element is linked to the next element, but not to the previous element. That is, it is a Sequence that supports forward but not backward traversal, and (amortized) constant time insertion and removal of elements.Set
is a Sorted Associative Container that stores objects of type Key.Set
is a Simple Associative Container, meaning that its value type, as well as its key type, is Key. It is also a Unique Associative Container, meaning that no two elements are the same.Multiset
is a Sorted Associative Container that stores objects of type Key.Multiset
is a Simple Associative Container, meaning that its value type, as well as its key type, is Key. It is also a Multiple Associative Container, meaning that two or more elements may be identical.Hash_set
is a Hashed Associative Container that stores objects of type Key.Hash_set
is a Simple Associative Container, meaning that its value type, as well as its key type, is Key. It is also a Unique Associative Container, meaning that no two elements compare equal using the Binary Predicate EqualKey.Hash_multiset
is a Hashed Associative Container that stores objects of type Key.Hash_multiset
is a simple associative container, meaning that its value type, as well as its key type, is Key. It is also a Multiple Associative Container, meaning that two or more elements may compare equal using the Binary Predicate EqualKey.
(Some containers have been omitted.)
I would go with hash_set
if all you care to have is a container which is fast and doesn't contain multiple identical keys. hash_multiset
if you do, set
or multiset
if you want the strings to be sorted, or list
or slist
if you want the strings to keep their insertion order.
After you've built your list/set, use remove_if
to filter out your items based on your criteria.
Assuming your search criteria does not depend on the filename (ie. you search for content, file sizes etc.), so you cannot make use of a set, I'd go with a list
. It will take you O(N) to construct the whole list, and O(1) per one delete.
If you wanted to make it even faster, and didn't insist on using ready-made STL containers, I would:
- use a
vector
- delete using false delete, ie. marking an item as deleted
- when the ratio of deleted/all items raises above certain threshold, I would filter the items with
remove_if
This should give you the best space/time/cache performance. (Although you should profile it to be sure)
You can use two lists/vectors/whatever:
using namespace std;
vector<const char *> files;
files.push_back("foo.bat");
files.push_back("bar.txt");
vector<const char *> good_files; // Maybe reserve elements given files.size()?
for(vector<const char *>::const_iterator i = files.begin(); i != files.end(); ++i) {
if(file_is_good(*i)) {
new_files.push_back(*i);
}
}
I would definitely not use a set - you don't need to sort it so no point in using a set. Set is implemented as a self-balancing tree usually, and the self-balancing algorithm is unnecessary in your case.
If you're going to be doing this operation once, I would use a std::vector with remove_if (from <algorithm>), followed by an erase. If you haven't used remove_if before, what it does is go through and shifts all the relevant items down, overwriting the irrelevant ones in the process. You have to follow it with an erase to reduce the size of the vector. Like so:
std::vector<const char*> files;
files.erase(remove_if(files.begin(), files.end(), RemovePredicate()), files.end());
Writing the code to do the same thing with a std::list would be a little bit more difficult if you wanted to take advantage of its O(1) deletion time property. Seeing as you're just doing this one-off operation which will probably take so little time you won't even notice it, I'd recommend doing this as it's the easiest way.
And to be honest, I don't think you'll see that much difference in terms of speed between the std::list and std::vector approaches. The vector approach only copies each value once so it's actually quite fast, yet takes much less space. In my opinion, going up to a std::list and using three times the space is only really justified if you're doing a lot of addition and deletion throughout the entire application's lifetime.