I'm trying to use the STL to solve the following problem (I don't want to implement my own data structure if I don't have to). I've come up with a working implementation, but I'm hoping there's something faster, or is the code I have already the best way to do it?
I have a large data set in which each entry contains two items: a key and a size. There are multiple entries with the same key. What I need to know is, for each key: how many entries in the data set have that key, and what is the total size of those entries. For example, given this data set (key, size):
(1, 3)
(3, 27)
(7, 7)
(3, 2)
(1, 1)
I want to generate this output, sorted by total size ascending:
Key 1: Size 4, Count 2
Key 7: Size 7, Count 1
Key 3: Size 29, Count 2
Since the data set is completely unsorted, I first need to aggregate the entries by key to count them and sum up their sizes. Then I need to re-sort that data by total size to produce the final output. This is the code I've come up with to accomplish the task using std::map and std::vector:
#include <tchar.h>
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

struct Node
{
    int Size;
    int Count;

    Node()
        : Size(0), Count(0)
    {
    }

    Node(int size)
        : Size(size), Count(1)
    {
    }
};

// Accumulate one (key, size) entry: bump the count and the running size total.
void map_insert(std::map<int, Node> &map, int key, int size)
{
    std::map<int, Node>::iterator itr = map.find(key);
    if (itr != map.end())
    {
        itr->second.Count++;
        itr->second.Size += size;
    }
    else
    {
        map[key] = Node(size);
    }
}

// Order aggregated entries by total size, ascending.
bool compare(const std::pair<int, Node> &a1, const std::pair<int, Node> &a2)
{
    return a1.second.Size < a2.second.Size;
}

int _tmain(int argc, _TCHAR* argv[])
{
    std::map<int, Node> _map;
    map_insert(_map, 1, 3);
    map_insert(_map, 3, 27);
    map_insert(_map, 7, 7);
    map_insert(_map, 3, 2);
    map_insert(_map, 1, 1);

    // Copy the aggregated (key, Node) pairs into a vector and re-sort by total size.
    std::vector<std::pair<int, Node> > v(_map.begin(), _map.end());
    std::sort(v.begin(), v.end(), compare);
    return 0;
}
Minus the output code, this produces the correct sort. I hate using two separate data structures, but there doesn't seem to be a way to "re-sort" a tree based on a different key. Are there any gross inefficiencies here that I can avoid? Can anyone think of a better way to do this?
Note that I'm assuming that using Node instances (instead of Node pointers) is going to be faster than new'ing and delete'ing every node that's used here. Is this a reasonable assumption or do you think new/delete would be faster than copying these small-ish structs?
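(One small variation I'm also wondering about, sketched below but not benchmarked: map_insert does a find() and then, for a new key, an operator[] assignment, which means two lookups when the key is new. If I understand std::map::insert correctly, the same thing can be done with a single lookup, since insert reports whether the key was already present.)

// Sketch only (not benchmarked): a single-lookup version of map_insert.
// std::map::insert is a no-op when the key already exists and returns a
// pair<iterator, bool> that says whether the insertion actually happened.
void map_insert_one_lookup(std::map<int, Node> &map, int key, int size)
{
    std::pair<std::map<int, Node>::iterator, bool> res =
        map.insert(std::make_pair(key, Node(size)));
    if (!res.second)
    {
        // Key was already present: fold this entry into the existing node.
        res.first->second.Count++;
        res.first->second.Size += size;
    }
}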
Edit: Interesting, I hadn't known about std::multimap, but using the implementation provided below (thanks Naveen), it looks like the multimap approach performs worse. (Note that my goal here is speed; memory is not an issue. I should have pointed that out.) Using this implementation:
#include <tchar.h>
#include <algorithm>
#include <ctime>
#include <iostream>
#include <map>
#include <utility>
#include <vector>

// Simple timer based on std::clock.
class Timer
{
public:
    Timer()
        : mStart(0)
    {
    }

    void Start()
    {
        mStart = std::clock();
    }

    // Returns seconds since the last Start()/Mark() and resets the reference point.
    double Mark()
    {
        std::clock_t curr = std::clock();
        double f = (curr - mStart) / ((double)CLOCKS_PER_SEC);
        mStart = curr;
        return f;
    }

private:
    std::clock_t mStart;
};
struct Node
{
    int Size;
    int Count;

    Node()
        : Size(0), Count(0)
    {
    }

    Node(int size)
        : Size(size), Count(1)
    {
    }
};

// Accumulate one (key, size) entry: bump the count and the running size total.
void map_insert(std::map<int, Node> &map, int key, int size)
{
    std::map<int, Node>::iterator itr = map.find(key);
    if (itr != map.end())
    {
        itr->second.Count++;
        itr->second.Size += size;
    }
    else
    {
        map[key] = Node(size);
    }
}

// Order aggregated entries by total size, ascending.
bool compare(const std::pair<int, Node> &a1, const std::pair<int, Node> &a2)
{
    return a1.second.Size < a2.second.Size;
}

// Deterministic pseudo-data generators for the benchmark.
int make_size(int i, int size_max)
{
    return (7 * i) % size_max;
}

int make_key(int i, int key_max)
{
    return (11 * i) % key_max;
}
void first_impl(int max, int size_max, int key_max)
{
    std::cout << "first_impl:" << std::endl;
    double total = 0;
    double curr = 0;
    Timer t;
    t.Start();
    {
        // Phase 1: aggregate counts and total sizes per key in a std::map.
        std::map<int, Node> _map;
        for (int i = 0; i < max; ++i)
            map_insert(_map, make_key(i, key_max), make_size(i, size_max));
        total += curr = t.Mark();
        std::cout << "\tinsert: " << curr << std::endl;

        // Phase 2: copy into a vector and re-sort by total size.
        std::vector<std::pair<int, Node> > v(_map.begin(), _map.end());
        total += curr = t.Mark();
        std::cout << "\tcreate: " << curr << std::endl;

        std::sort(v.begin(), v.end(), compare);
        total += curr = t.Mark();
        std::cout << "\tsort: " << curr << std::endl;
    }
    total += curr = t.Mark();
    std::cout << "\tcleanup: " << curr << std::endl;
    std::cout << "\ttotal: " << total << std::endl;
}
void second_impl(int max, int size_max, int key_max)
{
    std::cout << "second_impl:" << std::endl;
    double total = 0;
    double curr = 0;
    Timer t;
    t.Start();
    {
        std::map<int, Node> res;
        typedef std::multimap<int, int> MultiMap;
        MultiMap mMap;

        // Phase 1: store every (key, size) entry individually in a multimap.
        for (int i = 0; i < max; ++i)
            mMap.insert(std::make_pair(make_key(i, key_max), make_size(i, size_max)));
        total += curr = t.Mark();
        std::cout << "\tinsert: " << curr << std::endl;

        // Phase 2: walk the multimap and aggregate each key's entries, building
        // res keyed by total size (Node.Size holds the original key here).
        std::multimap<int, int>::iterator iter = mMap.begin();
        std::multimap<int, int>::iterator endIter = mMap.end();
        for (; iter != endIter; ++iter)
        {
            int val = iter->first;
            if (res.find(val) != res.end())
            {
                continue;
            }
            std::pair<MultiMap::iterator, MultiMap::iterator> iterPair = mMap.equal_range(val);
            Node n;
            n.Size = val;
            n.Count = mMap.count(val);
            int size = 0;
            for (; iterPair.first != iterPair.second; ++iterPair.first)
            {
                size += iterPair.first->second;
            }
            res[size] = n;
        }
        total += curr = t.Mark();
        std::cout << "\tsort: " << curr << std::endl;
    }
    total += curr = t.Mark();
    std::cout << "\tcleanup: " << curr << std::endl;
    std::cout << "\ttotal: " << total << std::endl;
}
int _tmain(int argc, _TCHAR* argv[])
{
    const int size_max = 31;
    const int key_max = 1019;
    const int max = 1000000;
    first_impl(max, size_max, key_max);
    second_impl(max, size_max, key_max);
    return 0;
}
Results look something like this:
first_impl:
    insert: 0.094
    create: 0
    sort: 0
    cleanup: 0
    total: 0.094
second_impl:
    insert: 1.653
    sort: 46.894
    cleanup: 66.081
    total: 114.628
The second implementation is obviously slower. It looks like the fact that the number of unique keys is far lower than the number of items (around 1,000 unique keys is representative of my data set) makes std::map the winner here, because the map quickly reaches a stable size where no new nodes need to be allocated. I completely missed this before doing this second investigation.
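Given that, a hash-based container might be worth a look as well, since it would exploit the same small set of unique keys. Here's a rough sketch I have not benchmarked, assuming std::unordered_map (TR1/C++11) is available; the aggregation and the final re-sort are otherwise the same as in first_impl:

// Rough sketch, not benchmarked: the same aggregation with a hash-based map,
// assuming std::unordered_map (TR1/C++11) is available. Reuses Node, compare,
// make_key, and make_size from above.
#include <unordered_map>

void hash_impl(int max, int size_max, int key_max)
{
    std::unordered_map<int, Node> map;
    for (int i = 0; i < max; ++i)
    {
        // operator[] default-constructs a Node (Size 0, Count 0) on first use.
        Node &n = map[make_key(i, key_max)];
        n.Count++;
        n.Size += make_size(i, size_max);
    }

    // Re-sort by total size, exactly as in first_impl.
    std::vector<std::pair<int, Node> > v(map.begin(), map.end());
    std::sort(v.begin(), v.end(), compare);
}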
It looks like my original implementation is better than multimap, and since I'm unwilling to take a dependency on Boost, I think I have my answer. Thanks all!