views: 624
answers: 8

I'm trying to use the STL to solve the following problem (I don't want to implement my own data structure if I don't have to). I've come up with a working implementation, but I'm hoping there's something faster... or is the code I have the best way to do it?

I have a large data set in which each entry contains two items: a key and a size. There are multiple entries in the data set with the same key. What I need to know is: for each key, how many entries in the data set have that key, and what is the total size of those entries? For example, given this data set (key, size):

(1, 3)
(3, 27)
(7, 7)
(3, 2)
(1, 1)

I want to generate this output, sorted by size ascending:

Key 1:  Size 4, Count 2
Key 7:  Size 7, Count 1
Key 3:  Size 29, Count 2

Since the data set is completely unsorted, I first need to aggregate the keys to count them and sum up the sizes. Then I need to re-sort that data structure by total size to produce the final output. This is the code I've come up with to accomplish the task using std::map and std::vector:

struct Node
{
    int Size;
    int Count;

    Node()
        : Size(0), Count(0)
    {
    }

    Node(int size)
        : Size(size), Count(1)
    {
    }
};

void map_insert(std::map<int, Node> &map, int key, int size)
{
    std::map<int, Node>::iterator itr = map.find(key);

    if (itr != map.end())
    {
        itr->second.Count++;
        itr->second.Size += size;
    }
    else
    {
        map[key] = Node(size);
    }
}

bool compare(const std::pair<int, Node> &a1, const std::pair<int, Node> &a2)
{
    return a1.second.Size < a2.second.Size;
}

int _tmain(int argc, _TCHAR* argv[])
{
    std::map<int, Node> _map;

    map_insert(_map, 1, 3);
    map_insert(_map, 3, 27);
    map_insert(_map, 7, 7);
    map_insert(_map, 3, 2);
    map_insert(_map, 1, 1);

    std::vector<std::pair<int, Node>> v(_map.begin(), _map.end());
    std::sort(v.begin(), v.end(), compare);

    return 0;
}

Minus the output code, this produces the correct sort. I hate using two separate data structures, but there doesn't seem to be a way to "re-sort" a tree based on a different key. Are there any gross inefficiencies here that I can avoid? Can anyone think of a better way to do this?

Note that I'm assuming that using Node instances (instead of Node pointers) is going to be faster than new'ing and delete'ing every node that's used here. Is this a reasonable assumption, or do you think new/delete would be faster than copying these small-ish structs?

Edit: Interesting. I had never known about multimap, but using the implementation provided below (thanks, Naveen), it looks like multimap performs worse. (Note that my intent here was a fast implementation; memory is not an issue. I should have pointed that out.) Using this implementation:

class Timer
{
public:
    Timer()
        : mStart(0)
    {
    }

    void Start()
    {
        mStart = std::clock();
    }

    double Mark()
    {
        std::clock_t curr = std::clock();
        double f = (curr - mStart)/((double)CLOCKS_PER_SEC);
        mStart = curr;
        return f;
    }

private:
    std::clock_t mStart;
};

struct Node
{
    int Size;
    int Count;

    Node()
        : Size(0), Count(0)
    {
    }

    Node(int size)
        : Size(size), Count(1)
    {
    }
};

void map_insert(std::map<int, Node> &map, int key, int size)
{
    std::map<int, Node>::iterator itr = map.find(key);

    if (itr != map.end())
    {
        itr->second.Count++;
        itr->second.Size += size;
    }
    else
    {
        map[key] = Node(size);
    }
}

bool compare(const std::pair<int, Node> &a1, const std::pair<int, Node> &a2)
{
    return a1.second.Size < a2.second.Size;
}

int make_size(int i, int size_max)
{
    return (7 * i) % size_max;
}

int make_key(int i, int key_max)
{
    return (11 * i) % key_max;
}

void first_impl(int max, int size_max, int key_max)
{
    std::cout << "first_impl:" << std::endl;
    double total = 0;
    double curr = 0;
    Timer t;
    t.Start();

    {
        std::map<int, Node> _map;

        for (int i = 0; i < max; ++i)
            map_insert(_map, make_key(i, key_max), make_size(i, size_max));

        total += curr = t.Mark();
        std::cout << "\tinsert: " << curr << std::endl;

        std::vector<std::pair<int, Node>> v(_map.begin(), _map.end());

        total += curr = t.Mark();
        std::cout << "\tcreate: " << curr << std::endl;

        std::sort(v.begin(), v.end(), compare);

        total += curr = t.Mark();
        std::cout << "\tsort: " << curr << std::endl;
    }

    total += curr = t.Mark();
    std::cout << "\tcleanup: " << curr << std::endl;

    std::cout << "\ttotal: " << total << std::endl;
}

void second_impl(int max, int size_max, int key_max)
{
    std::cout << "second_impl:" << std::endl;
    double total = 0;
    double curr = 0;
    Timer t;
    t.Start();

    {
        std::map<int, Node> res;
        typedef std::multimap<int, int> MultiMap;
        MultiMap mMap;

        for (int i = 0; i < max; ++i)
            mMap.insert(std::make_pair(make_key(i, key_max), make_size(i, size_max)));

        total += curr = t.Mark();
        std::cout << "\tinsert: " << curr << std::endl;

        std::multimap<int, int>::iterator iter = mMap.begin();
        std::multimap<int, int>::iterator endIter = mMap.end();
        for(; iter != endIter; ++iter)
        {
            int val = iter->first;
            if (res.find(val) != res.end())
            {
                continue;
            }

            std::pair<MultiMap::iterator, MultiMap::iterator> iterPair = mMap.equal_range(val);
            Node n;
            n.Size = val;  // note: stores the key here, mirroring n.val in Naveen's version below
            n.Count = mMap.count(val);

            int size = 0;
            for (; iterPair.first != iterPair.second; ++iterPair.first)
            {
                size += iterPair.first->second;
            }

            res[size] = n;
        }
        total += curr = t.Mark();
        std::cout << "\tsort: " << curr << std::endl;
    }

    total += curr = t.Mark();
    std::cout << "\tcleanup: " << curr << std::endl;

    std::cout << "\ttotal: " << total << std::endl;
}

int _tmain(int argc, _TCHAR* argv[])
{
    const int size_max = 31;
    const int key_max = 1019;
    const int max = 1000000;
    first_impl(max, size_max, key_max);
    second_impl(max, size_max, key_max);

    return 0;
}

Results look something like this:

first_impl:
        insert: 0.094
        create: 0
        sort: 0
        cleanup: 0
        total: 0.094
second_impl:
        insert: 1.653
        sort: 46.894
        cleanup: 66.081
        total: 114.628

The second implementation is obviously slower. It looks like the fact that the total number of unique keys is far lower than the total number of entries (around 1000 unique keys is representative of my data set) makes std::map the winner here, because it quickly reaches a stable state where no more nodes need to be allocated. I completely missed this fact before I did this secondary investigation.

It looks like my original implementation is better than multimap, and since I'm unwilling to take a dependency on Boost, I think I have my answer. Thanks all!

+2  A: 

Depending on what complexity you need, you have two options.

The first is to use a multimap container to hold the values and then iteratively use equal_range to generate the output. This gives you fast inserts but slow output.

The second option is to use boost::multi_index with member functions as indices, calculating the sum and count values on insert. This gives you slow inserts but fast output.
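
A rough sketch of that second option (illustrative only: the Entry struct, the index order, and the add_entry helper are made-up names for this sketch, not anything prescribed by Boost):

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/member.hpp>

struct Entry
{
    int Key;
    int Size;   // running total size for this key
    int Count;  // number of entries seen for this key
};

namespace bmi = boost::multi_index;

// One container, two ordered views: unique by Key, non-unique by Size.
typedef bmi::multi_index_container<
    Entry,
    bmi::indexed_by<
        bmi::ordered_unique< bmi::member<Entry, int, &Entry::Key> >,
        bmi::ordered_non_unique< bmi::member<Entry, int, &Entry::Size> >
    >
> EntrySet;

void add_entry(EntrySet &set, int key, int size)
{
    EntrySet::nth_index<0>::type &byKey = set.get<0>();
    EntrySet::nth_index<0>::type::iterator it = byKey.find(key);

    if (it == byKey.end())
    {
        Entry e = { key, size, 1 };
        set.insert(e);
    }
    else
    {
        Entry e = *it;
        e.Size += size;
        ++e.Count;
        byKey.replace(it, e);  // the Size index re-orders itself automatically
    }
}

Iterating set.get<1>() then walks the entries in ascending total-size order with no separate sort step.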

Kirill V. Lyadvinsky
+2  A: 

Following is sample code that implements this using std::multimap and std::map, together with the equal_range and count methods of these classes.

std::map<int, Node> res;
typedef std::multimap<int, int> MultiMap;
MultiMap mMap;
mMap.insert(std::make_pair(1, 3));
mMap.insert(std::make_pair(3, 27));
mMap.insert(std::make_pair(7, 7));
mMap.insert(std::make_pair(3, 2));
mMap.insert(std::make_pair(1, 1));

std::multimap<int, int>::iterator iter = mMap.begin();
std::multimap<int, int>::iterator endIter = mMap.end();
while (iter != endIter)
{
    int val = iter->first;
    std::pair<MultiMap::iterator, MultiMap::iterator> iterPair = mMap.equal_range(val);
    Node n;
    n.val = val;
    n.count = mMap.count(val);

    int size = 0;
    for (; iterPair.first != iterPair.second; ++iterPair.first)
    {
        size += iterPair.first->second;
    }

    res[size] = n;

    iter = iterPair.second;
}

Node is defined as:

struct Node
{
    int val;
    int count;

    Node() : val(0), count(0)
    {
    }
};

Note that the key for the result map is size and not count.

Naveen
Thank you for the code. This really helped narrow down the fact that map/vector/sort performs better in my specific case. I'd upvote you, but I don't have the rep. =/
LCC
A: 

One thing to note: in VS2008, at least, when you assign map[key] = Node(size);, you actually end up constructing three separate Node instances. The upshot is that only the one you declare gets created on the stack; the other two are created on the heap. So by using this version, you're actually incurring twice the overhead you would if you used a pointer and assumed responsibility for deleting all your instances at the end.
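
If that overhead matters, one way to sidestep the default-construct-then-assign pattern is to use insert(), which also tells you whether the key was new (a sketch; the temporary pair is still copied into the container, but nothing gets default-constructed and then assigned over):

void map_insert(std::map<int, Node> &map, int key, int size)
{
    // insert() returns (iterator, bool); the bool is false if the key existed.
    std::pair<std::map<int, Node>::iterator, bool> r =
        map.insert(std::make_pair(key, Node(size)));

    if (!r.second)  // key was already present: aggregate into it
    {
        r.first->second.Count++;
        r.first->second.Size += size;
    }
}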

Dathan
+6  A: 

A multimap<> may help you.

There are multiple entries in the data set with the same key.
multimap<> can handle duplicate keys, maps cannot.

How many of each key are there in the data set
multimap<>::count() takes a key and returns the number of matching elements.

For each key what is the total size of them
multimap<>::equal_range() takes a key and returns a std::pair< multimap<>::iterator, multimap<>::iterator >, where the first iterator points to the first element matching the key and the second points one past the last. They can be iterated as though they were begin and end. So using those, it would be a simple loop to calculate the total size for each key.
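
For example (a sketch, where data is assumed to be a std::multimap<int, int> mapping key to size):

typedef std::multimap<int, int> MMap;

std::pair<MMap::iterator, MMap::iterator> range = data.equal_range(key);
int totalSize = 0;
for (MMap::iterator it = range.first; it != range.second; ++it)
    totalSize += it->second;  // accumulate the sizes stored under this key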

Obviously, it doesn't entirely suit your needs, and if you are going to be operating on large data sets, perhaps you would gain some valuable performance by implementing a custom container. Good luck!

0xC0DEFACE
Using Naveen's code, it looks as if map/vector/sort is faster because my total number of unique keys is low compared to the total number of entries in the data set. Testing this against a multimap implementation gave me more confidence that the original solution I came up with should be good enough for my purposes. Thanks for the info!
LCC
+2  A: 

If you can use Boost, you can make use of Boost.MultiIndex. It allows you to have a container with two ordered indices (in your example, an index by key and an index by size). As for memory efficiency, it is said that node compression was implemented for Boost.MultiIndex's ordered indices, with the result that:

Size of ordered indices node headers have been reduced by 25% on most platforms

Also take a look at this example and its results: Results for 2 ordered indices. So even when you simply use boost::multi_index with ordered indices, it uses less memory than std::multiset from MS VS 8.0 or gcc.

As for your solution, I think you can expect that Boost.MultiIndex will use less memory than your implementation. However, if you want to compare the two solutions exactly, you can do this: write your own counting allocator, add it to your containers, and find out how much memory has been used. Then do the same thing using Boost.MultiIndex with your counting allocator. This is an example of an allocator; you need to modify it slightly in order to count the number of bytes that have been allocated and deallocated.
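
A minimal counting-allocator sketch along those lines (C++03-style to match the rest of this thread; CountingAllocator and its static byte counters are illustrative names, not a standard facility):

#include <cstddef>
#include <memory>

template <typename T>
struct CountingAllocator : std::allocator<T>
{
    static std::size_t allocated;    // total bytes handed out
    static std::size_t deallocated;  // total bytes given back

    template <typename U>
    struct rebind { typedef CountingAllocator<U> other; };

    CountingAllocator() {}
    template <typename U>
    CountingAllocator(const CountingAllocator<U> &) {}

    T *allocate(std::size_t n, const void *hint = 0)
    {
        allocated += n * sizeof(T);
        return std::allocator<T>::allocate(n, hint);
    }

    void deallocate(T *p, std::size_t n)
    {
        deallocated += n * sizeof(T);
        std::allocator<T>::deallocate(p, n);
    }
};

template <typename T> std::size_t CountingAllocator<T>::allocated = 0;
template <typename T> std::size_t CountingAllocator<T>::deallocated = 0;

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U>
bool operator==(const CountingAllocator<T> &, const CountingAllocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const CountingAllocator<T> &, const CountingAllocator<U> &) { return false; }

Plugging it in looks like std::map<int, Node, std::less<int>, CountingAllocator<std::pair<const int, Node> > >; the map rebinds the allocator to its internal node type, so allocated - deallocated approximates the container's live memory.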

skwllsp
+2  A: 

std::map is an associative container, so the map will be kept in sorted order with respect to the key. And since you are using duplicate keys here, a multimap will solve your purpose.

Vivek
A: 

std::map<Key, Data, Compare, Alloc> has a Compare parameter; just supply a comparator that defines a strict weak ordering there.

struct Node
{
    int Size;
    int Count;
};

struct CompareNode
{
    bool operator()(const Node& a, const Node& b) const
    {
        return a.Size < b.Size;
    }
};

std::map<Node, std::string, CompareNode> xxx;
J-16 SDiZ
+1  A: 

Here is a small improvement to your solution: it is shorter and avoids the second key lookup when inserting a new key (note that it relies on the zero-initialization in your Node default constructor):

void map_insert(std::map<int, Node> &map, int key, int size)
{
    Node &n = map[key];  // default-constructs a zeroed Node on first access
    ++n.Count;
    n.Size += size;
}

But the optimal way probably depends on the range of your keys. If it's always small (say, 1..1000), a simple vector is the best choice. If it's bigger, a hash_map gives better results, because you don't seem to need the key ordering that map provides.

I tested it, and it seems to give a noticeable improvement for your ~1000-key case, but it also depends on your key distribution. You just need to replace std::map with std::hash_map and then fix up the header includes. However, std::hash_map may have some portability problems. You could still write your own hashing system, though (and even adapt it to your key distribution).

EDIT: unordered_map seems to be the future standard replacement for hash_map. At least, it fixes the deprecation warning on gcc 4.3.
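
The insert then looks the same with the hash-based container (a sketch; on gcc 4.3 the class lives in std::tr1 via <tr1/unordered_map>, while C++11 provides std::unordered_map in <unordered_map>):

#include <unordered_map>  // pre-C++11: <tr1/unordered_map> and std::tr1::unordered_map

void map_insert(std::unordered_map<int, Node> &map, int key, int size)
{
    Node &n = map[key];  // value-initializes a zeroed Node on first access
    ++n.Count;
    n.Size += size;
}

The vector copy and std::sort steps stay exactly as before; the hash map's iteration order is unspecified, but you were re-sorting anyway.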

Alink
Ah, good call. I missed the better insertion pattern and hash_map/unordered_map. Thanks!
LCC