
The following code iterates through many data rows, calculates a score for each row, and then sorts the rows according to that score:

unsigned count = 0;
score_pair* scores = new score_pair[num_rows];
while ((row = data.next_row())) {
    float score = calc_score(data.next_feature());
    scores[count].score = score;
    scores[count].doc_id = row->docid;
    count++;
}

assert(count <= num_rows);
qsort(scores, count, sizeof(score_pair), score_cmp);

Unfortunately, there are many duplicate rows with the same docid but a different score. Now I'd like to keep only the last score for any docid. The docids are unsigned ints, but usually big (=> no lookup array) - using a HashMap to look up the last count for a docid would probably be too slow (many millions of rows; it should take only seconds, not minutes...).

Ok, I modified my code to use a std::map:

map<int, int> docid_lookup;
unsigned count = 0; 
score_pair* scores = new score_pair[num_rows];
while ((row = data.next_row())) {
    float score = calc_score(data.next_feature());

    map<int, int>::iterator iter;
    iter = docid_lookup.find(row->docid);
    if (iter != docid_lookup.end()) {
        scores[iter->second].score = score;
        scores[iter->second].doc_id = row->docid;
    } else {
        scores[count].score = score;
        scores[count].doc_id = row->docid;
        docid_lookup[row->docid] = count;
        count++;
    }
}

It works, and the performance hit is not as bad as I expected - now it runs about a minute instead of 16 seconds, so it's slower by roughly a factor of four. Memory usage has also gone up from about 1Gb to 4Gb.

+2  A: 

I'd go for a std::map of docids. If you could create an appropriate hashing function, a hash map would be preferable, but I guess that's too difficult. And no - the std::map is not too slow. Access is O(log n), which is nearly as good as O(1); O(1) is array access time (and hash map access, btw).

Btw, if std::map is too slow, qsort's O(n log n) is too slow as well. And, by using a std::map and iterating over its contents, you can perhaps save your qsort.


Some additions for the comment (by onebyone):

  • I did not go into the implementation details, since there wasn't enough information for that.
  • qsort may behave badly with sorted data (depending on the implementation); std::map may not. This is a real advantage, especially if you read the values from a database that might output them ordered by key.
  • There was no word on the memory allocation strategy. Changing to a memory allocator with fast allocation of small objects may improve the performance.
  • Still - the fastest would be a hash map with an appropriate hash function. Since there's not enough information about the distribution of the keys, presenting one in this answer is not possible.

In short - if you ask general questions, you get general answers. That means - at least for me - looking at the time complexity in big-O notation. Still, you were right: depending on different factors, the std::map may be too slow while qsort is still fast enough - and it may also be the other way round in qsort's worst case, where it has O(n^2) complexity.

Tobias Langner
"if std::map is too slow, qsort O(n log n) is too slow as well". That's something to benchmark. I have always found sorting by insertion into a std::map to be quite slow by comparison with array-based sorts, presumably due to the overhead of allocating tree nodes. So while it's all n log(n), I would never rule out the possibility that for some purposes `std::map` is too slow but some other sort/uniquifying algorithm might be fast enough.
Steve Jessop
Comments on the additions. std::map is likely to behave relatively badly on already-sorted data, since IIRC always adding to the far right of a red-black tree provokes the max number of rebalancing ops. qsort won't be best on sorted data, but will not be atrocious unless the implementation is very naive (in which case get a better one). Introsort avoids the N^2 worst case: std::sort and qsort really ought not to be just a naive quicksort. However, stable_sort is required to be worst-case O(n((log N)^2)), so by your criteria is faster than qsort ;-)
Steve Jessop
On reflection, actually I think qsort will be best on already-sorted data, provided that it picks its pivot by a median-of-3 of the first, last, and middle element in the range to be partitioned. This is the very first optimisation that anyone ever makes to a basic quicksort algorithm. So although a qsort implementation is permitted by the spec to be this bad, it really would be criminally poor. It's *allowed* to be O(N^2) worst case according to the spec, but that doesn't mean you code on the assumption that it will be...
Steve Jessop
Seems like qsort and std::sort are both amazingly fast. The scores are not ordered (I would not calc them if they were), but near-random if I have to guess (which is good for quicksort). I think allocating memory and scanning through it (memory bandwidth) is the bottleneck in my use case.
ephes
+3  A: 

The first thing I'd try would be a map or unordered_map: I'd be surprised if performance is a factor of 60 slower than what you did without any unique-ification. If the performance there isn't acceptable, another option is something like this:

// get the computed data into a vector
std::vector<score_pair>::size_type count = 0;
std::vector<score_pair> scores;
scores.reserve(num_rows);
while ((row = data.next_row())) {
    float score = calc_score(data.next_feature());
    scores.push_back(score_pair(score, row->docid));
}

assert(scores.size() <= num_rows);

// remove duplicate doc_ids
std::reverse(scores.begin(), scores.end());
std::stable_sort(scores.begin(), scores.end(), docid_cmp);
scores.erase(
    std::unique(scores.begin(), scores.end(), docid_eq),
    scores.end()
);

// order by score
std::sort(scores.begin(), scores.end(), score_cmp);

Note that the use of reverse and stable_sort is because you want the last score for each doc_id, but std::unique keeps the first. If you wanted the first score you could just use stable_sort, and if you didn't care what score, you could just use sort.

The best way of handling this is probably to pass reverse iterators into std::unique, rather than a separate reverse operation. But I'm not confident I can write that correctly without testing, and errors might be really confusing, so you get the unoptimised code...
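
For the record, that reverse-iterator variant might look something like the following (an untested sketch, using the same docid_cmp/docid_eq predicates as above):

// stable_sort keeps input order within each doc_id group, so the last
// element of each group is the last-read score
std::stable_sort(scores.begin(), scores.end(), docid_cmp);
std::vector<score_pair>::reverse_iterator kept_end =
    std::unique(scores.rbegin(), scores.rend(), docid_eq);
// unique scanned from the back, so within each group it kept the last-read
// element; everything in [begin, kept_end.base()) is leftover to discard
scores.erase(scores.begin(), kept_end.base());

// order by score, as before
std::sort(scores.begin(), scores.end(), score_cmp);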

Edit: just for comparison with your code, here's how I'd use the map:

std::map<int, float> scoremap;
while ((row = data.next_row())) {
    scoremap[row->docid] = calc_score(data.next_feature());
}
std::vector<score_pair> scores(scoremap.begin(), scoremap.end());
std::sort(scores.begin(), scores.end(), score_cmp);

Note that score_pair would need a constructor taking a std::pair<int,float>, which makes it non-POD. If that's not acceptable, use std::transform, with a function to do the conversion.
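
For illustration, the std::transform version might look something like this (a sketch only; to_score_pair is a hypothetical helper, and score_pair's members are assumed to be score and doc_id as in the question's code):

// hypothetical conversion helper from a map entry to a score_pair
// (needs <algorithm> for std::transform)
score_pair to_score_pair(const std::pair<const int, float>& p) {
    score_pair sp;
    sp.doc_id = p.first;
    sp.score = p.second;
    return sp;
}

std::vector<score_pair> scores(scoremap.size());
std::transform(scoremap.begin(), scoremap.end(), scores.begin(), to_score_pair);
std::sort(scores.begin(), scores.end(), score_cmp);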

Finally, if there is much duplication (say, on average 2 or more entries per doc_id), and if calc_score is non-trivial, then I would be looking to see whether it's possible to iterate the rows of data in reverse order. If it is, then it will speed up the map/unordered_map approach, because when you get a hit for the doc_id you don't need to calculate the score for that row, just drop it and move on.
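
A sketch of that last idea, assuming a hypothetical data.prev_row() that walks the rows backwards (the question only shows next_row(), so this is purely illustrative):

std::map<int, float> scoremap;
while ((row = data.prev_row())) {   // hypothetical reverse iteration
    // only the first hit per docid (i.e. the last row in forward order) is scored
    if (scoremap.find(row->docid) == scoremap.end())
        scoremap[row->docid] = calc_score(data.next_feature());
}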

Steve Jessop
Wow, just tested it and it worked really well. Performance is about the same (17 seconds) despite the double sort and memory usage remains constant. Solves my problem, thank you.
ephes
+1 for avoiding map - nice and smooth
digitalarbeiter
If this is running at basically the same speed as your original code, then I'd put that down to std::sort running faster than qsort thanks to inlining.
Steve Jessop
Found the time to try out the simplified map version. Well, it still takes about 3 times longer and uses 1.5Gb of additional memory (the 4.5Gb number from above was wrong, because 3Gb came from the mmapped datafile...). So, for now I'm sticking with the double-sort vector solution.
ephes
Finally tried unordered_map (after resolving some import/interface problems):

#include <tr1/unordered_map>
typedef tr1::unordered_map<int, float> ScoreMap;
ScoreMap scoremap;
scoremap.insert(make_pair(row->docid, score));

And it worked out well. Runtime is about 19 seconds, which is only 2s slower than the double-sort vector solution, and memory overhead is only about 300Mb... cool. Since the source is much simpler, I'm going to use unordered_map, thanks a lot!
ephes
A: 

Why not sort by doc id first, calculate scores, then for any subset of duplicates use the max score?

On re-reading the question, I'd suggest a modification to how the scores are read in. Keep in mind that C++ isn't my native language, so this won't quite be compilable.

unsigned count = 0;
pair<int, score_pair>* scores = new pair<int, score_pair>[num_rows];
while ((row = data.next_row())) {
    float score = calc_score(data.next_feature());
    scores[count].second.score = score;
    scores[count].second.doc_id = row->docid;
    scores[count].first = count;   // remember the order the row was read in
    count++;
}

assert(count <= num_rows);
qsort(scores, count, sizeof(pair<int, score_pair>), pair_docid_cmp);

// count the number of unique doc_ids
unsigned scoreCount = (count > 0) ? 1 : 0;
for (unsigned i = 1; i < count; i++)
    if (scores[i-1].second.doc_id != scores[i].second.doc_id) scoreCount++;
score_pair* actualScores = new score_pair[scoreCount];

int at = -1;
int lastId = -1;
for (unsigned i = 0; i < count; i++)
{
    // the first entry of each new doc_id holds the last-read score, thanks to pair_docid_cmp
    if (lastId != scores[i].second.doc_id) {
        actualScores[++at] = scores[i].second;
        lastId = scores[i].second.doc_id;
    }
}
qsort(actualScores, scoreCount, sizeof(score_pair), score_cmp);

Here pair_docid_cmp would compare first on docid, grouping rows with the same doc together, and then by reverse read order, so that the last item read is the first in the sublist of items with the same docid. This should only be about 5/2x the memory usage, and roughly double the execution time.
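
For illustration, such a comparator might look something like this (a sketch, not part of the original answer; it assumes the pair<int, score_pair> layout used in the code above):

int pair_docid_cmp(const void* a, const void* b)
{
    const pair<int, score_pair>* pa = (const pair<int, score_pair>*)a;
    const pair<int, score_pair>* pb = (const pair<int, score_pair>*)b;
    // group by docid first
    if (pa->second.doc_id != pb->second.doc_id)
        return (pa->second.doc_id < pb->second.doc_id) ? -1 : 1;
    // within a docid group, the later read index (.first) sorts first
    if (pa->first != pb->first)
        return (pa->first > pb->first) ? -1 : 1;
    return 0;
}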

CoderTao
Hmm, since qsort is not stable, the position of doc_id will be lost?
ephes
I misread the question. There's a similar solution to what I have here, wherein you store the index of when the score was read along with the score data; when you go through and sort by doc id, you also simultaneously sort on when the score was read in, using that when the docIds are the same. It would again add a bit to memory usage, but should still only be about double the current execution time. Roughly.
CoderTao
That, and the accepted answer does roughly the same thing in much cleaner code :)
CoderTao
+1  A: 

Unless I've misunderstood the question, the solution can be simplified considerably. At least as I understand it, you have a few million docids (which are of type unsigned int), and for each unique docid you want to store one 'score' (which is a float). If the same docid occurs more than once in the input, you want to keep the score from the last one. If that's correct, the code can be reduced to this:

std::map<unsigned, float> scores;

while ((row = data.next_row())) 
    scores[row->docid] = calc_score(data.next_feature());

This will probably be somewhat slower than your original version, since it allocates a lot of individual blocks rather than one big block of memory. Given your statement that there are a lot of duplicates in the docids, I'd expect this to save quite a bit of memory, since it only stores data for each unique docid rather than for every row in the original data.

If you wanted to optimize this, you could almost certainly do so -- since it uses a lot of small blocks, a custom allocator designed for that purpose would probably help quite a bit. One possibility would be to take a look at the small-block allocator in Andrei Alexandrescu's Loki library. He's done more work on the problem since, but the one in Loki is probably sufficient for the task at hand -- it'll almost certainly save a fair amount of memory and run faster as well.

Jerry Coffin
You're right. I'll take a look at Loki next week, thanks.
ephes
+1  A: 

If your C++ implementation has it, and most do, try hash_map instead of std::map (it's sometimes available under std::hash_map).

If the lookups themselves are your computational bottleneck, this could be a significant speedup over std::map's binary tree.

orip
Had some hard times getting hash_map to work (I have near zero experience using C++), but finally got it running. Here are the mandatory includes/defines/magic spells:

#include <hash_map>
#define hash_map __gnu_cxx::hash_map
#define hash __gnu_cxx::hash
hash_map<int, float, hash<int> > scoremap;
scoremap[row->docid] = score;

Runtime and memory usage are about the same compared with unordered_map, but I'll use unordered_map since the compiler spits out ugly "deprecated" warnings for hash_map. But +1 for providing the correct solution. I'm amazed by the quality of all the answers here, kudos.
ephes