views:

272

answers:

8

Hi,

This is similar to a recent question.

I will be maintaining sorted a list of values. I will be inserting items of arbitrary value into the list. Each time I insert a value, I would like to determine its ordinal position in the list (is it 1st, 2nd, 1000th). What is the most efficient data structure and algorithm for accomplishing this? There are obviously many algorithms which could allow you to do this but I don't see any way to easily do this using simple STL or QT template functionality. Ideally, I would like to know about existing open source C++ libraries or sample code that can do this.

I can imagine how to modify a B-tree or similar algorithm for this purpose but it seems like there should be an easier way.

Edit3:

Mike Seymour pretty well confirmed that, as I wrote in my original post, that there is indeed no way to accomplish this task using simple STL. So I'm looking for a good btree, balanced-tree or similar open source c++ template which can accomplish without modification or with the least modification possible - Pavel Shved showed this was possible but I'd prefer not to dive into implementing a balanced tree myself.

(the history should show my unsuccessful efforts to modify Mathieu's code to be O(log N) using make_heap)

Edit 4:

I still give credit to Pavel for pointing out that btree can give a solution to this, I have to mention that simplest way to achieve this kind of functionality without implementing a custom btree c++ template of your own is to use an in-memory database. This would give you log n and is fairly easy to implement.

A: 

Why do you need the ordinal position? As soon as you insert another item in the list the ordinal positions of other items later in the list will change, so there doesn't seem to be much point in finding the ordinal position when you do an insert.

It may be better to simply append elements to a vector, sort and then use a binary search to find the ordinal position, but it depends on what you are really trying to achieve

Chris Card
I don't care about the exact ordinal position but I do care about approximate ordinal position. But I can't get this easily from the value of the item. The plan you describe would be costly over time. Looking for algorithm and hopefully code that is O(log(n))...
Joe Soul-bringer
+1  A: 

Note that the std::list insert(it, value) member function returns an iterator to the newly inserted element. Maybe it can help?

Didier Trosset
Yes, wouldn't help unless the list is sorted by value. Push_Heap sorts it but doesn't return an iterator... if I could use the iterator After push_heap is calld, it might but I don't think I can...
Joe Soul-bringer
Using a list would still be O(n), both to find the insert position, and to find the ordinal of the iterator.
Mike Seymour
A: 

if you have an iterator that you want to find the index of then use std::distance, which is either O(1) or O(n) depending on the container, however the O(1) containers are going to have O(n) inserts so overall you are looking at an O(n) algorithm with any stl container.

as others have said its not immediatly obvious why this is useful?

jk
Perhaps with STL but as Pavel outlines, it's easy to describe an O(log(n)), it just may not exist in STL - but it might, haven't investigated set.
Joe Soul-bringer
yes i meant with STL containers, set has O(n) distance
jk
+7  A: 

Binary tree is fine with this. Its modification is easy as well: just keep in each node the number of nodes in its subtree.

After you inserted a node, perform a search for it again by walking from root to that node. And recursively update the index:

if (traverse to left subtree)
  index = index_on_previous_stage;
if (traverse to right subtree)
  index = index_on_previous_stage + left_subtree_size + 1;
if (found)
  return index + left_subtree_size;

This will take O(log N) time, just like inserting.

Pavel Shved
O(log N): Only if your tree is always balanced. In which case you need to factor in the time to rebalance the tree after every insert.
jmucchiello
Re-balancing also takes O(logN), updating subtree sizes included, so that doesn't matter.
Pavel Shved
If you don't want to roll your own balanced tree, you could do worse than starting with an open source implementation of `std::set`. For example, the GNU library has a red-black tree in `<bits/stl_tree.h>`.
Mike Seymour
Mike, why didn't you say so earlier...
Joe Soul-bringer
I should mention that a glance at the code referenced by Mike actually gave me the impression this would be one of the harder templates to modify...
Joe Soul-bringer
+6  A: 

I think you can std::set here. It provides you sorting behavior also returns the position of the iterator where the value is inserted. From that position you can get the index. For example:

std::set<int> s;
std::pair<std::set<int>::iterator, bool> aPair = s.insert(5);
size_t index = std::distance(s.begin(), aPair.first) ;
Naveen
Can I insert arbitrary out of order items this way?
Joe Soul-bringer
no set is maintains sorted order (which seems to be what the question asks for) perhaps you need to clarify what the question is?
jk
Yes..set will sort the values internally
Naveen
`std::distance` is O(n), isn't it?
Pavel Shved
yes, std::distance() is O(n) as set has a bidirectional iterator
Naveen
You probably want a `multiset` rather than a `set`, unless you want to discard duplicate values. In that case, remember to check `aPair.second` to see if `insert` succeeded.
Mike Seymour
A: 

If you have the iterator to the item (as suggested by dtrosset), you can use std::distance (e.g. std::distance(my_list.begin(), item_it))

Niklas
+1  A: 

If, as you say in one of your comments, you only need an approximate ordinal position, you could estimate this from the range of values you already have - you only need to read the first and last values in the collection in constant time, something like this:

multiset<int> values;

values.insert(value);
int ordinal = values.size() * (value - values.front()) /
                              (values.back()-values.front());

To improve the approximation, you could keep track of statistical properties (mean and variance, and possibly higher-order moments for better accuracy) of the values as you add them to a multiset. This will still be constant time. Here's a vague sketch of the sort of thing you might do:

class SortedValues : public multiset<int>
{
public:
    SortedValues() : sum(0), sum2(0) {}

    int insert(int value)
    {
        // Insert the value and update the running totals
        multiset<int>::insert(value);
        sum += value;
        sum2 += value*value;

        // Calculate the mean and deviation.
        const float mean = float(sum) / size();
        const float deviation = sqrt(mean*mean - float(sum2)/size());

        // This function is left as an exercise for the reader.
        return size() * EstimatePercentile(value, mean, deviation);
    }

private:
    int sum;
    int sum2;
};
Mike Seymour
I will doing this with a variety of sets having very distinct means and variance *and* I am interested in the "order of magnitude" of each item - is it in the first 10 values, the first 100, the first 1000. Max, min, mean and variance unfortunately won't get me very far as estimators of this.
Joe Soul-bringer
+1  A: 

If you want ordinal position, you want a container which models the RandomAccessContainer Concept... basically, a std::vector.

Operations of sorts on a std::vector are relatively fast, and you can get to the position you wish using std::lower_bound or std::upper_bound, you can decide by yourself if you want multiple values at once, to retrieve all equal values, a good way is to use std::equal_range which basically gives you a the same result as applying both the lower and upper bounds but with a better complexity.

Now, for the ordinal position, the great news is that std::distance as a O(1) complexity on models of RandomAccessIterator.

typedef std::vector<int> ints_t;
typedef ints_t::iterator iterator;

ints_t myInts;

for (iterator it = another.begin(), end = another.end(); it != end; ++it)
{
  int myValue = *it;
  iterator search = std::lower_bound(myInts.begin(), myInts.end(), myValue);
  myInts.insert(search, myValue);
  std::cout << "Inserted " << myValue << " at "
            << std::distance(myInts.begin(), search) << "\n";
  // Not necessary to flush there, that would slow things down
}


// Find all values equal to 50
std::pair<iterator,iterator> myPair =
    std::equal_range(myInts.begin(), myInts.end(), 50);
std::cout << "There are " << std::distance(myPair.first,myPair.second)
          << " values '50' in the vector, starting at index "
          << std::distance(myInts.begin(), myPair.first) << std::endl;

Easy, isn't it ?

std::lower_bound, std::upper_bound and std::equal_range have a O(log(n)) complexity and std::distance has a O(1) complexity, so everything there is quite efficient...

EDIT: as underlined in the comments >> inserting is actually O(n) since you have to move the elements around.

Matthieu M.
Of course, it's easy... with O(n) insertions.
Pavel Shved
The problem here is that `vector<>::insert` is O(n).
Mike Seymour
I think I can get around the O(n) for insert, see my edited question
Joe Soul-bringer
Crap, hadn't thought about actually inserting the items!
Matthieu M.