views:

322

answers:

4

Hi,

I am looking for an efficient data structure that'll allow me to queue events ... that is, i will be having an app where, at any time during execution, an event may be raised for a future point in execution ... something like:

  • t=20: in 420 seconds, A occurs
  • t=25: in 13 seconds, B occurs
  • t=27: in 735 seconds, C occurs
  • ...

so i'd like to have a data structure where i can put in any event at any time in the future, and where i can get and (by doing so) remove all due events ... also, a plus would be if i were able to remove an event from the data structure (because it was cancelled) ... not too important though, since i can simply flag it as cancelled ...

my first thought was, maybe to do some sort of tree, but i guess the removing-due-events part requires a lot of rebalancing ...

i am considering simply having an int hash, mapping timestamps to either null or stacks of events that are to occur at that point in time ... i think in scenarios with a lot of events (possibly multiple every second - which is what i intend to work with), this actually isn't such a bad idea after all ...
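to make the idea concrete, here's a rough sketch of that hash-of-lists structure in Python (all names are mine and purely illustrative):

```python
from collections import defaultdict

class EventQueue:
    """Maps integer timestamps to lists of events due at that second."""

    def __init__(self):
        self.buckets = defaultdict(list)  # timestamp -> list of events

    def schedule(self, due_time, event):
        self.buckets[due_time].append(event)  # O(1) amortized

    def pop_due(self, now):
        """Return and remove all events due at or before `now`."""
        due_times = [t for t in self.buckets if t <= now]  # scans every key
        events = []
        for t in sorted(due_times):
            events.extend(self.buckets.pop(t))
        return events

q = EventQueue()
q.schedule(20 + 420, "A")  # t=20: in 420 seconds, A occurs
q.schedule(25 + 13, "B")   # t=25: in 13 seconds, B occurs
print(q.pop_due(100))      # ['B']
```

the obvious weak spot is `pop_due`, which scans every key in the map - exactly the point the answers below pick apart ...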

so i am eager to hear your input ... :)


edit:

  • to be more specific: i think n here is at about 100K-1M, and i guess i might be having about 1-100 events/second ...
  • the t is of no special importance ... it is only to illustrate that a future event can be "enqueued" at any time ...


thanks

back2dos

+10  A: 

I believe you're looking for a Priority Queue, with the timestamp of when the event occurs being the priority (well, lower timestamps would be higher priority).

Just a little elucidation with your use cases:

... where i can put in any event in any time in the future...

You'd insert into the priority queue with insertWithPriority, using the timestamp of when the event occurs. This would be O(log N).

... and where i can get and (by doing so) remove all due events ...

You'd repeatedly call getTop (gets event with the lowest timestamp) collecting all elements up to the time of interest.

... also, a plus would be, if i were able to remove an event from the datastructure (because it was canceled) ... not too important though, since i can simply flag it as cancelled ..

This would be possible, but would be O(log N) due to rebalancing.
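As a sketch of how those operations might map onto a binary heap (Python's `heapq`; method names here are just illustrative stand-ins for `insertWithPriority`/`getTop`):

```python
import heapq
import itertools

class TimerQueue:
    """Priority queue keyed by fire-off timestamp (earliest first)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal timestamps

    def insert_with_priority(self, fire_time, event):
        # O(log N) push; counter keeps comparison away from event objects
        heapq.heappush(self._heap, (fire_time, next(self._counter), event))

    def pop_due(self, now):
        """Repeatedly take the earliest event until none are due."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, event = heapq.heappop(self._heap)  # O(log N) per pop
            due.append(event)
        return due

tq = TimerQueue()
tq.insert_with_priority(440, "A")
tq.insert_with_priority(38, "B")
print(tq.pop_due(100))  # ['B']
```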

Falaina
@Falaina - I think my answer is a fairly efficient implementation of a Priority Queue (or could become one without any changes to underlying datastructure) :)
DVK
+1  A: 

If your events have a well defined upper limit (e.g. no events later than 2 days in the future), you can simply have an array indexed by # of seconds from "beginning of time". The value of the array is a list of events at that offset.

Listing or removal is very efficient - simply find the offset for the time where you wish to list or cut off, and get or re-initialize the arrays pointed to by indices after that offset.

If your events can stretch out indefinitely into the future, then your own idea of using a hashmap from offsets to list of events is the best one, with a twist - have a sorted list (however you wish to implement it) of known offsets, that way you will have very efficient lookups (e.g. you won't have to loop over every key in the map).

You don't need to delete anything from the list of known offsets so no issues with re-balancing - you merely delete from the arrays that hashmap points to.

Also, it seems unclear from your question whether there's any need to know "t" - the time when the event was raised. If you need to know it, store it as part of the event, but the reference to when the event should happen should be absolute relative to some starting point (if it's a hashmap with unbounded range, you can use epoch seconds; if events have bounds like in the first array solution I listed, you should instead use "# of seconds since beginning of range" - e.g. from start of day yesterday).
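A minimal sketch of the bounded-horizon array variant (the 2-day window and all names are illustrative assumptions, not part of the answer above):

```python
HORIZON = 2 * 24 * 3600  # assume no event is scheduled more than 2 days out

class BoundedSchedule:
    """Array indexed by seconds since `start`; each slot holds a list of events."""

    def __init__(self, start_time):
        self.start = start_time
        self.slots = [[] for _ in range(HORIZON)]  # one list per second

    def schedule(self, fire_time, event):
        offset = fire_time - self.start
        if not 0 <= offset < HORIZON:
            raise ValueError("event outside the fixed horizon")
        self.slots[offset].append(event)  # O(1)

    def pop_slot(self, fire_time):
        """Get and clear every event due at exactly `fire_time`."""
        offset = fire_time - self.start
        events = self.slots[offset]
        self.slots[offset] = []
        return events
```

The trade-off discussed in the comments below is visible here: insertion and per-slot removal are O(1), but the array is huge and mostly empty when events are sparse.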

DVK
@flybywire - did you bother actually reading? Every fire-off time is mapped to an ARRAY of events. Would you care to list actual limitations other than you not reading carefully?
DVK
t is not of particular interest ... it was just to illustrate that events could be enqueued whenever ... the event's stamp is determined by the current timestamp plus the offset to the future ...
back2dos
the first approach is actually an interesting idea ... i don't have an upper bound for time offsets, but i could simply use it as a cache, and put events in far future into a different structure ... actually i am much more likely to get events in a near future than in a far future, but everything is possible ... i am a bit skeptical, when it comes to updating the array ... usually arrays are not very fast for resizing, even less from the front ... is there anything cheaper than O(n) (since the whole rest of the array needs to be copied)?
back2dos
If you don't have bounds, I would strongly recommend hashes. Even including the supplementary ordered list of fire-off times, the insertion time is O(log M), where M is the amount of distinct fire-off times (e.g. if you have 1000 events happening once an hour throughout the day, your complexity is O(log 24) and not O(log 1000)).
DVK
The memory allocation for sparse unbounded list of keys is what really would be the difference between choosing an array or a hash for higher-level structure.
DVK
@flybywire - getNext is implemented by getting the first element of the sorted list of fire-off times (O(1)) plus getting the first or last element off the list of events from the hash with the just-found key (O(1), though technically hash lookups are a bit more than O(1)).
DVK
@flybywire - removing a due event (the one we just got) is just popping said event off the list (if the lists within a hash are implemented as stacks, it's O(1)). Removing ALL due events is O(1) + time to destruct the stack (e.g. however long stack.clear() takes in the language it's implemented in).
DVK
@DVK but if you keep a sorted list of fire-off times then you can't have efficient update. In short - this question has a right answer and it is priority queue - and **not** your answer, even though in appreciation for your creativity I shall hereby remove my -1 that seems to have hurt you so much.
flybywire
@flybywire - Priority Queue update efficiency IIRC is usually O(log N). My solution is O(log M) where M < N (possibly greatly less than N as # of events rises). A smart priority queue update would be O(log M) as well, but I have a feeling an optimized implementation of a PQ will closely resemble my solution.
DVK
@DVK you said: "getNext is implemented by getting the first element of the sorted list of fire-off time (O(1))". When a new event arrives, how can you possibly update this sorted list in logarithmic time? Updating a sorted list is linear.
flybywire
Priority queue idea is much better than this idea. With priority queue, I can lookup the time of the next event with a single operation. With your idea, you have to scan the array until you find an non-empty list.
FogleBird
@FogleBird - if you are referring to the first (array) implementation, you will not only at worst need to scan a VERY short array (# of seconds in a day), but you could very easily optimize even that by keeping an index of the known first entry, and refreshing that index whenever you delete all events from that entry or the entry is in the past. That updating only happens once per fire-off time, and thus the search for the first entry is O(1), like in a priority queue.
DVK
@FogleBird - if you are referring to the second implementation (hash + sorted list of fire-off times), then I explained in a comment above that getNext is O(1) (find the first fire-off time in the sorted array in one step and pop off the first element in a stack pointed to by that fire-off time via a hash in a second step).
DVK
@flybywire and @FogleBird: From Wikipedia on PQ: "Sorted list implementation: (O(n) insertion time, O(1) get-next time, O(n*log(n)) to build)" - this is - as noted - SLOWER than my hashmap solution (O(m) insertion time, O(1) get-next time).
DVK
@flybywire - your last point is valid. Worst case scenario for updating the sorted list is O(m), not O(log m). However, that still makes it faster than a PQ where insert is O(n), since as we discussed n >> m for a large amount of entries to be scheduled.
DVK
For a priority queue, it's O(log n) to insert or remove one event, but the queue is compact and the next event is at a fixed location. An array of seconds is hardly compact (if there's ten events in the next day, that's slightly better than 1% of 1% entries used), doesn't generalize to indefinite time spans, and has somewhat worse time complexity (O(1) to insert, O(m) to find, where m is the number of time slots you have).
David Thornley
@David - Correct. An array is completely worse than a map with the usage scenario you described. However, for any production scheduler I ever worked with, the density would have been at least 50%, so it was a lot less wasteful (doesn't invalidate your point though for sparse schedules).
DVK
in the end, this was the most inspiring answer ... :)
back2dos
Thanks back2dos!!! I'd love to get a round of apologies from people who have trouble digesting algorithms that are not vouched for by Wikipedia, never mind analyzing the O(n)... but your accepting means so much more!
DVK
+2  A: 

How big is N? How often do you have to insert & remove items, compared to everything else going on? If this is more than 10% of total execution time, and if N is typically more than 100 (say), then maybe it's time to be concerned about big-O. I've seen programs with priority queues implemented with fancy container algorithms, allocating iterators, hash maps, heaps, etc. and spending all their time in creating and releasing abstract objects, where the median queue length was like three.

ADDED: OK, since N ~ 10^6, and the frequency is ~ 100 Hz, you probably want some sort of binary tree or heap with O(log(N)) insertion/removal time. If you're willing to devote 1% of CPU time to this, that is 10^6 microseconds * 1% / 100 = 10^2 microseconds / operation. That should not be at all difficult, because if typical search depth is 20, at ~50ns per comparison, that is ~1 microsecond to do the search. Just make sure to keep it simple, not getting all wrapped up in abstract datatypes. You shouldn't have to worry much about time spent in allocating/freeing tree nodes, because you only allocate/free one node per operation. Rebalancing need not be done frequently, maybe only after every 1000 operations. If you can collect your insertions in batches and then insert them in random order, that may prevent the tree from getting too unbalanced. If many of your events are simultaneous, you could add a small amount of noise to the time code, to prevent parts of the tree from becoming more like a linear list.

Mike Dunlavey
lol ... +1 for that length of 3 ... :) ... just updated my post ...
back2dos
+2  A: 

Ok, I'd like to thank you all for your answers - very interesting and helpful. :)

PriorityQueue is definitely the right term I was searching for - thanks for that. Now it's all about implementation.

Here is what I think:

Let N be the size of the queue and M be the average amount of events per timestamp ("concurrent" events, so to speak) at the time of processing. The density of events will not be evenly distributed: the "far future" is much more sparse, but as time moves on, that area of time becomes much more dense (actually, I think the maximum density will be somewhere 4 to 12 hours in the future). I am looking for a scalable solution that performs well for considerably big M. The goal is to really process those M due events within one second, so I want to spend the least time possible on finding them.

  1. Going for the simple tree approach, as suggested several times, I'll be having O(log N) insertion, which is quite good, I guess. The cost of processing one timestamp would be O(M*log N), if I am right, which is not so good anymore.
  2. An alternative would be to have a tree with lists of events instead of single events. It should be feasible to implement some getListForGivenStampAndCreateIfNoneExists operation that'd be a little faster than going down the tree twice if no list exists. But anyway, as M grows, this shouldn't even matter too much. Thus insertion would be O(log N), as before, and processing would be at O(M + log N), which is also good, I think.
  3. The hash-of-lists-of-events approach I formulated. This should also have O(1) insertion and O(M) processing cost, although this is not too trivial with hashes. Sounds cool, actually. Or am I missing something? Of course it is not so easy to make a hash perform well, but apart from that, are there any problems? Or is the hash the problem? Wikipedia states:
    "In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at constant average (indeed, amortized) cost per operation."
    A quick benchmark showed that the standard implementation for my platform seems to match this.
  4. The array-of-lists-of-events approach provided by DVK. This has O(1) insertion. Now that is good. But if I understand correctly, it has O(M+T) processing cost, with T being the size of the array (the number of time slots if you will), because removal from arrays comes at linear cost. Also, this only works if there is a maximum time offset.

Actually, I would like to discuss the array approach. O(M+T) is not good. Not at all. But I put some brains into it, and this is what I came up with:

First Idea: Laziness

The O(T) could be crunched down by an arbitrary factor by introducing a bit of laziness, but in the end it'd stay O(T). But how bad is that? Let's have T=2419200, which is 28 days. Then, once a day, I'd clean it up (preferably while low load is expected). That'd waste less than 5% of the array. On my target platform, the copy operation takes 31 msecs on a fairly old 2GHz core, so it doesn't seem such a bad idea after all.

Second Idea: Chunks

After thinking a little, I thought of this solution: a hash-of-intervals, an interval (i.e. a given time frame) in turn being an array-of-lists-of-events. The intervals are all of equal size, preferably something simple, like days or maybe hours.

For insertion, I look up the right interval through the hash (creating it if none exists), and within the interval, the right list-of-events (again creating it if none exists), and then just insert the event, which is O(1).

For processing, I simply take the current interval and process due events, by processing the currently due list-of-events and then disposing of it. The array stays of constant length, so we are at O(M) (which is quite the best you can get for processing M elements). Once the current interval is entirely processed (thus if the interval now represents the "past"), I simply dispose of it at O(1). I can keep an extra reference to the current interval, eliminating the need to look it up, but I guess this doesn't provide any noticeable improvement.
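A sketch of that chunked layout (the one-hour interval size and all names are my own illustrative choices):

```python
from collections import defaultdict

INTERVAL = 3600  # one chunk per hour; tune for memory vs. hash-lookup overhead

class ChunkedQueue:
    """Hash of intervals; each interval is a fixed array of per-second event lists."""

    def __init__(self):
        # chunk index -> array of INTERVAL per-second lists, created on demand
        self.chunks = defaultdict(lambda: [[] for _ in range(INTERVAL)])

    def schedule(self, fire_time, event):
        chunk, slot = divmod(fire_time, INTERVAL)
        self.chunks[chunk][slot].append(event)  # O(1): hash lookup + append

    def pop_due(self, fire_time):
        """Process and dispose of the list for one second; drop exhausted chunks."""
        chunk, slot = divmod(fire_time, INTERVAL)
        events = self.chunks[chunk][slot]
        self.chunks[chunk][slot] = []
        if slot == INTERVAL - 1:
            self.chunks.pop(chunk, None)  # whole interval is now in the past
        return events
```

insertion and per-second processing stay O(1) plus the O(M) cost of handling the M events themselves, and memory for an interval is reclaimed as soon as it lies entirely in the past ...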


It seems to me the second optimization is really the best solution, since it is fast and unbounded. Choosing a good size for the intervals allows trading memory overhead against hash lookup overhead. I don't know whether I should worry about the hash lookup time at all. For high M, it shouldn't really matter, should it? Thus I'd choose an interval size of 1, which leads me back to approach number 3.

I'd be really grateful for any input on that.

back2dos
asking another question in an answer doesn't work; SO is not a threaded-discussion forum. You could edit the original question or post this as another question (with a link to the original) and stand a much better chance of getting this new question answered
Steven A. Lowe