views:

277

answers:

8

Hi there,

I subscribe to a data feed and from that create and maintain a structure using the index values on the INSERT/DELETE messages. I would like to ask the assembled cognoscenti whether they know of any algorithm which can deal with the piecemeal updates in an efficient way - generally batch-updates contain between two and six such messages.

The estimated size of the array is around 1000 elements.

Batch updates arrive as a list of messages ordered by index, which stipulate the insertion or deletion of an item at a given index. I expect most of the churn in the array to be nearer its start than its end.

It occurs to me that with some basic processing I can determine the range affected by the batch and the overall size-delta, and therefore move an unaffected tail section of the array just once.

Similarly, I could maintain a certain amount of free space before the first element and after the final element to do the least amount of copying possible.

Other optimisations include recognising updates such as the following:

DELETE 10, INSERT 10 - effectively a replace which requires no copying  
INSERT 10, DELETE 11 - as above  
DELETE 10, DELETE 10, DELETE 10 - bulk deletion can be optimised into one copy operation  
INSERT 11, INSERT 12, INSERT 13 - bulk insertion can be optimised into one copy operation  

and so on.

However, I'm wary of the overhead in performing the recognition step - it smacks of look-aheads and track-backs, which may well take more time than simply performing the copy.

Given the expected size of the array, tree structures seem heavyweight: some basic performance testing suggests that binary or self-balancing trees (in this case a red-black tree list-implementation) only begin to show performance advantages after around 15K - 20K elements: array-copies are significantly faster at lower sizes. I should probably add that I'm using Java for this implementation.

Any hints, tips or suggestions would be welcomed.

Cheers

Mike

A: 

The simplest would be to run through the updates and copy the array into a new array while applying the updates.

1000 is not that big, it is probably not worthwhile to optimize further.

And to make your life easier better use an ArrayList.

starblue
I've thought of doing this but may have to have a thousand or so models on the go at once - and while the box will undoubtedly cope with the load, I'd really like to find the "best" way of solving this problem. Thanks for your answer!
Michael_73
Even a million is not that big these days. If you optimize further make sure to profile it, to see if it is really significantly better.
starblue
A: 

Beyond sorting the individual updates (as you mentioned already) to try and consolidate things, I don't know that I'd bother much. 1000 elements is, frankly, nothing in the large scope of things. I have a system with 25M elements, using simple bulk copies, and it's (for our purposes) far beyond more than fast enough.

So, I won't put the "pre-mature optimization" hat on, but I might glance over at it on the bookshelf first.

Will Hartung
Yep, know that feeling...
Michael_73
A: 

Using a linked list (java.util.LinkedList) might be something to look into. Getting an element at a particular index is expensive of course, but it may be better than performing array copies.

Michael Angstadt
Quite right - and I've considered it. It's a case of horses for courses, as you say, and structures like a skip-list may moderate the indexing costs of a linked-list structure.
Michael_73
+1  A: 

Always weigh code clarity vs. optimization. If there is no performance issue now, just make sure that the code is clear. If there is a performance issue in the future, you'll know its exact nature then. Preparing for it now is an exercise in guesswork.

If you need to manipulate quite a bit, a linked list may be worthy.

For simple clear code, however, I would use apache commons collection utils for a raw array or an arraylist otherwise:

myArray = ArrayUtils.add(myArray, insertionIndex, newItem);

OR

ArrayList<> mylist = new ArrayList<>(Arrays.asList(myArray));
myList.add(insertionIndex, newItem);
Peter DeWeese
Hi Peter, thanks for your answer. Unfortunately a linked list doesn't seem to fit the by-index updates we receive. Mike
Michael_73
@Michael: ArrayList is not a linked list. It's built on an array and has `O(1)` indexing.
Gunslinger47
My recommendation is for array or ArrayList manipulation, not a linked list.
Peter DeWeese
@Peter, but my comment was intended to refer to your suggestion that a linked list may be worthy. @Gunslinger, I intended to reflect that we need random access and as you say, an ArrayList would give us that - having said that, without access to the underlying array we can't optimise by leaving free space at both the start and end, nor can we identify the range undergoing mutation and block-copy the trailing section in once operation. Further we couldn't do batch updates of more than one item.
Michael_73
A: 

There is an extremely simple to implement datastructure, named "Cartesian trees" or "Treaps", that allows O(log N) splits, concatenations, insertions and deletions on arrays (and many more things).

2-3 trees are also very simple to implement (my implementation of a somewhat more complex facility had just 1 bug after the first compilation) and fit your purpose.

jkff
Do you think that the data-structure overhead will beat an ArrayList for flat-out speed on a typical size of 1,000 elements? Thanks for your answer, jkff, am looking into both structures right now.
Michael_73
I'm not sure about the speed for 1000 elements, this needs to be benchmarked. Though, I guess the tree will be somewhat faster.
jkff
+1  A: 

If space is not a constraint and you are not going to have duplicates, go for Set datastructure, in particular Java's HashSet. The power of this data structure is the insertion and deletion are done in O(1) time which would best suit you if performance is 'the' criterion.

Moreover, whenever you speak of Arrays besides their fast retrieval, you have the serious limitation of numerous array copies that might happen which not only is going to take up space (for array growth) but also the efficiency will be poor as each of Insert/Delete might take O(n) time.

Bragboy
Those are insertions and deletions by element, not by index. The usefulness of this answer depends on why Michael is using indexes.
Gunslinger47
For an indexed list he would need a `HashMap`, not `HashSet`. And `HashMap`s are O(1), but with a significant constant. With only 1000 elements, I'm inclined to think even an ArrayList would be faster.
MAK
I'm using index values as that is what the data feed contains - there are undoubtedly more efficient ways of describing a change, but that's all I have to work with. What's worse (and I haven't burdened you all with it) is that the list is actually unordered and index values become valid only once all prior updates have been applied to the array.
Michael_73
+2  A: 

In general, if you have the changes listed by index order, you can build a simple loop that only copies things once. Here's some pseudocode:

array items;
array changes; // contains a structure with index, type, an optional data members
array out; // empty, possibly with ensureCapacity(items.length)
int c = 0, delta = 0;
// c is the current change
//delta tracks how indexing has changed by previous operations
for (i = 0; i < items.length; i++) {
    if c < changes.length {
        curchange = changes[c]
        if (i + delta) == curchange.index {
            c++;
            if (curchange.type == INSERT) {
                out.add(curchange.data)
                delta--;
            } else {
                delta++;
                continue; // skip copying i
            }
        }
    }
    out.add(items[i])
}
for (; c < changes.length; c++) { // handle trailing inserts
    assert(c.index == out.length && c.type == INSERT)
    out.add(c.data);
}

That runs through the input array once, and builds the output array with all changes made.

Note that this doesn't handle multiple inserts at the same location. It would make the code a bit more elaborate to do that but it's not too hard.

However, it will always run all the way through the array, once per batch. A slightly tougher change would be to keep a temporary around and do the changes in-place with two index variables; then, if you hit the end of the change list, you could break out of the loop early and not touch the rest of the list.

Walter Mundt
Looks good Walter, will have a think about it...
Michael_73
Your answer is pretty much what I was thinking.
Mike Dunlavey
A: 

If that's truly what your data set looks like, you might consider duplicate tracking with a Collection (like a HashMap). The Array would be your ordered list with each activity listed in order, and your Collection would be indices to the array.

For example:

class EventQueue
{
  Vector eventQueue;
  HashMap eventMap;

  public synchronized Event getNextEvent()
  {
      Event event = eventQueue.remove(0);
      eventMap.remove(event.getId());  // this would be 10 from 'INSERT 10' 
                                       // in the sample from the OP
  }

  public synchronized addEvent(Event e)
  {
     if( eventMap.containsKey(e.getId())
     {
        // replace events that already exist
        int idx = eventMap.get(e.getId());
        eventQueue.removeElementAt(idx);
        eventQueue.add(idx, e);
     } else {
        // add new events
        eventQueue.add(e);
        eventMap.add(e.getId(), eventQueue.size()); // may be off by one...
     }
  }

  public boolean isReady()
  {
    return eventQueue.size() > 0;
  }
}

class FeedListener extends Thread {
 EventQueue queue;
 EventFeed feed;
 ...
 public void run()
 {
    while(running) {
       sleep(sleepTime);
       if( feed.isEventReady() ) {
         queue.addEvent(feed.getEvent());
       }
    }
 }
}

abstract class EventHandler extends Thread {
 EventQueue queue;
 ...
 public void run()
 {
   while(running)
   {
     sleep(sleepTime);
     if( queue.isReady() )
     {
       Event event = queue.getNextEvent();
       handleEvent(event);
     }
   }
 }

 public abstract void handleEvent(Event event);
}

atk