Hi,

I have a situation whereby I'm populating an ArrayList with "TransactionEvent"s. TransactionEvent has a property "transaction ID". In the large majority of cases each new event has a transaction ID greater than the previous event's ID; however, this is not guaranteed, i.e. the data is almost sorted.

My question is this: how can I perform fast look-ups based on transaction ID? My current idea is to call Collections.binarySearch(...) and, if this fails, to perform a linear search. However, I notice the Javadoc states that the result of binarySearch is undefined if the data is unordered, so I may have to roll my own implementation.

Additional:

  • I have tried using a map of index -> transaction ID, but this approach is flawed because whenever a list element is updated / deleted I have to rebuild the entire map; i.e. any gains are erased by this.
  • This is not a case of premature optimisation: the List is the basis for a TableModel that currently performs very slowly when containing a large number of rows (100,000).

Any help appreciated.

A: 

From what you have said, it looks like fast look-ups are the most important thing here.

So perhaps you should be using a HashMap instead of an ArrayList. In the HashMap, store your TransactionEvents using the transaction ID as the key. Look-ups in a HashMap are O(1).

Note that adding to the HashMap can become quite slow if you exceed its initial capacity, since it has to do a re-hash. If you can, try to initialise it with a best guess (err on the high side) at the number of items it will hold.
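
A minimal sketch of that idea (the getTransactionId() accessor and the 200000 capacity are illustrative guesses, not code from the question):

// sized up front to avoid re-hashing; err on the high side
Map<Long, TransactionEvent> eventsById =
    new HashMap<Long, TransactionEvent>(200000);

eventsById.put(event.getTransactionId(), event);   // O(1) insert
TransactionEvent found = eventsById.get(someId);   // O(1) look-up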

With 100k rows, you might have to increase your Java heap size to prevent OutOfMemoryErrors.

java -Xms<initial heap size> -Xmx<maximum heap size>

Defaults are:

java -Xms32m -Xmx128m

EDIT:

If ordering really is important you could use a SortedMap.

Winston Smith
@Joe: Thanks for the suggestion but I've already mentioned in the question that using a Map will not work - I need a List as I am layering a TableModel over the data structure. Also, I would have to repopulate the map when the list indices (i.e. row indices) were changed due to an event deletion / update.
Adamski
@Adamski - actually you indicated that storing a separate mapping of INDEX in the list to TransactionID did not work. This is quite different.
Winston Smith
@Joe: Can you please clarify? In your suggestion the Map would hold Transaction ID as the key - What would the value be? I need to determine the row index given a certain transaction ID.
Adamski
+3  A: 

Using a LinkedHashMap, which combines a doubly linked list with hash access, you should be able to interface with the TableModel as you would with an ArrayList, but also access the entries via a hash look-up on transaction ID.

You can even replace (e.g. update) based on a key without affecting the iteration order.
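
A rough sketch of that approach (assuming a getTransactionId() accessor on TransactionEvent):

// iteration order is insertion order; look-ups by ID are O(1)
Map<Long, TransactionEvent> events =
    new LinkedHashMap<Long, TransactionEvent>();

events.put(e.getTransactionId(), e); // re-putting an existing ID updates
                                     // the value without moving its position

for (TransactionEvent te : events.values()) {
    // traverse in insertion order for the TableModel
}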

Jon
@Jon: Traversal order is important but I also need efficient index-based look-ups as the data structure sits beneath a TableModel. Hence I really need an ArrayList but that's not to say I couldn't supplement my model with other data structures to improve look-up performance by ID.
Adamski
+3  A: 

You could keep the ArrayList sorted by searching for the insertion point as you add each TransactionEvent. Collections.binarySearch returns

index of the search key, if it is contained in the list; otherwise, (-(insertion point) - 1). The insertion point is defined as the point at which the key would be inserted into the list: the index of the first element greater than the key, or list.size(), if all elements in the list are less than the specified key. Note that this guarantees that the return value will be >= 0 if and only if the key is found.

Once you have found the insertion point you can use the ArrayList add(int index, Object element) method instead of just adding to the end of the list as you would normally. This will slow down each insertion by a small factor, but it enables you to use binary search for fast look-ups.
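
A sketch of that insertion logic, assuming TransactionEvent implements Comparable by comparing transaction IDs:

// keep the list sorted by inserting each event at its insertion point
void sortedInsert(List<TransactionEvent> list, TransactionEvent e) {
    int index = Collections.binarySearch(list, e);
    if (index < 0)
        index = -(index + 1);  // convert failure result to insertion point
    list.add(index, e);        // near the end in practice, so cheap
}

// look-ups are then O(log n):
int row = Collections.binarySearch(list, key);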

Bill the Lizard
+1 Thanks Bill - This is the best suggestion so far. The snag is that I want new TransactionEvents to appear at the end of the TableModel. I guess I could always impose this ordering using a RowSorter but presumably that would then re-sort the data whenever a row was added / updated?
Adamski
How about storing an additional ArrayList which contains the indexes of the sorted array, by insertion order? So to iterate in insertion order for your TableModel you would index into the sorted ArrayList via the extra ArrayList. To lookup by TransactionID you would binary search the sorted ArrayList.
Jon
@Jon: Good suggestion. The only reservation I have is the memory used by the extra objects created (the Integer indices stored in the additional list). I'd have to test it to be sure, but it might be more efficient to just have two ArrayLists of TransactionEvents, since each list would only store a reference to each object.
Bill the Lizard
A: 

You could keep your list sorted. If you insertion-sort it as you add items, and the items to be added are almost sorted, then the insertions will still run in effectively constant time. This would then allow you to binary search in logarithmic time.

James
A: 

I would use a binary search to get an approximate location of the ID and then search outwards linearly. The downside to this is that if the ID you're searching for isn't in the list then it will take O(n + log n).

Binary searches are very easy to implement and I recommend reading the Wikipedia article.
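
A sketch of the combined approach; it assumes TransactionEvent is Comparable on its ID and that a probe instance can be constructed from a bare ID (both are assumptions, not code from the question):

int findEvent(List<TransactionEvent> list, long id) {
    if (list.isEmpty()) return -1;

    // the result of binarySearch is undefined on almost-sorted data,
    // but it still lands close to the target, so use it as a guess
    int guess = Collections.binarySearch(list, new TransactionEvent(id));
    int start = guess >= 0 ? guess : -(guess + 1);
    if (start >= list.size()) start = list.size() - 1;

    // probe outwards from the guess until found or both ends exhausted
    for (int d = 0; d <= list.size(); d++) {
        int lo = start - d, hi = start + d;
        if (lo >= 0 && list.get(lo).getId() == id) return lo;
        if (hi < list.size() && list.get(hi).getId() == id) return hi;
        if (lo < 0 && hi >= list.size()) break;
    }
    return -1; // absent: the O(n + log n) case mentioned above
}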

James
+1: Thanks - Sounds like a possible way to go.
Adamski
A: 

I had the same problem. The solution I came up with is a custom collection based on ArrayList which includes a Map of all the elements too. This is not hard to do. If you'd like me to post the source code, let me know.
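
Presumably something along these lines (a sketch of the idea, not eugener's actual code):

// the list preserves row order for the TableModel;
// the map gives O(1) look-up by transaction ID
class EventCollection {
    private final List<TransactionEvent> list = new ArrayList<TransactionEvent>();
    private final Map<Long, TransactionEvent> byId = new HashMap<Long, TransactionEvent>();

    public void add(TransactionEvent e) {
        list.add(e);
        byId.put(e.getId(), e);
    }

    public TransactionEvent get(int row)     { return list.get(row); }
    public TransactionEvent getById(long id) { return byId.get(id); }
}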

eugener
+1  A: 

ArrayList is for toy-sized problems, and 100,000 rows is getting a little out of toy space. That means you have to be more precise about the access patterns you need to support. A sorted ArrayList might be enough, and if processing speed is growing faster than your problem size you might not want to bother, but a BTree will be faster at 100K elements.

ArrayList has the following problems with larger problem sizes:

  • adding to the end is slow when the collection has to grow (all elements are copied)
  • inserting at a random position is slow because on average half the collection has to be moved one position

A two-level collection with a fixed page size (e.g. a BTree) can help: growing means adding a page of (ideally) about sqrt(size) elements, and a random insert at most splits one page in two.

With two needed sort orders, you can simply use two (sorted) BTrees.

[edit] The answer to the earlier question is the key to the problem. For a 1000-element ArrayList an insert costs 7 microseconds; for 1,000,000 elements, 7 milliseconds. The BTree stays in the microsecond range (but could be twice as slow with a 1000-element page size).

Indexed access you can create by keeping an index of the number of elements in each page. If you set a dirty flag on each page you can use a background thread to update the start index of each page, or you can add bulk operations with delayed index building.

The index can be invalid, but it is only sqrt(size) large. For 100K elements, that means incrementing just 150 indexes on average, which takes microseconds, not milliseconds.
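
A minimal sketch of indexed access over fixed-size pages (the page structure here is illustrative; a real BTree would keep the counts in its nodes):

List<List<TransactionEvent>> pages = new ArrayList<List<TransactionEvent>>();

// walk the per-page counts to translate a row index into a page offset;
// with ~sqrt(size) pages this touches only ~150 counts for 100K rows
TransactionEvent getRow(int row) {
    for (List<TransactionEvent> page : pages) {
        if (row < page.size())
            return page.get(row);
        row -= page.size();
    }
    throw new IndexOutOfBoundsException();
}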

Stephan Eggermont
According to answers to a previous post I made (http://stackoverflow.com/questions/1192586/efficient-tablemodel-implementation), System.arraycopy is sufficiently optimised that I won't notice array elements being copied. With the BTree approach how can I efficiently retrieve values for TableModel's getValueAt(int, int) method? Any index mapping would be invalidated as soon as an element was removed from the structure.
Adamski
"Add to the end is slow when the collection has to grow (copy all elements)." I am not sure, but I doubt this is the case. My guess is that the JVM relies on an underlying realloc to do this, and most reallocs move the smaller of a) the list or b) the things necessary to expand the list where it is. 100,000 rows is generally going to be larger than most things, so it's more likely that something else will get copied than that the whole list will get copied.
Imagist
Imagist: no, it does not. Nobody does reallocs unless they are out of memory, and then you have worse problems
Stephan Eggermont
I just tried removing elements at random indices from an ArrayList of 100,000 elements. The average removal time on my machine was 0.062ms, which is an order of magnitude faster than doing a linear look-up / binary search in the same list; i.e. the array copy appears to be highly optimised.
Adamski
Then your machine is ten times faster than the one in the earlier answer. That means the cut-off point where you need to move to a 2-level tree lies a factor 10 higher. You might want to try with 1000 and 1000000 elements.
Stephan Eggermont
A: 

My vote is that you insert into the list in order. Then you can do a binary search. A few notes:

  1. This will be faster than normal insertions because inserting into an ArrayList near the end is faster than inserting near the beginning (fewer elements need to be moved), and most of your insertions will be at or near the end (because the data is almost ordered).
  2. Normally you would find the insertion point for inserting into an ArrayList using a binary search algorithm. In this case it's faster to search linearly, starting from the end, since most of your insertions will occur at or near the end (see the sketch below).
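
A sketch of point 2, with getId() assumed from the question:

// the input is almost sorted, so this scan usually stops immediately
void insertNearEnd(List<TransactionEvent> list, TransactionEvent e) {
    int i = list.size();
    while (i > 0 && list.get(i - 1).getId() > e.getId())
        i--;
    list.add(i, e); // keeps the list sorted by transaction ID
}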
Imagist
Unfortunately this would mean rows appearing at arbitrary points in the table - I need new elements to appear at the end.
Adamski
A: 

Why not just use a sorted collection as your table model instead of a list? A TreeMap seems logical since your entries are all ordered. If you also need fast access by row or any other column, you can simply add a secondary map. Basically you are doing what database indexes do.

I thought for some reason that you could use map.headMap(key) and find the kth entry -- this won't work. You need to be able to get from table row -> EventID (or close to it).

if you use a model like this

SortedMap<EventID, Event> model = new TreeMap<EventID, Event>();

Conceptually your getValueAt() looks like this:

Object getValueAt(int row, int column) {
 EventID eventID = getSortPosition(row);
 Event e = model.get(eventID);
 return getColumn(e, column);
}

The key is being able to efficiently maintain a map from sort index -> key (an inverse map). This is non-trivial, since inserting a new event at the very top affects the absolute order of all those below it. It seems like there should be a CS answer here, but it escapes me.

Here is the most basic implementation: on every insert, you update your map, then materialize your sorted map.

ArrayList<Event> orderedEvents = new ArrayList<Event>();

public void insert(Event event) {
 model.put(event.getID(), event);

 // re-materialize the ordered view from the sorted map
 orderedEvents.clear();
 orderedEvents.addAll(model.values());
}

Your getValueAt() would be pretty simple.

Object getValueAt(int row, int column) {
 Event e = orderedEvents.get(row);
 return getColumn(e, column);
}
  • this makes each insertion O(n) instead of O(n log n) (still not great)

I think you should reconsider your UI design. If you are having users browse a 100K-row table, adding a search filter will solve your performance problem:

  • No user will EVER read 100k rows
  • If it is meaningful for your users to search by event ID then this works great: when the user selects the event ID, you do sortedMap.tailMap(searchFilterID) // take the first 200 and put them into your table
  • If it is meaningful for users to search by time, then make a map keyed by time and do the same.
Justin
How would this work? Specifically, How would TableModel's getValueAt(int, int) method work?
Adamski
I should have thought it through more carefully. It's still possible, as I describe in my edit.
Justin
UI redesign: only if it is really shown as a table. A better visualization (2D or 3D) can easily handle that many rows. That is more than 20 pixels/element on a 1920*1200 screen, without scrolling :)
Stephan Eggermont
If showing all objects, the sorting constraint means that each insertion should trigger a full screen refresh, which is O(n), and materializing your sorted map is still the way to go. Since updates occur at the end, materializing the tail set from the new key onwards is typically constant time.
Justin
A TreeMap is really, really slow. It is a red-black tree, so it suffers from excessive memory consumption, doesn't have enough fan-out to work well with a modern memory architecture, and is not really what a database index is.
Stephan Eggermont
I suggested TreeMap because it is available in Java for free, and gives you O(log n) insert/search performance. The outlined solution only requires the SortedMap interface, and is easy to code, if you want to implement a BTree or some other backing for SortedMap go ahead. Your worst case performance for an insert is O(log n) insertion with O(n) for materializing the modified part of the sorted array. By using the map, you can defer re-materializing until you need to have your JTable render a row whose sort order has changed since the last materialization.
Justin
A: 

My first answer wasn't really what you were looking for. Now that I understand the problem better, give this a try. I only implemented the key parts. This will be a little more memory-intensive, but since the ArrayList stores references, not the objects themselves, the memory difference shouldn't be too large in comparison with the actual object storage.

class TransactionEventStore
{
    private final ArrayList<TransactionEvent> byOrder = new ArrayList<TransactionEvent>();
    private final ArrayList<TransactionEvent> byId = new ArrayList<TransactionEvent>();

    private void insertByOrder(TransactionEvent e) { this.byOrder.add(e); }

    private void insertById(TransactionEvent e)
    {
        // scan backwards: the input is almost sorted, so the insertion
        // point is usually at or very near the end
        for(int i = this.byId.size() - 1; i >= 0; i--)
            if(e.getId() > this.byId.get(i).getId())
            {
                this.byId.add(i + 1, e); // insert just after the smaller element
                return;
            }
        this.byId.add(0, e); // smallest ID seen so far
    }

    public void insert(TransactionEvent e)
    {
        this.insertByOrder(e);
        this.insertById(e);
    }
}

Now when you need to look up by insertion order, look at this.byOrder, and when you need to look up by ID, look at this.byId.

Imagist
A: 

I've cleaned things up a bit from my previous post. @Lizard, your solution is best given the property that new entries are usually at the end. The solution below should perform better if you have random arrivals, at the cost of more memory for the maps. It also lets you defer your array insertion (potentially O(n) worst case) until you actually need to draw the cell for a row below the earliest insertion point.

// sorted events (using natural ordering on eventID)
SortedSet<Event> model = new TreeSet<Event>();
ArrayList<Event> sortedList = new ArrayList<Event>();
Event lowestAddition; // low-water mark for insertions since the last materialize

public void insert(Event x) {
 // everything ordered before the low-water mark keeps its position,
 // so materialize() only has to rewrite the tail from there onwards
 if (lowestAddition == null || x.compareTo(lowestAddition) < 0)
  lowestAddition = x;

 model.add(x);
}

public void materialize() {
 if (lowestAddition == null)
  return; // nothing has changed since the last materialize

 SortedSet<Event> tailSet = model.tailSet(lowestAddition);

 int order = model.size() - tailSet.size(); // first affected row index
 for (Event x : tailSet) {
  x.setOrder(order); // keep order on Event
  if (order < sortedList.size())
   sortedList.set(order, x);
  else
   sortedList.add(x);
  order++;
 }

 lowestAddition = null;
}

Here is what your Swing code looks like; I assume you are using Swing since you want a table model:

// now your model code uses the array
public Object getValueAt(int row, int col) {
 return getColumn(sortedList.get(row), col);
}

// you can gain significant performance by deferring
// materialization until you actually need it
public class DeferredJTable extends JTable {
 protected void paintComponent(Graphics g) {
  // if you knew which rows were about to be drawn,
  // you could defer even further
  materialize();

  super.paintComponent(g);
 }
}
Justin