ansaurus

Question

How to efficiently (performance) remove many items from List in Java?

Answer 1

+1 A:

I would imagine that building a new list, rather than modifying the existing list, would be more performant - especially when the number of items is as large as you indicate. This assumes, your list is an ArrayList, not a LinkedList. For a non-circular LinkedList, insertion is O(n), but removal at an existing iterator position is O(1); in which case your naive algorithm should be sufficiently performant.

Unless the list is a LinkedList, the cost of shifting the list each time you call remove() is likely one of the most expensive parts of the implementation. For array lists, I would consider using:

public static <T> List<T> removeMany(List<T> items) {
    List<T> newList = new ArrayList<T>(items.size());
    Iterator<T> iter = items.iterator();
    while (iter.hasNext()) {
        T item = iter.next();
        // <cond> goes here
        if (/*<cond>: */i++ % 2 != 0) {
            newList.add(item);
        }
    }
    return newList;
}

LBushkin 2010-01-11 18:09:10

You should negate the condition and you don't need `i++`.

notnoop 2010-01-11 18:14:40

Yes, I agree, I've made those edits.

LBushkin 2010-01-11 18:16:09

I think you forgot to negate removal condition to get retain condition ;-) but I understand your point and it is (IMHO) best solution I know for now...

WildWezyr 2010-01-11 18:19:19

"performant" is not a word http://weblogs.asp.net/jgalloway/archive/2007/05/10/performant-isn-t-a-word.aspx

Steve Kuo 2010-01-11 18:45:54

@notnoop i'm just curious: why you suggest removing i++ as separate statement? will it help in any way? i intentionally left it alone because it is not part of removal condition (cond) and code seems few chars longer but clearer... micro-separation of concerns ;-).

WildWezyr 2010-01-11 19:30:21

Saying something is not a word even when it is in common use in both written and spoken parlance seems foolish to me. Languages are living entities and evolve as new concepts and ideas arise and need to be communicated efficiently. Having a dictionary recognize it, does not make it a word - having people use and understand it, does.

LBushkin 2010-01-11 19:33:32

I don't get why you use generics and then fail to eliminate the three superfluous lines of Iterator nonsense when you could replace it with `for (T item : items)`. An iterator is only useful if you might be calling itr.remove() imo.

Jherico 2010-01-11 19:57:39

@Jherico: please stop over-optimizing code length ;-). given code is kept as similar to my original example as possible, so it is easier to understand. i was prompting for code performance not code conciseness ;-).

WildWezyr 2010-01-11 20:16:37

@Jherico: but you are right, of course.

WildWezyr 2010-01-11 20:17:23

Answer 2

+1 A:

One thing you could try is to use a LinkedList instead of an ArrayList, as with an ArrayList all other elements need to be copied if elements are removed from within the list.

Fabian Steeg 2010-01-11 18:11:04

Why is using LinkedList any better in my case than solution given by LBushkin (create new list)? Or is it just as good (considering performance) as new list?

WildWezyr 2010-01-11 18:22:36

I think for your case adding to a new `LinkedList` or removing from an existing `LinkedList` should be the same, as both are O(1), and you remove half of the elements. If you remove more elements than you leave, adding to a new list should be faster, and vice versa.

Fabian Steeg 2010-01-11 18:55:52

Answer 3

+5 A:

Removing a lot of elements from an ArrayList is an O(n^2) operation. I would recommend simply using a LinkedList that's more optimized for insertion and removal (but not for random access). LinkedList has a bit of a memory overhead.

If you do need to keep ArrayList, then you are better off creating a new list.

Update: Comparing with creating a new list:

Reusing the same list, the main cost is coming from deleting the node and updating the appropriate pointers in LinkedList. This is a constant operation for any node.

When constructing a new list, the main cost is coming from creating the list, and initializing array entries. Both are cheap operations. You might incurre the cost of resizing the new list backend array as well; assuming that the final array is larger than half of the incoming array.

So if you were to remove only one element, then LinkedList approach is probably faster. If you were to delete all nodes except for one, probably the new list approach is faster.

There are more complications when you bring memory management and GC. I'd like to leave these out.

The best option is to implement the alternatives yourself and benchmark the results when running your typical load.

notnoop 2010-01-11 18:11:42

Why is using LinkedList any better in my case than solution given by LBushkin (create new list)? Or is it just as good (considering performance) as new list?

WildWezyr 2010-01-11 18:23:17

It depends on your usage, if you have a need to remove a large number of items from the list often or if your function is simply "given this list, return a list with N items removed".

matt b 2010-01-11 18:28:56

Probably also depends on how many items you are removing vs how many are in the list, i.e. if you have a list with 10,000 items and you only need to remove 2 versus a list with 10,000 items where you need to remove 9,999 of them.

matt b 2010-01-11 18:29:58

@WildWezyr, I updated my answer.

notnoop 2010-01-11 18:32:06

@matt b: If the order of the elements in the list is unimportant, it is actually possible to make both LinkedList and ArrayList equally performant for removal. For example, rather than calling `remove()` on the element of an ArrayList, you would replace the value at the current index with the item at the tail of the list. And then call remove() on the last item. This causes no items to be shifted, but does change the order of items in the list.

LBushkin 2010-01-11 18:35:42

@notnoop: there are generally two kinds of approach for good solution: 1) create new list, 2) use LinkedList as input list. i will definitely test these two kinds and post my results here. but before i do that... i think removing one item from LinkedList will not give significantly better performance as whole list must be traversed in both approaches. this traversing gives O(n) total cost no mather how many elements are to be removed. this is just my thought before performing tests...

WildWezyr 2010-01-11 20:00:02

It's true that you do need to traverse the entire list in both. However, one involves an additional pointer modification, while the other requires allocation and a lot of pointer assignments. `O(n)` operations are "equivalent" asymptotically, but in practice they differ a lot.

notnoop 2010-01-11 20:26:14

@LBushkin: I used your concept to replace unnecessary values at current index with items from tail. It is MagicRemoveManyPerformer in my own answer. You are right - it performs as good as LinkedRemoveManyPerformer while other solutions perform worse. Thanks for this idea!

WildWezyr 2010-01-12 13:11:39

Answer 4

+4 A:

I would make a new List to add the items to, since removing an item from the middle of a List is quite expensive.

public static List<T> removeMany(List<T> items) {
    List<T> tempList = new ArrayList<T>(items.size()/2); //if about half the elements are going to be removed
    Iterator<T> iter = items.iterator();
    while (item : items) {
        // <cond> goes here
        if (/*<cond>: */i % 2 != 0) {
            tempList.add(item);
        }
    }
    return tempList;
}

EDIT: I haven't tested this, so there may very well be small syntax errors.

Second EDIT: Using a LinkedList is better when you don't need random access but fast add times.

BUT...

The constant factor for ArrayList is smaller than that for LinkedList (Ref). Since you can make a reasonable guess of how many elements will be removed (you said "about half" in your question), adding an element to the end of an ArrayList is O(1) as long as you don't have to re-allocate it. So, if you can make a reasonable guess, I would expect the ArrayList to be marginally faster than the LinkedList in most cases. (This applies to the code I have posted. In your naive implementatation, I think LinkedList will be faster).

Chinmay Kanchi 2010-01-11 18:17:04

Answer 5

A:

Try implementing recursion into your algorithm.

AlastairDewar 2010-01-11 18:24:52

Could you give an example of what you mean?

WildWezyr 2010-01-11 18:26:28

-1 Recursion is almost always slower than iteration. In addition, this is a very simple algorithm that doesn't benefit in the slightest from recursion. In C/C++, recursion _might_ have been a good choice _if_ the length of the list was not known, but that isn't the case here.

Chinmay Kanchi 2010-01-11 18:35:57

@chinmay, recursion isn't always slower than iteration, especially with tail optimizations and such.

notnoop 2010-01-11 18:55:40

True, but it usually is. And if it isn't the, performance is roughly equivalent. Tail-call optimization essentially turns the recursion into an iteration. Recursion doesn't make sense for something as simple as this, especially when performance is the most important thing.

Chinmay Kanchi 2010-01-11 19:39:09

Besides, the Sun JVM/JDK doesn't currently support tail-call optimization (I had to look this up).

Chinmay Kanchi 2010-01-11 19:44:50

I can only assume this was a joke answer.

Kevin Bourrillion 2010-01-19 18:51:05

Answer 6

A:

Maybe a List isn't the optimal data structure for you? Can you change it? Perhaps you can use a tree where the items are sorted into in a way that deleting one node removes all items that meet the condition? Or that at least speed up your operations?

In your simplistic example using two lists (one with the items where i % 2 != 0 is true and one with the items where i % 2 != 0 is false) could serve well. But this is of course very domain dependent.

Arne 2010-01-11 18:52:25

removal conditions may vary and they can be predetermined, so there is no way to prepare items in special order/structe etc. because it is not known when source list (named items) is being populated with items.

WildWezyr 2010-01-11 19:34:44

Answer 7

+6 A:

As others have said, your first inclination should be to just build up a second list.

But, if you'd like to also try out editing the list in-place, the efficient way to do that is to use Iterables.removeIf() from Guava. If its argument is a list, it coalesces the retained elements toward the front then simply chops off the end -- much faster than removing() interior elements one by one.

Kevin Bourrillion 2010-01-11 19:22:41

Damn, was totally coming in here to pimp Guava (or google-collections, if you need something that's already available in binary form and can be found in the public maven repos... gentle prod, Kevin) but you've beaten me to it.

Cowan 2010-01-12 00:35:47

Yeah, unfortunately Iterables.removeIf() did not exist in google-collections; it's new in Guava since then!

Kevin Bourrillion 2010-01-12 02:24:28

Answer 8

+1 A:

Use Apache Commons Collections. Specifically this function. This is implemented in essentially the same way that people are suggesting that you implement it (i.e. create a new list and then add to it).

Jherico 2010-01-11 20:01:54

Answer 9

+1 A:

Since speed is the most important metric, there's the possibility of using more memory and doing less recreation of lists (as mentioned in my comment). Actual performance impact would be fully dependent on how the functionality is used, though.

The algorithm assumes that at least one of the following is true:

all elements of the original list do not need to be tested. This could happen if we're really looking for the first N elements that match our condition, rather than all elements that match our condition.
it's more expensive to copy the list into new memory. This could happen if the original list uses more than 50% of allocated memory, so working in-place could be better or if memory operations turn out to be slower (that would be an unexpected result).
the speed penalty of removing elements from the list is too large to accept all at once, but spreading that penalty across multiple operations is acceptable, even if the overall penalty is larger than taking it all at once. This is like taking out a $200K mortgage: paying $1000 per month for 30 years is affordable on a monthly basis and has the benefits of owning a home and equity, even though the overall payment is 360K over the life of the loan.

Disclaimer: There's prolly syntax errors - I didn't try compiling anything.

First, subclass the ArrayList

public class ConditionalArrayList extends ArrayList {

  public Iterator iterator(Condition condition)
  { 
    return listIterator(condition);
  }

  public ListIterator listIterator(Condition condition)
  {
    return new ConditionalArrayListIterator(this.iterator(),condition); 
  }

  public ListIterator listIterator(){ return iterator(); }
  public iterator(){ 
    throw new InvalidArgumentException("You must specify a condition for the iterator"); 
  }
}

Then we need the helper classes:

public class ConditionalArrayListIterator implements ListIterator
{
  private ListIterator listIterator;
  Condition condition;

  // the two following flags are used as a quick optimization so that 
  // we don't repeat tests on known-good elements unnecessarially.
  boolean nextKnownGood = false;
  boolean prevKnownGood = false;

  public ConditionalArrayListIterator(ListIterator listIterator, Condition condition)
  {
    this.listIterator = listIterator;
    this.condition = condition;
  }

  public void add(Object o){ listIterator.add(o); }

  /**
   * Note that this it is extremely inefficient to 
   * call hasNext() and hasPrev() alternatively when
   * there's a bunch of non-matching elements between
   * two matching elements.
   */
  public boolean hasNext()
  { 
     if( nextKnownGood ) return true;

     /* find the next object in the list that 
      * matches our condition, if any.
      */
     while( ! listIterator.hasNext() )
     {
       Object next = listIterator.next();
       if( condition.matches(next) ) {
         listIterator.set(next);
         nextKnownGood = true;
         return true;
       }
     }

     nextKnownGood = false;
     // no matching element was found.
     return false;
  }

  /**
   *  See hasPrevious for efficiency notes.
   *  Copy & paste of hasNext().
   */
  public boolean hasPrevious()
  { 
     if( prevKnownGood ) return true;

     /* find the next object in the list that 
      * matches our condition, if any.
      */
     while( ! listIterator.hasPrevious() )
     {
       Object prev = listIterator.next();
       if( condition.matches(prev) ) {
         prevKnownGood = true;
         listIterator.set(prev);
         return true;
       }
     }

     // no matching element was found.
     prevKnwonGood = false;
     return false;
  }

  /** see hasNext() for efficiency note **/
  public Object next()
  {
     if( nextKnownGood || hasNext() ) 
     { 
       prevKnownGood = nextKnownGood;
       nextKnownGood = false;
       return listIterator.next();
     }

     throw NoSuchElementException("No more matching elements");
  }

  /** see hasNext() for efficiency note; copy & paste of next() **/
  public Object previous()
  {
     if( prevKnownGood || hasPrevious() ) 
     { 
       nextKnownGood = prevKnownGood;
       prevKnownGood = false;
       return listIterator.previous();                        
     }
     throw NoSuchElementException("No more matching elements");
  }

  /** 
   * Note that nextIndex() and previousIndex() return the array index
   * of the value, not the number of results that this class has returned.
   * if this isn't good for you, just maintain your own current index and
   * increment or decriment in next() and previous()
   */
  public int nextIndex(){ return listIterator.previousIndex(); }
  public int previousIndex(){ return listIterator.previousIndex(); }

  public remove(){ listIterator.remove(); }
  public set(Object o) { listIterator.set(o); }
}

and, of course, we need the condition interface:

/** much like a comparator... **/
public interface Condition
{
  public boolean matches(Object obj);
}

And a condition with which to test

public class IsEvenCondition {
{
  public boolean matches(Object obj){ return (Number(obj)).intValue() % 2 == 0;
}

and we're finally ready for some test code


    Condition condition = new IsEvenCondition();

    System.out.println("preparing items");
    startMillis = System.currentTimeMillis();
    List<Integer> items = new ArrayList<Integer>(); // Integer is for demo
    for (int i = 0; i < 1000000; i++) {
        items.add(i * 3); // just for demo
    }
    endMillis = System.currentTimeMillis();
    System.out.println("It took " + (endmillis-startmillis) + " to prepare the list. ");

    System.out.println("deleting items");
    startMillis = System.currentTimeMillis();
    // we don't actually ever remove from this list, so 
    // removeMany is effectively "instantaneous"
    // items = removeMany(items);
    endMillis = System.currentTimeMillis();
    System.out.println("after remove: items.size=" + items.size() + 
            " and it took " + (endMillis - startMillis) + " milli(s)");
    System.out.println("--> NOTE: Nothing is actually removed.  This algorithm uses extra"
                       + " memory to avoid modifying or duplicating the original list.");

    System.out.println("About to iterate through the list");
    startMillis = System.currentTimeMillis();
    int count = iterate(items, condition);
    endMillis = System.currentTimeMillis();
    System.out.println("after iteration: items.size=" + items.size() + 
            " count=" + count + " and it took " + (endMillis - startMillis) + " milli(s)");
    System.out.println("--> NOTE: this should be somewhat inefficient."
                       + " mostly due to overhead of multiple classes."
                       + " This algorithm is designed (hoped) to be faster than "
                       + " an algorithm where all elements of the list are used.");

    System.out.println("About to iterate through the list");
    startMillis = System.currentTimeMillis();
    int total = addFirst(30, items, condition);
    endMillis = System.currentTimeMillis();
    System.out.println("after totalling first 30 elements: total=" + total + 
            " and it took " + (endMillis - startMillis) + " milli(s)");

...

private int iterate(List<Integer> items, Condition condition)
{
  // the i++ and return value are really to prevent JVM optimization
  // - just to be safe.
  Iterator iter = items.listIterator(condition);
  for( int i=0; iter.hasNext()); i++){ iter.next(); }
  return i;
}

private int addFirst(int n, List<Integer> items, Condition condition)
{
  int total = 0;
  Iterator iter = items.listIterator(condition);
  for(int i=0; i<n;i++)
  {
    total += ((Integer)iter.next()).intValue();
  }
}

atk 2010-01-12 07:02:22

this seems overcomplicated. i don't know where your initial assumptions came from. i have specified my problem clearly (IMHO) and even given test code for it. all that matters is run time of my test code with different removeMany implementations. the faster, the better - it is as simple as that.

WildWezyr 2010-01-12 08:40:40

@WildWezyr: It very well could be an overcomplicated solution to your problem. That's why I asked the extra questions in my comment on the question (beyond "is speed the only factor").

atk 2010-01-12 12:57:06

Also, it may simply be the implementation that's overly complicated. If you don't need a generalized solution, and if you're performing operating on the data after removing items exactly once, you could still use the underlying algorithm.

atk 2010-01-12 13:02:08

Answer 10

+2 A:

I'm sorry, but all these answers are missing the point, I think: You probably don't have to, and probably shouldn't, use a List.

If this kind of "query" is common, why not build an ordered data structure that eliminates the need to traverse all the data nodes? You don't tell us enough about the problem, but given the example you provide a simple tree could do the trick. There's an insertion overhead per item, but you can very quickly find the subtree containing nodes that match , and you therefore avoid most of the comparisons you're doing now.

Furthermore:

Depending on the exact problem, and the exact data structure you set up, you can speed up deletion -- if the nodes you want to kill do reduce to a subtree or something of the sort, you just drop that subtree, rather than updating a whole slew of list nodes.
Each time you remove a list item, you are updating pointers -- eg lastNode.next and nextNode.prev or something -- but if it turns out you also want to remove the nextNode, then the pointer update you just caused is thrown away by a new update.)

fullreset 2010-01-12 09:56:07

probably you are right: better source structure should perform better in some cases. but it is hard to select dedicated structure if you don't know what removal condition will be. if removal condition is fixed - then better data structure can be selected at the very beginning. however it is not the case in my question, so solution must operate on List and this is what i'm asking for - what about efficient removal from List.

WildWezyr 2010-01-12 11:09:21

Answer 11

+3 A:

OK, it's time for test results of proposed approaches. Here what approaches I have tested (name of each approach is also class name in my sources):

NaiveRemoveManyPerformer - ArrayList with iterator and remove - first and naive implementation given in my question.
BetterNaiveRemoveManyPerformer - ArrayList with backward iteration and removal from end to front.
LinkedRemoveManyPerformer - naive iterator and remove but working on LinkedList. Disadventage: works only for LinkedList.
CreateNewRemoveManyPerformer - ArrayList is made as a copy (only retained elements are added), iterator is used to traverse input ArrayList.
SmartCreateNewRemoveManyPerformer - better CreateNewRemoveManyPerformer - initial size (capacity) of result ArrayList is set to final list size. Disadvantage: you must know final size of list when starting.
FasterSmartCreateNewRemoveManyPerformer - even better (?) SmartCreateNewRemoveManyPerformer - use item index (items.get(idx)) instead of iterator.
MagicRemoveManyPerformer - works in place (no list copy) for ArrayList and compacts holes (removed items) from beginning with items from end of the list. Disadventage: this approach changes order of items in list.
ForwardInPlaceRemoveManyPerformer - works in place for ArrayList - move retaining items to fill holes, finally subList is returned (no final removal or clear).
GuavaArrayListRemoveManyPerformer - Google Guava Iterables.removeIf used for ArrayList - almost the same as ForwardInPlaceRemoveManyPerformer but does final removal of items at the end of list.

Full source code is given at the end of this answer.

Tests where performed with different list sizes (from 10,000 items to 10,000,000 items) and different remove factors (specifying how many items must be removed from list).

As I posted here in comments for other answers - I have thought that copying items from ArrayList to second ArrayList will be faster than iterating LinkedList and just removing items. Sun's Java Documentation says that constant factor of ArrayList is low compared to that for the LinkedList implementation, but surprisingly this is not the case in my problem.

In practice LinkedList with simple iteration and removal has best performance in most cases (this approach is implemented in LinkedRemoveManyPerformer). Usually only MagicRemoveManyPerformer performance is comparable to LinkedRemoveManyPerformer, other approaches are significantly slower. Google Guava GuavaArrayListRemoveManyPerformer is slower than hand made similar code (because my code does not remove unnecessary items at end of list).

Example results for removing 500,000 items from 1,000,000 source items:

NaiveRemoveManyPerformer: test not performed - I'm not that patient, but it performs worse than BetterNaiveRemoveManyPerformer.
BetterNaiveRemoveManyPerformer: 226080 milli(s)
LinkedRemoveManyPerformer: 69 milli(s)
CreateNewRemoveManyPerformer: 246 milli(s)
SmartCreateNewRemoveManyPerformer: 112 milli(s)
FasterSmartCreateNewRemoveManyPerformer: 202 milli(s)
MagicRemoveManyPerformer: 74 milli(s)
ForwardInPlaceRemoveManyPerformer: 69 milli(s)
GuavaArrayListRemoveManyPerformer: 118 milli(s)

Example results for removing 1 item from 1,000,000 source items (first item is removed):

BetterNaiveRemoveManyPerformer: 34 milli(s)
LinkedRemoveManyPerformer: 41 milli(s)
CreateNewRemoveManyPerformer: 253 milli(s)
SmartCreateNewRemoveManyPerformer: 108 milli(s)
FasterSmartCreateNewRemoveManyPerformer: 71 milli(s)
MagicRemoveManyPerformer: 43 milli(s)
ForwardInPlaceRemoveManyPerformer: 73 milli(s)
GuavaArrayListRemoveManyPerformer: 78 milli(s)

Example results for removing 333,334 items from 1,000,000 source items:

BetterNaiveRemoveManyPerformer: 253206 milli(s)
LinkedRemoveManyPerformer: 69 milli(s)
CreateNewRemoveManyPerformer: 245 milli(s)
SmartCreateNewRemoveManyPerformer: 111 milli(s)
FasterSmartCreateNewRemoveManyPerformer: 203 milli(s)
MagicRemoveManyPerformer: 69 milli(s)
ForwardInPlaceRemoveManyPerformer: 72 milli(s)
GuavaArrayListRemoveManyPerformer: 102 milli(s)

Example results for removing 1,000,000 (all) items from 1,000,000 source items (all items are removed but with one-by-one processing, if you know a priori that all items are to be removed, list should be simply cleared):

BetterNaiveRemoveManyPerformer: 58 milli(s)
LinkedRemoveManyPerformer: 88 milli(s)
CreateNewRemoveManyPerformer: 95 milli(s)
SmartCreateNewRemoveManyPerformer: 91 milli(s)
FasterSmartCreateNewRemoveManyPerformer: 48 milli(s)
MagicRemoveManyPerformer: 61 milli(s)
ForwardInPlaceRemoveManyPerformer: 49 milli(s)
GuavaArrayListRemoveManyPerformer: 133 milli(s)

My final conclusions: use hybrid approach - if dealing with LinkedList - simple iteration and removal is best, if dealing with ArrayList - it depends if item order is important - use ForwardInPlaceRemoveManyPerformer then, if item order may be changed - best choice is MagicRemoveManyPerformer. If remove factor is known a priori (you know how many items will be removed vs retained) then some more conditionals may be put to select approach performing even better in particular situation. But known remove factor is not usual case... Google Guava Iterables.removeIf is such a hybrid solution but with slightly different assumption (original list must be changed, new one cannot be created and item order always matters) - these are most common assumptions so removeIf is best choice in most real-life cases.

Notice also that all good approaches (naive is not good!) are good enough - any one of them shold do just fine in real application, but naive approach must be avoided.

At last - my source code for testing.

package WildWezyrListRemovalTesting;

import com.google.common.base.Predicate;
import com.google.common.collect.Iterables;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

public class RemoveManyFromList {

    public static abstract class BaseRemoveManyPerformer {

        protected String performerName() {
            return getClass().getSimpleName();
        }

        protected void info(String msg) {
            System.out.println(performerName() + ": " + msg);
        }

        protected void populateList(List<Integer> items, int itemCnt) {
            for (int i = 0; i < itemCnt; i++) {
                items.add(i);
            }
        }

        protected boolean mustRemoveItem(Integer itemVal, int itemIdx, int removeFactor) {
            if (removeFactor == 0) {
                return false;
            }
            return itemIdx % removeFactor == 0;
        }

        protected abstract List<Integer> removeItems(List<Integer> items, int removeFactor);

        protected abstract List<Integer> createInitialList();

        public void testMe(int itemCnt, int removeFactor) {
            List<Integer> items = createInitialList();
            populateList(items, itemCnt);
            long startMillis = System.currentTimeMillis();
            items = removeItems(items, removeFactor);
            long endMillis = System.currentTimeMillis();
            int chksum = 0;
            for (Integer item : items) {
                chksum += item;
            }
            info("removing took " + (endMillis - startMillis)
                    + " milli(s), itemCnt=" + itemCnt
                    + ", removed items: " + (itemCnt - items.size())
                    + ", remaining items: " + items.size()
                    + ", checksum: " + chksum);
        }
    }
    private List<BaseRemoveManyPerformer> rmps =
            new ArrayList<BaseRemoveManyPerformer>();

    public void addPerformer(BaseRemoveManyPerformer rmp) {
        rmps.add(rmp);
    }
    private Runtime runtime = Runtime.getRuntime();

    private void runGc() {
        for (int i = 0; i < 5; i++) {
            runtime.gc();
        }
    }

    public void testAll(int itemCnt, int removeFactor) {
        runGc();
        for (BaseRemoveManyPerformer rmp : rmps) {
            rmp.testMe(itemCnt, removeFactor);
        }
        runGc();
        System.out.println("\n--------------------------\n");
    }

    public static class NaiveRemoveManyPerformer
            extends BaseRemoveManyPerformer {

        @Override
        public List<Integer> removeItems(List<Integer> items, int removeFactor) {
            if (items.size() > 300000 && items instanceof ArrayList) {
                info("this removeItems is too slow, returning without processing");
                return items;
            }
            int i = 0;
            Iterator<Integer> iter = items.iterator();
            while (iter.hasNext()) {
                Integer item = iter.next();
                if (mustRemoveItem(item, i, removeFactor)) {
                    iter.remove();
                }
                i++;
            }
            return items;
        }

        @Override
        public List<Integer> createInitialList() {
            return new ArrayList<Integer>();
        }
    }

    public static class BetterNaiveRemoveManyPerformer
            extends NaiveRemoveManyPerformer {

        @Override
        public List<Integer> removeItems(List<Integer> items, int removeFactor) {
//            if (items.size() > 300000 && items instanceof ArrayList) {
//                info("this removeItems is too slow, returning without processing");
//                return items;
//            }

            for (int i = items.size(); --i >= 0;) {
                Integer item = items.get(i);
                if (mustRemoveItem(item, i, removeFactor)) {
                    items.remove(i);
                }
            }
            return items;
        }
    }

    public static class LinkedRemoveManyPerformer
            extends NaiveRemoveManyPerformer {

        @Override
        public List<Integer> createInitialList() {
            return new LinkedList<Integer>();
        }
    }

    public static class CreateNewRemoveManyPerformer
            extends NaiveRemoveManyPerformer {

        @Override
        public List<Integer> removeItems(List<Integer> items, int removeFactor) {
            List<Integer> res = createResultList(items, removeFactor);
            int i = 0;

            for (Integer item : items) {
                if (mustRemoveItem(item, i, removeFactor)) {
                    // no-op
                } else {
                    res.add(item);
                }
                i++;
            }

            return res;
        }

        protected List<Integer> createResultList(List<Integer> items, int removeFactor) {
            return new ArrayList<Integer>();
        }
    }

    public static class SmartCreateNewRemoveManyPerformer
            extends CreateNewRemoveManyPerformer {

        @Override
        protected List<Integer> createResultList(List<Integer> items, int removeFactor) {
            int newCapacity = removeFactor == 0 ? items.size()
                    : (int) (items.size() * (removeFactor - 1L) / removeFactor + 1);
            //System.out.println("newCapacity=" + newCapacity);
            return new ArrayList<Integer>(newCapacity);
        }
    }

    public static class FasterSmartCreateNewRemoveManyPerformer
            extends SmartCreateNewRemoveManyPerformer {

        @Override
        public List<Integer> removeItems(List<Integer> items, int removeFactor) {
            List<Integer> res = createResultList(items, removeFactor);

            for (int i = 0; i < items.size(); i++) {
                Integer item = items.get(i);
                if (mustRemoveItem(item, i, removeFactor)) {
                    // no-op
                } else {
                    res.add(item);
                }
            }

            return res;
        }
    }

    public static class ForwardInPlaceRemoveManyPerformer
            extends NaiveRemoveManyPerformer {

        @Override
        public List<Integer> removeItems(List<Integer> items, int removeFactor) {
            int j = 0; // destination idx
            for (int i = 0; i < items.size(); i++) {
                Integer item = items.get(i);
                if (mustRemoveItem(item, i, removeFactor)) {
                    // no-op
                } else {
                    if (j < i) {
                        items.set(j, item);
                    }
                    j++;
                }
            }

            return items.subList(0, j);
        }
    }

    public static class MagicRemoveManyPerformer
            extends NaiveRemoveManyPerformer {

        @Override
        public List<Integer> removeItems(List<Integer> items, int removeFactor) {
            for (int i = 0; i < items.size(); i++) {
                if (mustRemoveItem(items.get(i), i, removeFactor)) {
                    Integer retainedItem = removeSomeFromEnd(items, removeFactor, i);
                    if (retainedItem == null) {
                        items.remove(i);
                        break;
                    }
                    items.set(i, retainedItem);
                }
            }

            return items;
        }

        private Integer removeSomeFromEnd(List<Integer> items, int removeFactor, int lowerBound) {
            for (int i = items.size(); --i > lowerBound;) {
                Integer item = items.get(i);
                items.remove(i);
                if (!mustRemoveItem(item, i, removeFactor)) {
                    return item;
                }
            }
            return null;
        }
    }

    public static class GuavaArrayListRemoveManyPerformer
            extends BaseRemoveManyPerformer {

        @Override
        protected List<Integer> removeItems(List<Integer> items, final int removeFactor) {
            Iterables.removeIf(items, new Predicate<Integer>() {

                public boolean apply(Integer input) {
                    return mustRemoveItem(input, input, removeFactor);
                }
            });

            return items;
        }

        @Override
        protected List<Integer> createInitialList() {
            return new ArrayList<Integer>();
        }
    }

    public void testForOneItemCnt(int itemCnt) {
        testAll(itemCnt, 0);
        testAll(itemCnt, itemCnt);
        testAll(itemCnt, itemCnt - 1);
        testAll(itemCnt, 3);
        testAll(itemCnt, 2);
        testAll(itemCnt, 1);
    }

    public static void main(String[] args) {
        RemoveManyFromList t = new RemoveManyFromList();
        t.addPerformer(new NaiveRemoveManyPerformer());
        t.addPerformer(new BetterNaiveRemoveManyPerformer());
        t.addPerformer(new LinkedRemoveManyPerformer());
        t.addPerformer(new CreateNewRemoveManyPerformer());
        t.addPerformer(new SmartCreateNewRemoveManyPerformer());
        t.addPerformer(new FasterSmartCreateNewRemoveManyPerformer());
        t.addPerformer(new MagicRemoveManyPerformer());
        t.addPerformer(new ForwardInPlaceRemoveManyPerformer());
        t.addPerformer(new GuavaArrayListRemoveManyPerformer());

        t.testForOneItemCnt(1000);
        t.testForOneItemCnt(10000);
        t.testForOneItemCnt(100000);
        t.testForOneItemCnt(200000);
        t.testForOneItemCnt(300000);
        t.testForOneItemCnt(500000);
        t.testForOneItemCnt(1000000);
        t.testForOneItemCnt(10000000);
    }
}

WildWezyr 2010-01-12 11:02:40

It's great that you went to so much trouble to empirically test these approaches, so it hurts to have to inform you that these measurements are inconclusive.Getting a meaningful microbenchmark of Java code is extremely, extremely difficult. For every one thing you can do right, there are a hundred things you can do wrong, that will skew the results enormously.As just a sample, you need to run the test repeatedly for a good ~10 seconds or so *before* you start measuring. Each measurement should measure "many" repetitions. Continue measuring until the result stabilizes....

Kevin Bourrillion 2010-01-19 18:48:06

... Measure only *one* thing per VM invocation (this is important). Run each measurement many times -- you'll be surprised how inconsistent the results can be.This only scratches the surface; if you do all these things and follow dozens more pieces of advice, your benchmark results will most likely, like mine, still be of dubious significance. Such is the state of affairs with microbenchmarks in Java.

Kevin Bourrillion 2010-01-19 18:49:52

@Kevin: that is great that you too are aware of theoretical considerations of microbenchmark difficulties in Java ;-). but what is your point in my particular situation? what is wrong with my code or with my conslusions? please give better code or better conclusions. you already know that my conclusions are wrong? have you seen warm up stage in my code (it is quite hidden but it is there). please give more details, code fixes or just your better conclusions for my tests...

WildWezyr 2010-01-19 21:25:38

@Kevin: don't know why you assumed that my measurements was not stabilied (is it because i did not mention something wise about microbencharks in Java?). they were quite stable and there was little relative deviation as i was executing my tests for many times with different set of active approaches in each test. then, when i was quite sure that my results are stabilized, i gathered them and posted in my answer as my conclusions.

WildWezyr 2010-01-19 21:32:18

I don't have time to teach microbenchmarking 101, particularly because if anyone in the world were there to teach microbenchmarking 101, I'd be the first person to sign up to take the class. I don't know what works -- but I've sure seen enough of what *doesn't* work to be very skeptical of *any* java microbenchmark. I don't mean any discredit to you -- to paraphrase George Costanza, "it's not you, it's the JVM/etc."

Kevin Bourrillion 2010-01-20 18:07:45

Here are some links for your reading pleasure illustrating some of the pitfalls:http://wikis.sun.com/display/HotSpotInternals/MicroBenchmarkshttp://www.ibm.com/developerworks/java/library/j-jtp12214/http://www.ibm.com/developerworks/java/library/j-jtp02225.htmlhttp://wiki.jvmlangsummit.com/MindTheSemanticGap

Kevin Bourrillion 2010-01-20 18:09:14

@Kevin: I've already had "pleasure" reading articles from your links some time ago. The pleasure was even bigger because I had already known what is stated there. I do code optimization for about 10 years (different languages: Java, SQL statements etc.), so I must be familiar how to spot bottlenecks, benchmark optimizations etc. Thanks again for bunch of theoretical infos... But please - be constructive - give insights what is wrong in my code above, why exactly my conclusions are wrong, how to fix them.

WildWezyr 2010-01-21 10:06:28

Answer 12

A:

atk 2010-01-12 14:31:09

ansaurus

tags:

views:

answers:

How to efficiently (performance) remove many items from List in Java?

related questions