Hello,

I have a list of objects, say List&lt;Entity&gt;. The Entity class has an equals method, based on a few of its attributes (a business rule), to differentiate one Entity object from another.

The task we usually carry out on this list is to remove all the duplicates, something like this:

List<Entity> noDuplicates = new ArrayList<Entity>();
for (Entity entity : lstEntities) {
    int indexOf = noDuplicates.indexOf(entity);
    if (indexOf >= 0) {
        noDuplicates.get(indexOf).merge(entity);
    } else {
        noDuplicates.add(entity);
    }
}

Now, the problem I have been observing is that this part of the code slows down considerably as soon as the list has more than 10,000 objects. I understand that ArrayList is doing an O(N) search.

Is there a faster alternative? Using a HashMap is not an option, because the entity's uniqueness is built upon four of its attributes together, and it would be tedious to build that key to put into the map. Would a sorted set help with faster querying?

Thanks

+2  A: 

An idea is to use a Set instead of a List; there are no duplicates in a Set. To remove duplicates from a list, you could just add the List to a new Set:

List<Entity> list = //your list.
Set<Entity> set = new HashSet<Entity>();
set.addAll(list);

But then again, maybe there is some reason for using a List in the first place? If not, you could use a Set instead, and not have to worry about any duplicates.

EDIT

A Set does not offer index-based access to its elements (as compared to a List, where you can do get(int index)). The elements in a Set are floating around without a specific point of reference.

If you need to find a specific one, you have to iterate through them all. If that is not OK, and/or you can't do without the indexed access that get(int index) and remove(int index) provide, I guess a Set is not an option for you.

Lars Andren
Using a set will not help during the insertion, right? If I try to add a duplicate, it will not allow me to; then I need to query that object, probably using contains() and get(). Is that what you meant? If yes, how fast is get() on the set?
panzerschreck
There is no get() on a Set. There are add(Object o) and remove(Object o). If you try to add a duplicate to the Set, add(Object o) will return false.
Lars Andren
Then it really won't work for the code posted, will it? He needs to do that `merge` operation, and this won't let him.
Daniel Martin
+3  A: 

Instead of a list structure, you could use a set (more appropriate if you are concerned about Entity uniqueness), as Lars has suggested. Additionally, if performance is a problem, I'd look at using a TreeSet and implementing a Comparator for comparing Entity instances based on their attributes. The tree structure will allow for fast (logarithmic complexity) insert, delete and retrieval operations.
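
A minimal sketch of that approach, assuming the four identifying attributes are exposed through Comparable getters named getFoo(), getBar(), getBaz() and getQux() (hypothetical names). Since Set has no get(), NavigableSet.floor (Java 6+) is used here to retrieve the stored element when a duplicate is found, so it can be merged:

import java.util.Comparator;
import java.util.TreeSet;

Comparator<Entity> byIdentity = new Comparator<Entity>() {
    public int compare(Entity a, Entity b) {
        int c = a.getFoo().compareTo(b.getFoo());
        if (c != 0) return c;
        c = a.getBar().compareTo(b.getBar());
        if (c != 0) return c;
        c = a.getBaz().compareTo(b.getBaz());
        if (c != 0) return c;
        return a.getQux().compareTo(b.getQux());
    }
};

TreeSet<Entity> noDuplicates = new TreeSet<Entity>(byIdentity);
for (Entity entity : lstEntities) {
    // floor() returns the stored element equal to 'entity', if one exists: O(log N)
    Entity existing = noDuplicates.floor(entity);
    if (existing != null && byIdentity.compare(existing, entity) == 0) {
        existing.merge(entity);
    } else {
        noDuplicates.add(entity); // O(log N) insert
    }
}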

sharky
If you think a Map structure with a hash is not feasible, then this is probably the best answer. Your current call to `noDuplicates.indexOf(entity)` will have worst case performance of `O(N)` whereas a call to `TreeSet.contains()` can guarantee you `O(log(N))` performance. With a little effort on the `Comparator`, you can make it use your existing `Entity.equals` method too. (@rati: this is pretty much what you said...just adding more details)
Brent Nash
+1  A: 

It all depends on what that merge operation is doing. Does merge change any of the attributes that are compared when you do equals? If not, then you'll be amazed at how much faster it will be if you do this:

First, define a hashCode for your Entity class that is compatible with your definition of equals. One common way to do this is:

public int hashCode() {
  // assuming the four attributes that determine equality are called
  // attrFoo, attrBar, attrBaz, and attrQux
  int hash = 1;
  hash += attrFoo == null ? 0 : attrFoo.hashCode();
  hash *= 37;
  hash += attrBar == null ? 0 : attrBar.hashCode();
  hash *= 37;
  hash += attrBaz == null ? 0 : attrBaz.hashCode();
  hash *= 37;
  hash += attrQux == null ? 0 : attrQux.hashCode();

  return hash;
}

Then, use a HashMap so that you can find these things:

Map<Entity, Entity> map = new HashMap<Entity, Entity>();
for(Entity entity: lstEntities) {
  if (map.containsKey(entity)) {
    map.get(entity).merge(entity);
  } else {
    map.put(entity, entity);
  }
}
return map.values();  // or keySet(). Whichever.

I should note that I feel a bit dirty writing the above code, because you really shouldn't make Map keys that aren't immutable, but this will work and be much, much faster than what you're doing now.

Daniel Martin
This will indeed cause issues if the fields used in `Entity.hashCode()` are affected by the `merge` operation.
matt b
You might consider using a HashSet instead of a HashMap. It will automatically filter out the duplicates for you, so you could skip the `if (map.containsKey(entity))` check. Cleaner code and same algorithmic complexity.
Brent Nash
@Brent Nash: but that won't let him ever call `merge` on the entity that's stored in the structure. He needs to do that (apparently).
Daniel Martin
@Brent a Set won't allow you to get the second Entity instance with the same "identity" as the first Entity instance
matt b
@Daniel @Matt Totally correct on both counts. Missed that on my first reading. Thanks!
Brent Nash
+2  A: 

Now, the problem I have been observing is that this part of the code slows down considerably as soon as the list has more than 10,000 objects. I understand that ArrayList is doing an O(N) search.

The algorithm you posted is actually worse than O(N):

  • Iterating through the input list lstEntities - O(N)
  • Within this loop, you are calling ArrayList.indexOf(T), which has to scan the list - O(N) again

Your algorithm is actually O(N^2), since you are performing a linear scan of the list within each iteration of the loop.

It sounds like what you want to do is actually two operations:

  1. From the input List, remove any duplicates
  2. When you find duplicates, "merge" the entities.

You can do this by scanning the list just once, rather than in nested loops. I would recommend breaking up your Entity to move the fields that "identify" an Entity into another type, such as ID, or at the very least adding a getID() method which returns these fields grouped into a single type. This way you can easily build a Map between the two types to be able to merge entities with "duplicate" identities. This might look something like this:

Map<ID, Entity> map = new HashMap<ID, Entity>(inputList.size());
for (Entity e : inputList) {
    Entity existing = map.get(e.getID());
    if (existing == null) {
        //not in map, add it
        map.put(e.getID(), e);
    } else {
        existing.merge(e);
    }
}

Iterating through the list is O(n) while HashMap.get(K) is a constant-time operation.
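
For reference, a minimal sketch of such an ID type, reusing the hypothetical attribute names from the earlier answer (the field types are assumptions as well):

public final class EntityID {
    private final String attrFoo;
    private final String attrBar;
    private final String attrBaz;
    private final String attrQux;

    public EntityID(String attrFoo, String attrBar, String attrBaz, String attrQux) {
        this.attrFoo = attrFoo;
        this.attrBar = attrBar;
        this.attrBaz = attrBaz;
        this.attrQux = attrQux;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof EntityID)) return false;
        EntityID other = (EntityID) o;
        return eq(attrFoo, other.attrFoo) && eq(attrBar, other.attrBar)
                && eq(attrBaz, other.attrBaz) && eq(attrQux, other.attrQux);
    }

    // null-safe equality check
    private static boolean eq(Object a, Object b) {
        return a == null ? b == null : a.equals(b);
    }

    @Override
    public int hashCode() {
        int hash = 1;
        hash = 37 * hash + (attrFoo == null ? 0 : attrFoo.hashCode());
        hash = 37 * hash + (attrBar == null ? 0 : attrBar.hashCode());
        hash = 37 * hash + (attrBaz == null ? 0 : attrBaz.hashCode());
        hash = 37 * hash + (attrQux == null ? 0 : attrQux.hashCode());
        return hash;
    }
}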

matt b
Isn't this essentially the option the poster ruled out with his statement "using HashMap is not an option, because the entity's uniqueness is built upon 4 of its attributes together, it would be tedious to put in the key itself into the map"? I think that statement is ridiculous, but since it's in the question it should be explicitly rebutted if you're going to go against it.
Daniel Martin
Agreed with @Daniel. Also keep in mind `HashMap.get()` is only `O(1)` if you have a good hash function. With potentially 1000s of Entity objects that might be tough since @panzerschreck will have to write his/her own hashCode method.
Brent Nash
@Daniel, good point, I missed that. OK, here is my rebuttal: 1) it is trivial to write an `EntityID` type which holds those four attributes and properly implements equals() and hashCode() (use commons-lang for simplicity); 2) it is trivial to add a getID() method to `Entity` which constructs a new `EntityID` instance from the four attributes that form the "identity"; 3) the amount of work in #1 and #2 (one class, three methods) is worth the amount of computation you'll save by turning an O(N^2) algorithm into O(N).
matt b
I mean, it sounds like the best solution is being ruled out because of not wanting to write new code
matt b
Great, I will try this approach. Sounds interesting.
panzerschreck
I did not observe any noticeable change. I cannot understand what happens after this magic number 10,000; things start to get slower and slower...
panzerschreck
@panzerschreck, do you have any sort of logging or benchmarking to tell you which operations (iterating, retrieving the key, merging) might be "slowing down" more than others? Is it possible that the logic in `Entity.merge()` increases as you "merge" more and more instances into one?
matt b
Yes, I do have log4j. I have observed that the very same loop where I try to eliminate duplicates becomes slower and slower...
panzerschreck
@panzerschreck, meaning the loop I posted above that iterates through the list?
matt b
Guys, thanks for the feedback. I noticed the merge method was calling some other methods that weren't written "smartly". Once that was fixed, I noticed an improvement in performance.
panzerschreck
A: 

Unless you have a reason for needing the ordering of a List, you'd probably be best off with a Set - specifically, a HashSet.

I see your concern about using a hashed collection because "the entity's uniqueness is built upon 4 of its attributes together", but that's easily overcome. You just have to define a hashCode() method which is compatible with your existing equals() method, and then you can insert your entities into a Set and, as a magic side effect, never have to remove duplicates again.
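
Once such a hashCode() is in place (the earlier answer shows one way to write it), deduplication is a one-liner, assuming you don't need the merge step:

Set<Entity> unique = new HashSet<Entity>(lstEntities);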

CPerkins
A: 

Two simple steps for an O(N log N) algorithm:

  1. Sort the list using a Comparator based on the four important fields.
  2. Iterate over the list, comparing each item to the next one; if they are equal, merge them and remove one (a sketch follows).
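
A minimal sketch of this, reusing the hypothetical byIdentity Comparator from the TreeSet answer above:

import java.util.Collections;
import java.util.Iterator;

Collections.sort(lstEntities, byIdentity); // O(N log N)

Iterator<Entity> it = lstEntities.iterator();
Entity previous = it.hasNext() ? it.next() : null;
while (it.hasNext()) { // single O(N) pass over the sorted list
    Entity current = it.next();
    if (byIdentity.compare(previous, current) == 0) {
        previous.merge(current); // duplicate of the previous element: merge it in
        it.remove();             // and drop it from the list
    } else {
        previous = current;
    }
}

Note that Iterator.remove() on an ArrayList is itself linear, so with many duplicates it may be faster to copy the survivors into a new list instead.
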
Tim Bender