views: 249
answers: 6

I have a program which fetches records from a database (using Hibernate) and puts them into a Vector. There was an issue with the performance of the operation, so I ran a test with the Vector replaced by a HashSet. With 300,000 records, the speed gain is immense: 45 mins down to 2 mins!

So my question is: what is causing this huge difference? Is it just that all methods in Vector are synchronized, or that Vector internally uses an array whereas HashSet does not? Or something else?

The code is running in a single thread.

EDIT: The code is only inserting the values in the Vector (and in the other case, HashSet).

+1  A: 

Vector is synchronized by default; HashSet is not. That's my guess. Obtaining a monitor for access takes time.

I don't know if there are reads in your test, but lookups are O(1) in both cases: get() for Vector entries by index, contains() for the HashSet.

duffymo
+6  A: 

If it's trying to use the Vector as a set, and checking for the existence of a record before adding it, then filling the vector becomes an O(n^2) operation, compared with O(n) for HashSet. It would also become an O(n^2) operation if you insert each element at the start of the vector instead of at the end.
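To make the contrast concrete, here is a minimal sketch of that contains-before-add pattern against a plain HashSet. The record values are hypothetical stand-ins for the real database records, and the count is scaled down to 30,000 so the quadratic version still finishes quickly:

```java
import java.util.*;

public class ContainsVsSet {
  public static void main(String[] args) {
    // Hypothetical stand-in for the real records; 30,000 rather than
    // 300,000 so the quadratic version finishes in reasonable time.
    int n = 30000;

    List<String> vector = new Vector<String>();
    long start = System.currentTimeMillis();
    for (int i = 0; i < n; i++) {
      String record = "record-" + i;
      if (!vector.contains(record)) { // linear scan on every insert: O(n^2) overall
        vector.add(record);
      }
    }
    long vectorMillis = System.currentTimeMillis() - start;

    Set<String> set = new HashSet<String>();
    start = System.currentTimeMillis();
    for (int i = 0; i < n; i++) {
      set.add("record-" + i); // hash lookup handles uniqueness: O(1) expected per insert
    }
    long setMillis = System.currentTimeMillis() - start;

    System.out.println("Vector + contains(): " + vectorMillis + "ms");
    System.out.println("HashSet: " + setMillis + "ms");
  }
}
```

The gap between the two timings grows quadratically as you increase n, which is exactly the "try different numbers of records" test suggested below.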

If you're just using collection.add(item) then I wouldn't expect to see that sort of difference - synchronization isn't that slow.

If you can try to test it with different numbers of records, you could see how each version grows as n increases - that would make it easier to work out what's going on.

EDIT: If you're just using Vector.add then it sounds like something else could be going on - e.g. your database was behaving differently between your different test runs. Here's a little test application:

import java.util.*;

public class Test {
  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    Vector<String> vector = new Vector<String>();
    for (int i = 0; i < 300000; i++) {
      vector.add("dummy value");
    }
    long end = System.currentTimeMillis();
    System.out.println("Time taken: " + (end - start) + "ms");
  }
}

Output:

Time taken: 38ms

Now obviously this isn't going to be very accurate - System.currentTimeMillis isn't the best way of getting accurate timing - but it's clearly not taking 45 minutes. In other words, you should look elsewhere for the problem, if you really are just calling Vector.add(item).

Now, changing the code above to use

vector.add(0, "dummy value"); // Insert item at the beginning

makes an enormous difference - it takes 42 seconds instead of 38ms. That's clearly a lot worse - but it's still a long way from being 45 minutes - and I doubt that my desktop is 60 times as fast as yours.

Jon Skeet
There is no explicit check for uniqueness. Just inserting the values.
abhin4v
@abhin4v: Inserting the values how, exactly? Please post code.
Jon Skeet
@Jon Skeet: Some of the extra time can be explained by string pooling and/or his object just having a mean constructor.
DeadMG
@DeadMG: Well, there's 43 minutes of difference between Vector and HashSet, supposedly... and those cases would still need to do the same amount of constructor work.
Jon Skeet
@Jon Skeet: That's true, I guess.
DeadMG
I double-checked the code and found that it is indeed doing a `!Vector.contains` check before adding each value. My apologies for the hasty previous comment. So the problem is that it is trying to use a `Vector` as a `Set`.
abhin4v
@abhin4v ... GRRRRRRR!
Stephen C
That's why I suggested using a profiler; you would have seen instantly that the contains method takes about 95% of the time.
atamanroman
+2  A: 

If you are inserting them at the middle or beginning instead of at the end, then the Vector needs to shift all the subsequent elements along on every insert. The HashSet, on the other hand, doesn't care and doesn't have to do anything of the sort.

DeadMG
+2  A: 

Vector is outdated and should not be used anymore. Profile with ArrayList or LinkedList (depending on how you use the list) and you will see the difference (synchronized vs unsynchronized). Why are you using Vector in a single-threaded application at all?
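A minimal sketch of what that replacement looks like (the record value is a placeholder): ArrayList is the drop-in unsynchronized equivalent, and if thread-safety is ever genuinely needed, wrapping with Collections.synchronizedList is the usual route rather than going back to Vector.

```java
import java.util.*;

public class VectorReplacement {
  public static void main(String[] args) {
    // Drop-in replacement for Vector in single-threaded code:
    List<String> records = new ArrayList<String>();
    records.add("some record");

    // If synchronization is ever genuinely needed, wrap a modern list
    // instead of reverting to Vector:
    List<String> shared = Collections.synchronizedList(new ArrayList<String>());
    shared.add("some record");

    System.out.println(records.size() + " record(s), " + shared.size() + " shared");
  }
}
```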

atamanroman
BTW, this is a guess, since you provided no code; you should profile your tool so you can see what's so damn slow :D
atamanroman
The code was not written by me. Woes of maintaining old code...
abhin4v
+1  A: 

Under normal circumstances, it is totally implausible that inserting 300,000 records into a Vector will take 43 minutes longer than inserting the same records into a HashSet.

However, I think there is a possible explanation of what might be going on.

First, the records coming out of the database must have a very high proportion of duplicates. Or at least, they must be duplicates according to the semantics of the `equals`/`hashCode` methods of your record class.

Next, I think you must be pushing very close to filling up the heap.

So the reason the HashSet solution is so much faster is that most of the records are being discarded as duplicates by the set.add operation. By contrast, the Vector solution keeps all of the records, and the JVM spends most of its time trying to squeeze out that last 0.05% of memory by running the GC over, and over, and over.

One way to test this theory is to run the Vector version of the application with a much bigger heap.
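Another quick way to probe the theory is to measure the duplicate ratio and heap pressure directly. This is only a sketch with fabricated stand-in records (real code would use the actual records from Hibernate); the point is the shape of the check, not the numbers:

```java
import java.util.*;

public class DuplicateRatioCheck {
  public static void main(String[] args) {
    // Hypothetical stand-in for the records fetched from the database,
    // constructed with a deliberately high duplicate ratio.
    List<String> fetched = new ArrayList<String>();
    for (int i = 0; i < 300000; i++) {
      fetched.add("record-" + (i % 500));
    }

    // If the distinct count is far below the total, the HashSet version
    // retains far fewer objects than the Vector version does.
    Set<String> distinct = new HashSet<String>(fetched);
    System.out.println("total: " + fetched.size() + ", distinct: " + distinct.size());

    // Rough view of heap pressure; for the real test, re-run the Vector
    // version of the application with a larger -Xmx setting and compare.
    Runtime rt = Runtime.getRuntime();
    long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    System.out.println("approx heap used: " + usedMb + "MB of max "
        + (rt.maxMemory() / (1024 * 1024)) + "MB");
  }
}
```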


Regardless, the best way to investigate this kind of problem is to run the application under a profiler and see where all the CPU time is going.

Stephen C
A: 

According to Dr Heinz Kabutz, he said this in one of his newsletters.

The old Vector class implements serialization in a naive way. It simply does the default serialization, which writes the entire internal Object[] as-is into the stream, unused capacity included. Thus if we insert a bunch of elements into the List, then clear it, the difference between Vector and ArrayList is enormous.

import java.util.*;
import java.io.*;

public class VectorWritingSize {
  public static void main(String[] args) throws IOException {
    test(new LinkedList<String>());
    test(new ArrayList<String>());
    test(new Vector<String>());
  }

  public static void test(List<String> list) throws IOException {
    insertJunk(list);
    for (int i = 0; i < 10; i++) {
      list.add("hello world");
    }
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(baos);
    out.writeObject(list);
    out.close();
    System.out.println(list.getClass().getSimpleName() +
        " used " + baos.toByteArray().length + " bytes");
  }

  private static void insertJunk(List<String> list) {
    for(int i = 0; i<1000 * 1000; i++) {
      list.add("junk");
    }
    list.clear();
  }
}

When we run this code, we get the following output:

LinkedList used 107 bytes
ArrayList used 117 bytes
Vector used 1310926 bytes

Vector can use a staggering number of bytes when serialized. The lesson here? Don't ever use Vector as a List in objects that are Serializable. The potential for disaster is too great.

Shervin
Don't ever use Vector at all; it's only there for legacy reasons.
atamanroman
This answer is completely irrelevant. It concerns serialization, which isn't the topic.
EJP