views:

362

answers:

9

Hello all!

Do you know of any time efficient way to remove duplicated values from a very big integer array using Java? The size of the array depends on the logged in user, but will always exceed 1500000 unsorted values with some duplicates. Every integer contains a number between 100000 and 9999999.

I tried converting it to a List, but the heap on my server doesn't allow this amount of data(my ISP has restricted it). And a regular for loop within a for loop takes over 5 minutes to calculate.

The size of the array without the duplicates is the one I will store in my database.

Help would be appreciated!

+3  A: 

I would make a hashset where I store all values contained in the list, before i start adding items to the list. Then just check so that the hashset doesn't contain the value you want to add.

Marcus Johansson
"I tried converting it to a List, but the heap on my server doesn't allow this amount of data" - that probably rules out Sets as well.
Péter Török
In my mind a list is a bit more wastefull with memory than a hashset, for large datasets. But i might be wrong. =/
Marcus Johansson
That largely depends on the list implementation. I believe `ArrayList` is more memory-efficient than `HashSet`, but I might be wrong too :-)
Péter Török
+4  A: 
Set<Integer> set = new HashSet<Integer>();
Collections.addAll(set, array);

you will just need an array of Integer[] instead of int[].

Bozho
"I tried converting it to a List, but the heap on my server doesn't allow this amount of data" - that probably rules out Sets as well.
Péter Török
Yes, that's more to the point. @user435140 note that this will only work if your array holds `Integer`'s, not primitive `int`'s.
Bart Kiers
@Bart K. good point
Bozho
+2  A: 

You can try sorting the array first:

int arr[] = yourarray;
Arrays.sort(arr);
// then iterate arr and remove duplicates
Pablo Santa Cruz
remove duplicates how?
Bozho
@Bozho he could iterate the array and count unique values. Apparently is the only thing he needs to do *...The size of the array without the duplicates is the one I will store in my database...*
Pablo Santa Cruz
By sorting first, you can then do a final traversal of the array and only keep one of each unique value. That should give a complexity of O(n log n) as opposed to O(n^2) for the double loop mentioned.
Mike Houston
Assuming you have sufficient resources to sort the thing in the first place!
dty
@Danny, Arrays.sort(...) does not use more space: it sorts "in place".
Bart Kiers
good, +1 (15chrs)
Bozho
@Bart K - this depends on your implementation, but the JDK doesn't guarantee an in-place sort. Many actually use a form of mergesort which requires O(n) extra space.
mikera
Yes, but you're assuming there's enough memory to load it in the first place. Or else, you'll have to implement a disk-sort algorithm. (Granted the OP says they already have an array - I'm just pointing out possible problems with the solution in general)
dty
@mikera, true. Next time I'll explicitly mention that I am talking about Sun/Orcale's JVM. Out of curiosity, what JVM does not sort a primitive array in place?
Bart Kiers
@Danny, from user435140's original post, I concluded that s/he already has the array and that the "converting to a list" produces the heap problem.
Bart Kiers
@mikera, in case you're still looking for those JVM implementations, don't bother :)
Bart Kiers
@Bart K - actually I was thinking about object / comparable sorts rather than primitives - you are right the current primitive sorts in JDK6/7 are in-place. Not sure I know any counter-examples for primitives but I certainly wouldn't rely on the behaviour....
mikera
A: 

Perhaps you could make a handful of passes over the data? For example, if you made ten passes over the data and applied one of the set suggestions above to a smaller subset of the data (say, when value mod pass# == 0). Thus:

for (int i = 0 to 9) {
  set = new Set()
  for (each entry in the data set) {
    if (entry % i == 0) {
      set.add(entry)
    }
  }
  output set
}

This way you will trade off time for memory (increase the number of passes for less memory/more time and vice-versa).

dty
+28  A: 

Also, you could perhaps use a bit set? I don't know how efficient Java's BitSet is. But 9999999 possible values would only take 9999999/8 = 1250000 bytes = just over 1Mb. As you walk the array of values, set the corresponding bit to true. Then you can walk over the bit set and output the corresponding value whenever you find a bit set to true.

1Mb will fit in a CPU cache, so this could be quite efficient depending on the bit set implementation.

This also has the side-effect of sorting the data too.

And... this is an O(n) algorithm since it requires a single pass over the input data, the set operations are O(1) (for an array-based set like this), and the output pass is also O(m) where m is the number of unique values and, by definition, must be <= n.

dty
clever :) worth a try
Bozho
+1 great answer.
YoK
Clever answers like these are the reason why I come to StackOverflow
Anthony Forloney
Excellent point. Thank you for the correction.
dty
A: 

Maybe a hash set that works with primitives instead of objects will do the job? There are free implementations (havn't used them before but maybe it works):

http://trove4j.sourceforge.net/

http://trove4j.sourceforge.net/javadocs/gnu/trove/TIntHashSet.html

Would then look like:

int[] newArray = new TIntHashSet(yourArray).toArray();
Arne
A: 
int[] a;
Arrays.sort(a);
int j = 0;
for (int i = 1; i < a.length; ++i) {
  if (a[i] != a[j]) {
    ++j;
    a[j] = a[i];
  }
}
// now store the elements from 0 to j (inclusive - i think)
Tom Anderson
If the result doesn't need to be sorted you can copy values from the "start" (which increments when copied) to reduce the number of copies. (one per duplicate instead of one per element)
Peter Lawrey
A: 

If you are sure, that integers have resonable small values (e.g. always more than zero and less than 1000 or 10000), you can try a trick like this:

    final int MAX = 100; 
    int[] arrayWithRepeats = {99, 0, 10, 99, 0, 11, 99};

    //we are counting here integers with the same value
    int [] arrayOfValues = new int[MAX+1];
    int countOfUniqueIntegers = 0;
    for(int i : arrayWithRepeats) {
        if(arrayOfValues[i] == 0) {
            countOfUniqueIntegers++;
        }
        arrayOfValues[i]++;
    }

    // you can use arrayOfValues (smaller) or convert it
    // to table of unique values (more usable)

    int[] arrayOfUniqueValues = new int[countOfUniqueIntegers];
    int index = 0;
    for(int i = 0; i<arrayOfValues.length; i++) {
        if(arrayOfValues[i] != 0) {
            arrayOfUniqueValues[index] = i;
            index++;
        }
    }

    //and now arrayOfUniqueValues is even sorted
    System.out.println( Arrays.toString(arrayOfUniqueValues) );

Output: [0, 10, 11, 99]

Jan Wegner
This is essentially the same as my bit set suggestion, except you're using 32 bits per entry instead of 1, so memory becomes an issue quite quickly. Also, the OP said that the values will be up to 9999999.
dty
Since "Every integer contains a number between 100000 and 9999999" this will not work.
emory
You are right. And good idea is to change arrayOfValues form int[] to BitSet as Danny's idea.
Jan Wegner
A: 

The truly desperate could write the array to disk and fork off sort | uniq | wc -l <infile.txt and capture the output. This would be needed if memory was still too tight or the domain space of integers got larger. I don't like this (is he even running unix!) but my point is that there are many ways to accomplish the task.

Another observation is that the minimum value is 100,000. So we could subtract 100,000 from the maximum value of 9,999,999, reducing the domain space and thus saving some memory. Perhaps 100k/8 bits is peanuts in the scheme of things, but it's essentially free to do it.

Tony Ennis