I am connecting to a sockets API that is very inflexible. It will return rows such as:

NAME, CITY, STATE, JOB, MONTH

But will have duplicates because it does not do any aggregation. I need to count the duplicate rows (which would be very easy in SQL, but not, as far as I know, in Java).

Example source data:

    NAME,    CITY, STATE,          JOB,  MONTH
John Doe,  Denver,    CO, INSTALLATION, 090301
John Doe,  Denver,    CO, INSTALLATION, 090301
John Doe,  Denver,    CO, INSTALLATION, 090301
Jane Doe, Phoenix,    AZ,      SUPPORT, 090301

Intended:

    NAME,    CITY, STATE,          JOB,  MONTH, COUNT
John Doe,  Denver,    CO, INSTALLATION, 090301,   3
Jane Doe, Phoenix,    AZ,      SUPPORT, 090301,   1

I can easily do this for approximately 100,000 return rows, but I am dealing with about 60 million in a month. Any ideas?

Edit: Unfortunately, the rows are not returned sorted, nor is there an option through the API to sort them. I get this giant mess of stuff that needs to be aggregated. Right now I use an ArrayList and call indexOf(new row) to check whether the item already exists, but it gets slower the more rows there are.

Edit: For clarification, this would only need to be run once a month, at the end of the month. Thank you for all of the responses.

+1  A: 

Do you have the flexibility, or is this an important enough task, to invest in something like Hadoop? With that size of data, you want to start thinking about it in terms of the "map-reducy" mindset.
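In that mindset, counting duplicate rows is just word count with the whole CSV line playing the part of the word. A sketch of the two Hadoop classes involved (new MapReduce API; the class names and job wiring are illustrative, not from this answer):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The mapper emits (row, 1) for every input line; the reducer sums
// the 1s per distinct row, yielding exactly the COUNT column wanted.
public class RowCount {

    public static class RowMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(line, ONE);   // every occurrence of the row counts once
        }
    }

    public static class RowReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text row, Iterable<LongWritable> ones, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable one : ones) {
                sum += one.get();
            }
            context.write(row, new LongWritable(sum));   // row plus its count
        }
    }
}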

Scanningcrew
A: 

Are the rows always returned sorted? I.e., are the rows to be grouped always returned one after another? If the answer is yes:

1) Initialize a counter.

2) Keep track of the previous row that you just read and compare it to the current row. If it's the same, increment your counter. If it's different, record the previous row along with the current counter value, then reset the counter to 1 for the new row.

3) When you reach the last record, make sure to record the row with the current count.

This strategy will allow you to read in the large data sets in a stream and keep your program's memory footprint low while producing the more compact aggregate data you're after.
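A minimal sketch of that loop, assuming a Row class with a proper equals() method; the hasNextRow()/getNextRow()/writeAggregate() helpers are made-up stand-ins for the API and the output:

Row previous = null;
int count = 0;

while (hasNextRow()) {                    // assumed: more rows from the API?
    Row current = getNextRow();           // assumed: read the next row
    if (previous != null && !previous.equals(current)) {
        writeAggregate(previous, count);  // the group just ended; record it
        count = 0;
    }
    previous = current;
    count++;
}
if (previous != null) {
    writeAggregate(previous, count);      // record the final group
}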

Asaph
+3  A: 

You could use a HashMap, keyed by the row itself, to look up a previously seen row with the same contents (assuming your Row objects have proper hashCode() and equals() methods implemented). A HashSet alone won't do, because you can't retrieve the stored element to increment its count.

Something like this perhaps:

Map<Row, Row> previousRows = new HashMap<Row, Row>();
List<Row> rowsInOrder = new LinkedList<Row>();

Then in use (assuming further that you have an incrementCount() method on the Row class):

Row newRow = getNextRow();
Row existing = previousRows.get(newRow);
if (existing == null) {
    previousRows.put(newRow, newRow);   // key and value are the same object
    rowsInOrder.add(newRow);
    existing = newRow;
}
existing.incrementCount();

If you don't care about the order in which the rows came in, you can get rid of the List and just keep the Map.
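For reference, a Row class along these lines might look like the following sketch; the five fields come from the question, while the constructor and count bookkeeping are assumptions:

public class Row {
    private final String name, city, state, job, month;
    private int count = 0;   // how many times this row has been seen

    public Row(String name, String city, String state, String job, String month) {
        this.name = name;
        this.city = city;
        this.state = state;
        this.job = job;
        this.month = month;
    }

    public void incrementCount() { count++; }
    public int getCount() { return count; }

    // Equality is defined over the five API fields only, never the count,
    // so duplicate rows collide in the HashMap as intended.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Row)) return false;
        Row r = (Row) o;
        return name.equals(r.name) && city.equals(r.city)
                && state.equals(r.state) && job.equals(r.job)
                && month.equals(r.month);
    }

    @Override
    public int hashCode() {
        int h = name.hashCode();
        h = 31 * h + city.hashCode();
        h = 31 * h + state.hashCode();
        h = 31 * h + job.hashCode();
        h = 31 * h + month.hashCode();
        return h;
    }
}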

Suppressingfire
Almost exactly what I was going to suggest.
Skip Head
Considering the size of the data sets the OP needs to deal with, storing all the rows in a `HashSet` just might lead to an `OutOfMemoryError`.
Asaph
I will try this on Monday... It looks like what I was looking for. My biggest problem was that it got slower as it went along... a HashSet should be much faster...
jle
@Asaph, in the OP, jle mentions that they're already being stored in an ArrayList. Switching to a HashSet shouldn't introduce any significant memory overhead, and storing the references in both the HashSet and ArrayList shouldn't add much per-row overhead if order is required. If it's necessary to start looking at a solution which externalizes the data due to memory constraints, I'd suggest using an embedded database like SQLite or Derby as an intermediate store.
Suppressingfire
I ended up using a Dictionary/HashTable instead of a HashSet, but you definitely had the right idea.
jle
+1  A: 

Are you able to fit all the data in memory at once? If you are putting it in an ArrayList, it sounds like you can.

If that is the case, you can just use an implementation of Multiset, such as the one in Google Collections.

Then you could just insert all your rows into the multiset as follows:

Multiset<Row> rowMultiset = HashMultiset.create();
for (Row row: rows) {
  rowMultiset.add(row);
}

And you can iterate through, with a count, using something like:

for (Multiset.Entry<Row> entry : rowMultiset.entrySet()) {
  System.out.println("row: "+entry.getElement()+", count: "+entry.getCount());
}

If you don't want to use an external library, you can do something similar using a HashMap mapping rows to integers.
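That HashMap version might look something like this (a sketch, assuming the same Row objects with proper equals()/hashCode() implementations):

Map<Row, Integer> counts = new HashMap<Row, Integer>();
for (Row row : rows) {
    Integer seen = counts.get(row);
    counts.put(row, seen == null ? 1 : seen + 1);   // first sighting counts as 1
}

for (Map.Entry<Row, Integer> entry : counts.entrySet()) {
    System.out.println("row: " + entry.getKey() + ", count: " + entry.getValue());
}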

If it is NOT the case that all your rows fit into memory, I think the simplest approach is just to insert the data into a database and do a query. Databases are designed and optimized for large datasets which don't fit into memory.

Chi
A: 

I can think of four ways to do this:

  • If you have enough memory to hold representations of the 60 million rows (minus duplicates), use a HashMap<Row, Integer> to represent the counts.

  • Store the rows in an RDB, and then use SQL to aggregate and count.

  • Write the rows to a big file, apply a classical external merge sort to it, and then count the rows in a single pass.

  • Use something like Hadoop to spread the rows across multiple machines.

The fact that you are expecting to be accumulating counts over the period of a month or more suggests that you need to consider the possibility that your application will need to be restarted. That suggests that an RDB or file-based solution is required.
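For the RDB route, an embedded database keeps everything in-process; here is a sketch using Apache Derby over JDBC (the table name, column names, and sizes are invented for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DbRowCounter {
    public static void main(String[] args) throws SQLException {
        // Embedded Derby database, created in the working directory on first run.
        // (On pre-JDBC-4 JVMs, load org.apache.derby.jdbc.EmbeddedDriver first.)
        Connection conn = DriverManager.getConnection("jdbc:derby:rowdb;create=true");

        Statement st = conn.createStatement();
        st.executeUpdate("CREATE TABLE api_row (name VARCHAR(100), city VARCHAR(100), "
                + "state CHAR(2), job VARCHAR(50), month_code CHAR(6))");

        PreparedStatement ins = conn.prepareStatement(
                "INSERT INTO api_row VALUES (?, ?, ?, ?, ?)");
        // For each row from the API: bind the five fields with ins.setString(...),
        // call ins.addBatch(), and flush with ins.executeBatch() every few
        // thousand rows so memory use stays flat.

        // Let the database do the aggregation -- the easy SQL from the question.
        ResultSet rs = st.executeQuery(
                "SELECT name, city, state, job, month_code, COUNT(*) "
                + "FROM api_row GROUP BY name, city, state, job, month_code");
        while (rs.next()) {
            System.out.println(rs.getString(1) + ", " + rs.getString(2) + ", "
                    + rs.getString(3) + ", " + rs.getString(4) + ", "
                    + rs.getString(5) + ", " + rs.getInt(6));
        }
        rs.close();
        st.close();
        conn.close();
    }
}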

Stephen C