I am connecting to a sockets API that is very inflexible. It will return rows such as:

NAME, CITY, STATE, JOB, MONTH

But will have duplicates because it does not do any aggregation. I need to count the duplicate rows (which would be very easy in SQL, but not, as far as I know, in Java).

Example source data:

    NAME,    CITY, STATE,          JOB,  MONTH
John Doe,  Denver,    CO, INSTALLATION, 090301
John Doe,  Denver,    CO, INSTALLATION, 090301
John Doe,  Denver,    CO, INSTALLATION, 090301
Jane Doe, Phoenix,    AZ,      SUPPORT, 090301

Intended:

    NAME,    CITY, STATE,          JOB,  MONTH, COUNT
John Doe,  Denver,    CO, INSTALLATION, 090301,   3
Jane Doe, Phoenix,    AZ,      SUPPORT, 090301,   1

I can easily do this for approximately 100,000 return rows, but I am dealing with about 60 million in a month. Any ideas?

Edit: Unfortunately, the rows are not returned sorted, nor is there an option through the API to sort them. I get this giant mess of stuff that needs to be aggregated. Right now I use an ArrayList and call indexOf(new row) to check whether the item already exists, but it gets slower the more rows there are.

Edit: For clarification, this would only need to be run once a month, at the end of the month. Thank you for all of the responses.

+1  A: 

Do you have the flexibility, or is this an important enough task, to invest in something like Hadoop? With that size of data, you want to start thinking about it in terms of the "map-reducy" mindset.
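In that mindset, counting duplicate rows is just word count with the whole CSV line playing the part of the word. A sketch of the two Hadoop classes involved (new MapReduce API; the class names and job wiring are illustrative, not from this answer):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The mapper emits (row, 1) for every input line; the reducer sums
// the 1s per distinct row, yielding exactly the COUNT column wanted.
public class RowCount {

    public static class RowMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(line, ONE);   // every occurrence of the row counts once
        }
    }

    public static class RowReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text row, Iterable<LongWritable> ones, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable one : ones) {
                sum += one.get();
            }
            context.write(row, new LongWritable(sum));   // row plus its count
        }
    }
}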

Scanningcrew
A: 

Are the rows always returned sorted? I.e., are the rows to be grouped always returned one after another? If the answer is yes:

1) Initialize a counter.

2) Keep track of the previous row that you just read and compare it to the current row. If it's the same, increment your counter. If it's different, record the previous row along with the current counter value, then reset the counter to 1 for the new row.

3) When you reach the last record, make sure to record the row with the current count.

This strategy will allow you to read in the large data sets in a stream and keep your program's memory footprint low while producing the more compact aggregate data you're after.
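A minimal sketch of that loop, assuming a Row class with a proper equals() method; the hasNextRow()/getNextRow()/writeAggregate() helpers are made-up stand-ins for the API and the output:

Row previous = null;
int count = 0;

while (hasNextRow()) {                    // assumed: more rows from the API?
    Row current = getNextRow();           // assumed: read the next row
    if (previous != null && !previous.equals(current)) {
        writeAggregate(previous, count);  // the group just ended; record it
        count = 0;
    }
    previous = current;
    count++;
}
if (previous != null) {
    writeAggregate(previous, count);      // record the final group
}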

Asaph
+3  A: 

You could use a HashMap, keyed by the row itself, to look up a previously seen row with the same contents (assuming your Row objects have proper hashCode() and equals() methods implemented). A HashSet alone won't do, because you can't retrieve the stored element to increment its count.

Something like this perhaps:

Map<Row, Row> previousRows = new HashMap<Row, Row>();
List<Row> rowsInOrder = new LinkedList<Row>();

Then in use (assuming further that you have an incrementCount() method on the Row class):

Row newRow = getNextRow();
Row existing = previousRows.get(newRow);
if (existing == null) {
    previousRows.put(newRow, newRow);   // key and value are the same object
    rowsInOrder.add(newRow);
    existing = newRow;
}
existing.incrementCount();

If you don't care about the order in which the rows came in, you can get rid of the List and just keep the Map.
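For reference, a Row class along these lines might look like the following sketch; the five fields come from the question, while the constructor and count bookkeeping are assumptions:

public class Row {
    private final String name, city, state, job, month;
    private int count = 0;   // how many times this row has been seen

    public Row(String name, String city, String state, String job, String month) {
        this.name = name;
        this.city = city;
        this.state = state;
        this.job = job;
        this.month = month;
    }

    public void incrementCount() { count++; }
    public int getCount() { return count; }

    // Equality is defined over the five API fields only, never the count,
    // so duplicate rows collide in the HashMap as intended.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Row)) return false;
        Row r = (Row) o;
        return name.equals(r.name) && city.equals(r.city)
                && state.equals(r.state) && job.equals(r.job)
                && month.equals(r.month);
    }

    @Override
    public int hashCode() {
        int h = name.hashCode();
        h = 31 * h + city.hashCode();
        h = 31 * h + state.hashCode();
        h = 31 * h + job.hashCode();
        h = 31 * h + month.hashCode();
        return h;
    }
}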

Suppressingfire
Almost exactly what I was going to suggest.
Skip Head
Considering the size of the data sets the OP needs to deal with, storing all the rows in a `HashSet` just might lead to an `OutOfMemoryError`.
Asaph
I will try this on Monday... It looks like what I was looking for. My biggest problem was that it got slower as it went along... a HashSet should be much faster...
jle
@Asaph, in the OP, jle mentions that they're already being stored in an ArrayList. Switching to a HashSet shouldn't introduce any significant memory overhead, and storing the references in both the HashSet and ArrayList shouldn't add much per-row overhead if order is required. If it's necessary to start looking at a solution which externalizes the data due to memory constraints, I'd suggest using an embedded database like SQLite or Derby as an intermediate store.
Suppressingfire
I ended up using a Dictionary/HashTable instead of a HashSet, but you definitely had the right idea.
jle
+1  A: 

Are you able to fit all the data in memory at once? If you are putting it in an ArrayList, it sounds like you can.

If that is the case, you can just use an implementation of Multiset, such as the one in Google Collections.

Then you could just insert all your rows into the multiset as follows:

Multiset<Row> rowMultiset = HashMultiset.create();
for (Row row: rows) {
  rowMultiset.add(row);
}

And you can iterate through, with a count, using something like:

for (Multiset.Entry<Row> entry : rowMultiset.entrySet()) {
  System.out.println("row: "+entry.getElement()+", count: "+entry.getCount());
}

If you don't want to use an external library, you can do something similar using a HashMap mapping rows to integers.
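That HashMap version might look something like this (a sketch, assuming the same Row objects with proper equals()/hashCode() implementations):

Map<Row, Integer> counts = new HashMap<Row, Integer>();
for (Row row : rows) {
    Integer seen = counts.get(row);
    counts.put(row, seen == null ? 1 : seen + 1);   // first sighting counts as 1
}

for (Map.Entry<Row, Integer> entry : counts.entrySet()) {
    System.out.println("row: " + entry.getKey() + ", count: " + entry.getValue());
}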

If it is NOT the case that all your rows fit into memory, I think the simplest approach is just to insert the data into a database and do a query. Databases are designed and optimized for large datasets which don't fit into memory.

Chi
A: 

I can think of four ways to do this:

  • If you have enough memory to hold representations of the 60 million rows (minus duplicates), use a HashMap<Row, Integer> to represent the counts.

  • Store the rows in an RDB, and then use SQL to aggregate and count.

  • Write the rows to a big file, apply a classical external merge sort to it, and then count the rows in a single pass.

  • Use something like Hadoop to spread the rows across multiple machines.

The fact that you are expecting to be accumulating counts over the period of a month or more suggests that you need to consider the possibility that your application will need to be restarted. That suggests that an RDB or file-based solution is required.
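For the RDB route, an embedded database keeps everything in-process; here is a sketch using Apache Derby over JDBC (the table name, column names, and sizes are invented for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DbRowCounter {
    public static void main(String[] args) throws SQLException {
        // Embedded Derby database, created in the working directory on first run.
        // (On pre-JDBC-4 JVMs, load org.apache.derby.jdbc.EmbeddedDriver first.)
        Connection conn = DriverManager.getConnection("jdbc:derby:rowdb;create=true");

        Statement st = conn.createStatement();
        st.executeUpdate("CREATE TABLE api_row (name VARCHAR(100), city VARCHAR(100), "
                + "state CHAR(2), job VARCHAR(50), month_code CHAR(6))");

        PreparedStatement ins = conn.prepareStatement(
                "INSERT INTO api_row VALUES (?, ?, ?, ?, ?)");
        // For each row from the API: bind the five fields with ins.setString(...),
        // call ins.addBatch(), and flush with ins.executeBatch() every few
        // thousand rows so memory use stays flat.

        // Let the database do the aggregation -- the easy SQL from the question.
        ResultSet rs = st.executeQuery(
                "SELECT name, city, state, job, month_code, COUNT(*) "
                + "FROM api_row GROUP BY name, city, state, job, month_code");
        while (rs.next()) {
            System.out.println(rs.getString(1) + ", " + rs.getString(2) + ", "
                    + rs.getString(3) + ", " + rs.getString(4) + ", "
                    + rs.getString(5) + ", " + rs.getInt(6));
        }
        rs.close();
        st.close();
        conn.close();
    }
}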

Stephen C