ansaurus

Question

How to search for duplicate values in a huge text file with around half a million records?

Answer 1

+3 A:

Keep a HashMap of {account_number, occurrences} in memory (initially empty), and traverse the file only once, setting or incrementing (in the HashMap) the number of occurrences of each account number you encounter during the traversal.

If you also have to print full information about the duplicate account numbers, then perform a second traversal of the input file, this time printing full details about each account number where the corresponding number of occurrences in the HashMap exceeded 1 during the previous traversal.
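A minimal Java sketch of the two traversals described above, offered as an illustration rather than a definitive implementation. It assumes the account number is the first comma-separated field of each record (the question does not state the file layout), and it keeps account numbers as strings so that leading zeros survive. The class name `DuplicateFinder` and its method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class DuplicateFinder {

    // Assumption: the account number is the first comma-separated field of a record.
    static String accountNumber(String line) {
        int comma = line.indexOf(',');
        return (comma < 0 ? line : line.substring(0, comma)).trim();
    }

    // First traversal: count how often each account number occurs.
    static Map<String, Integer> countOccurrences(Stream<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        lines.filter(line -> !line.isBlank())
             .forEach(line -> counts.merge(accountNumber(line), 1, Integer::sum));
        return counts;
    }

    // Second traversal: keep only records whose account number occurred more than once.
    static List<String> duplicateRecords(Stream<String> lines, Map<String, Integer> counts) {
        return lines.filter(line -> !line.isBlank())
                    .filter(line -> counts.getOrDefault(accountNumber(line), 0) > 1)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DuplicateFinder <records-file>");
            return;
        }
        Path file = Path.of(args[0]);
        Map<String, Integer> counts;
        // The file is opened twice, once per traversal, so it is never held in memory whole.
        try (Stream<String> lines = Files.lines(file)) {
            counts = countOccurrences(lines);
        }
        try (Stream<String> lines = Files.lines(file)) {
            duplicateRecords(lines, counts).forEach(System.out::println);
        }
    }
}
```

Only the map of counts stays in memory between the two traversals. If the full details are not needed, the second traversal can be skipped and the map entries with a count above 1 printed directly.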

In terms of memory usage, even if all account numbers in a 500k-line file are distinct, you will only need storage for roughly 1M integers (500k account numbers plus 500k counts, assuming account numbers are integers; about 4 MB as raw 4-byte values) plus HashMap overhead, which should all fit comfortably in a few megabytes of memory.

Cheers, V.

vladr 2010-04-08 05:15:36

Thanks V. I was quite concerned about the memory usage of the above approach. Since, as you say, the HashMap with 500K records (int values) will fit in a few MB of memory, I will go ahead with this approach.

Shibu 2010-04-08 07:36:34
