I have an input txt file containing data in the form of records (each row is a record, much like a row in a DB table), and I need to find duplicate values. For example:

Rec1: ACCOUNT_NBR_1*NAME_1*VALUE_1
Rec2: ACCOUNT_NBR_2*NAME_2*VALUE_2
Rec3: ACCOUNT_NBR_1*NAME_3*VALUE_3

In the above set, Rec1 and Rec3 are considered duplicates because their ACCOUNT NUMBERS are the same (ACCOUNT_NBR_1).

Note: The input file shown above is a delimited file (the delimiter being *); however, the file can also be a fixed-length file, where each column starts and ends at specified positions.
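Either layout reduces to the same step: extracting the key column (the account number) from each line. A minimal sketch, where the class name, the field position, and the fixed-length offsets are assumptions to adjust to the real layout:

```java
public class KeyExtractor {
    // Delimited record: the account number is the first '*'-separated field.
    static String keyFromDelimited(String line) {
        return line.split("\\*", 2)[0];
    }

    // Fixed-length record: the account number occupies an assumed column
    // range, e.g. characters [start, end) -- adjust to the real layout.
    static String keyFromFixedLength(String line, int start, int end) {
        return line.substring(start, Math.min(end, line.length())).trim();
    }

    public static void main(String[] args) {
        System.out.println(keyFromDelimited("ACCOUNT_NBR_1*NAME_1*VALUE_1"));
        System.out.println(keyFromFixedLength("ACCOUNT_NBR_1 NAME_1 VALUE_1", 0, 13));
    }
}
```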

I am currently doing this with the following logic:

Loop through each ACCOUNT NUMBER
  Loop through each line of the txt file and check if the account number is repeated.
  If repeated, record it in a hashtable.
  End
End

And I am using the 'Pattern' and 'BufferedReader' Java APIs to perform the above task.

But since this takes a long time (the nested loops make it roughly O(n²) in the number of records), I would like to know a better way of handling it.

Thanks, Shibu

+3  A: 

Keep a HashMap of {account_number, occurrences} in memory (initially empty), and traverse the file only once, setting or incrementing (in the HashMap) the number of occurrences of each account number you encounter during the traversal.

If you also have to print full information about the duplicate account numbers, then perform a second traversal of the input file, this time printing full details about each account number where the corresponding number of occurrences in the HashMap exceeded 1 during the previous traversal.
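The two traversals described above can be sketched as follows. This is a minimal illustration over an in-memory list; the class and method names are made up, and for a real file you would feed the same logic line by line from a BufferedReader, opening the file once per pass:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateFinder {
    // Extract the account number: the first '*'-delimited field.
    static String accountNumber(String line) {
        int star = line.indexOf('*');
        return star >= 0 ? line.substring(0, star) : line;
    }

    // Pass 1: count occurrences of each account number.
    static Map<String, Integer> countAccounts(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            counts.merge(accountNumber(line), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
            "ACCOUNT_NBR_1*NAME_1*VALUE_1",
            "ACCOUNT_NBR_2*NAME_2*VALUE_2",
            "ACCOUNT_NBR_1*NAME_3*VALUE_3");

        Map<String, Integer> counts = countAccounts(records);

        // Pass 2: print full details of every record whose
        // account number occurred more than once.
        for (String line : records) {
            if (counts.get(accountNumber(line)) > 1) {
                System.out.println("Duplicate: " + line);
            }
        }
    }
}
```

Each pass touches every line exactly once, so the whole job is linear in the file size instead of quadratic.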

In terms of memory usage, even if all account numbers in a 500k-line file are distinct, you will only require roughly 1M integers of storage (one key and one count per entry, assuming account numbers are integers) plus HashMap overhead, which should all fit comfortably in a few megabytes of memory.

Cheers, V.

vladr
Thanks V. I was concerned about the memory usage of this approach; since, as you say, the HashMap with 500K records (int values) will fit in a few MBs of memory, I will go ahead with it.
Shibu