views:

64

answers:

1

Can you please let me know on the quickest and efficient way to compare a large set of values. Its like there are a list of parent codes(string) and each code has a series of child values(string). The child lists have to be compared with each other and find out duplicates and count how many times they repeat.

code1(code1_value1, code1_value2, code3_value3, ..., code1_valueN);
code2(code2_value1, code1_value2, code2_value3, ..., code2_valueN);
code3(code2_value1, code3_value2, code3_value3, ..., code3_valueN);
    .
    .
    .
codeN(codeN_value1, codeN_value2, codeN_value3, ..., codeN_valueN);

The lists are huge say like there are 100 parent codes and each has about 250 values in them. There will not be duplicates within a code list. Doing it in java and the solution i could figure out is.

  • Store the values of first set of code in as codeMap.put(codeValue, duplicateCount). The count initialized to 0.
  • Then compare the rest of the values with this. If its in the map then increment the count otherwise append it to the map.

The downfall of this is to get the duplicates. Another iteration needs to be performed on a very large list.

An alternative is to maintain another hashmap for duplicates like duplicateCodeMap.put(codeValue, duplicateCount) and change the initial hashmap to codeMap.put(codeValue, codeValue).

Speed is what is requirement. Hope one of you can help me with it.

+1  A: 

You want to use a Map<String,Set<String>>, e.g. for each child code, what is the set of parent codes that has it.

That is, you want a Multimap, essentially, which is available from Guava.

Here's a sample to illustrate the idea:

import java.util.*;
public class MultiMap {
    public static void main(String[] args) {
        String[] codes = {
            "A=1,2,3,4",
            "B=1,3,5,9",
            "C=2,5,7,8",
        };
        Map<String,Set<String>> map = new HashMap<String,Set<String>>();
        Set<String> dupes = new HashSet<String>();
        for (String code : codes) {
            String parent = code.split("=")[0];
            for (String child : code.split("=")[1].split(",")) {
                Set<String> set = map.get(child);
                if (set == null) {
                    map.put(child, set = new HashSet<String>());
                } else {
                    dupes.add(child);
                }
                set.add(parent);
            }
        }
        System.out.println(map);
        // {3=[A, B], 2=[A, C], 1=[A, B], 7=[C], 5=[B, C], 4=[A], 9=[B], 8=[C]}
        for (String child : dupes) {
            System.out.println(child + "=" + map.get(child));
        }
        // 3=[A, B]
        // 2=[A, C]
        // 1=[A, B]
        // 5=[B, C]     
    }
}
polygenelubricants