This method is a bottleneck in my program. (Or, it is at least where most of the time is being spent.)
Its purpose is to calculate the similarity between two strings based on the way characters occur in both, or occur in one or not the other.
The code is correct, but I would like to optimize it. Do you see any inefficiencies, bad practices, or unnecessary complexities?
(Don't worry about the Gene class. The real point of interest here are the Strings carried by the Genes.)
public static int simpleScore(Gene g0, Gene g1) {
if (g0.equals(g1)) { //identical genes have distance 0
return 0;
}
String frag0 = g0.getDNA();
String frag1 = g1.getDNA();
Map<Character, Integer> charOccurencesFrag0 = new HashMap<Character, Integer>(Constants.DNA_CHARS.length);
Map<Character, Integer> charOccurencesFrag1 = new HashMap<Character, Integer>(Constants.DNA_CHARS.length);
int multSCOST = 0;
int multDCOST = 0;
int multDDCOST = 0;
for (char c : Constants.DNA_CHARS) { //look through all acceptable DNA chars
charOccurencesFrag0.put(c, StringUtils.countMatches(frag0, "" + c));
charOccurencesFrag1.put(c, StringUtils.countMatches(frag1, "" + c));
//if a char appears in one frag but not the other
if ((frag0.indexOf(c) != -1 && frag1.indexOf(c) == -1) || (frag0.indexOf(c) == -1 && frag1.indexOf(c) != -1)) {
multDDCOST += (charOccurencesFrag0.get(c) + charOccurencesFrag1.get(c));
}
//if a char appears in both frags
if (frag0.indexOf(c) != -1 && frag1.indexOf(c) != -1) {
int diff = Math.abs(charOccurencesFrag0.get(c) - charOccurencesFrag1.get(c));
multDCOST += diff;
multSCOST += Math.min(charOccurencesFrag0.get(c), charOccurencesFrag1.get(c));
}
}
return Constants.SCOST * multSCOST + Constants.DDCOST * multDDCOST + Constants.DCOST * multDCOST;
}
Updated Code, based on replies This has caused a considerable speedup. Thanks to all who commented. Anyone else have any more ideas?
public static int simpleScore(Gene g0, Gene g1) {
if (g0.equals(g1)) { //identical genes have distance 0
return 0;
}
String frag0 = g0.getDNA();
String frag1 = g1.getDNA();
int[] charOccurencesFrag0 = countOccurrences(frag0, Constants.DNA_CHARS);
int[] charOccurencesFrag1 = countOccurrences(frag1, Constants.DNA_CHARS);
int multSCOST = 0;
int multDCOST = 0;
int multDDCOST = 0;
for (int ket = 0; ket < Constants.DNA_CHARS.length; ket++) {
//if a char appears in one frag but not the other
if ((charOccurencesFrag0[ket] > 0 && charOccurencesFrag1[ket] == 0) || (charOccurencesFrag0[ket] == 0 && charOccurencesFrag1[ket] >0)){
multDDCOST += charOccurencesFrag0[ket] + charOccurencesFrag1[ket];
}
//if a char appears in both frags
if (charOccurencesFrag0[ket] != 0 && charOccurencesFrag1[ket] != 0){
int diff = Math.abs(charOccurencesFrag0[ket] - charOccurencesFrag1[ket]);
multDCOST += diff;
multSCOST += Math.min(charOccurencesFrag0[ket] , charOccurencesFrag1[ket]);
}
}
return Constants.SCOST * multSCOST + Constants.DDCOST * multDDCOST + Constants.DCOST * multDCOST;
}
// from MAK on SO
private static int[] countOccurrences(String x, char[] validDNAChars){
int[] count=new int[validDNAChars.length];
for(int i=0;i<x.length();i++){
int index=-1;
for(int j=0;j<validDNAChars.length && index<0;j++){
if (x.charAt(i)==validDNAChars[j]){
index=j;
}
}
if (index>=0) count[index]++;
}
return count;
}