views:

204

answers:

6

This method is a bottleneck in my program. (Or, it is at least where most of the time is being spent.)

Its purpose is to calculate the similarity between two strings based on the way characters occur in both, or occur in one or not the other.

The code is correct, but I would like to optimize it. Do you see any inefficiencies, bad practices, or unnecessary complexities?

(Don't worry about the Gene class. The real point of interest here are the Strings carried by the Genes.)

public static int simpleScore(Gene g0, Gene g1) {  
 if (g0.equals(g1)) { //identical genes have distance 0
  return 0;
 }

 String frag0 = g0.getDNA();
 String frag1 = g1.getDNA();

 Map<Character, Integer> charOccurencesFrag0 = new HashMap<Character, Integer>(Constants.DNA_CHARS.length);
 Map<Character, Integer> charOccurencesFrag1 = new HashMap<Character, Integer>(Constants.DNA_CHARS.length);

 int multSCOST = 0;
 int multDCOST = 0;
 int multDDCOST = 0;

 for (char c : Constants.DNA_CHARS) { //look through all acceptable DNA chars
  charOccurencesFrag0.put(c, StringUtils.countMatches(frag0, "" + c));
  charOccurencesFrag1.put(c, StringUtils.countMatches(frag1, "" + c));

  //if a char appears in one frag but not the other
  if ((frag0.indexOf(c) != -1 && frag1.indexOf(c) == -1) || (frag0.indexOf(c) == -1 && frag1.indexOf(c) != -1)) {
   multDDCOST += (charOccurencesFrag0.get(c) + charOccurencesFrag1.get(c));
  }

  //if a char appears in both frags
  if (frag0.indexOf(c) != -1 && frag1.indexOf(c) != -1) {
   int diff = Math.abs(charOccurencesFrag0.get(c) - charOccurencesFrag1.get(c));
   multDCOST += diff; 
   multSCOST += Math.min(charOccurencesFrag0.get(c), charOccurencesFrag1.get(c));
  }
 }

 return Constants.SCOST * multSCOST + Constants.DDCOST * multDDCOST + Constants.DCOST * multDCOST;
}

Updated Code, based on replies This has caused a considerable speedup. Thanks to all who commented. Anyone else have any more ideas?

public static int simpleScore(Gene g0, Gene g1) {  
 if (g0.equals(g1)) { //identical genes have distance 0
  return 0;
 }

 String frag0 = g0.getDNA();
 String frag1 = g1.getDNA();

 int[] charOccurencesFrag0 = countOccurrences(frag0, Constants.DNA_CHARS);
 int[] charOccurencesFrag1 = countOccurrences(frag1, Constants.DNA_CHARS);

 int multSCOST = 0;
 int multDCOST = 0;
 int multDDCOST = 0;

 for (int ket = 0; ket < Constants.DNA_CHARS.length; ket++) {

  //if a char appears in one frag but not the other
  if ((charOccurencesFrag0[ket] > 0 && charOccurencesFrag1[ket] == 0) || (charOccurencesFrag0[ket] == 0 && charOccurencesFrag1[ket] >0)){
   multDDCOST += charOccurencesFrag0[ket] + charOccurencesFrag1[ket];
  }

  //if a char appears in both frags
  if (charOccurencesFrag0[ket] != 0 && charOccurencesFrag1[ket] != 0){
   int diff = Math.abs(charOccurencesFrag0[ket] - charOccurencesFrag1[ket]);
   multDCOST += diff;      
   multSCOST += Math.min(charOccurencesFrag0[ket] , charOccurencesFrag1[ket]);
  }
 }

 return Constants.SCOST * multSCOST + Constants.DDCOST * multDDCOST + Constants.DCOST * multDCOST;
}

// from MAK on SO
private static int[] countOccurrences(String x, char[] validDNAChars){
 int[] count=new int[validDNAChars.length];
 for(int i=0;i<x.length();i++){
  int index=-1;
  for(int j=0;j<validDNAChars.length && index<0;j++){
   if (x.charAt(i)==validDNAChars[j]){
    index=j;
   }
  }
  if (index>=0) count[index]++;
 }
 return count;
}
+4  A: 

Well, a profiler would really be your best bet.

But, in my experience, for this kind of work a HashMap is much too slow. Especially for this domain, where there's a tiny vocabulary of possible characters, it would be better to use something like an array of ints, where the array is indexed by theChar - 'A' or something like that. (In other words, instead of occurrences.get(myChar), you'd say occurrences[myChar - 'A']).

Jonathan Feinberg
yeah, originally I just did an array like `occurences[char]`. But if a char as a value around 100, and there's only four chars in Constants.DNA_Chars, then you have 96 unused indices of the array. Isn't that inefficient?
Rosarch
If there are only four characters, then use an `enum` to represent the characters, and use (`enum Letter { A,T,C,G }`) and use the enum's `ordinal()` as an index into the array.
Jonathan Feinberg
what benefit does using an `enum` offer over a `final char[]`?
Rosarch
I don't see how you could use a `final char[]` to do what I'm suggesting. Perhaps you could sketch it out and provide the idea in a new answer?
Jonathan Feinberg
Perhaps I'm not sure what you're suggesting. In the "solution" above, I just used int[] as others have suggested. You think your way is better?
Rosarch
I'm responding to your first comment on this answer. You don't want to index by char value because that creates a sparse array. So you index by the ordinal() of an enum, which creates an array of 4, which isn't sparse.
Jonathan Feinberg
+4  A: 
  1. Profile!
  2. You are searching both String multiple times (every time you call indexOf). create two local bool variables and do the indexOf once only.
  3. don't do: charOccurencesFrag0.get(c), create a local int that holds the same value and refer to that instead of retrieving the value from a map multiple times.
Bartosz Radaczyński
I'm using Java Visual VM. Is there any way to use that to profile this?
Rosarch
http://java.sun.com/javase/6/docs/technotes/guides/visualvm/profiler.html
Bartosz Radaczyński
+2  A: 

If i'm not mis-understanding your code, you are only using the charOccurencesFrag*x* for the current character (hence all the the charOccurencesFrag*x*.get(c)) , couldn't you use a simple value for this char (one for frag and one for frag1), and remove the HashMaps completly ?

also, couldn't you use this value instead of the indexOf call (it's searching the string again...) and check if one or both values are 0 ?

Guillaume

PATRY
+3  A: 

Getting rid of HashMaps and the StringUtils.countMatches calls should give a noticeable speedup. I tried rewriting your code without them. Hard to know if it works without the rest of your code and sample data, but I think you will get the general idea.

public static int[] countOccurrences(String x, char[] validDNAChars){
    int[] count=new int[validDNAChars.length];
    for(int i=0;i<x.length();i++){
     int index=-1;
     for(int j=0;j<validDNAChars.length && index<0;j++){
      if (x.charAt(i)==validDNAChars[j]){
       index=j;
      }
     }
     if (index>=0) count[index]++;
    }
    return count;
}

public static int simpleScore(Gene g0, Gene g1) {               
       if (g0.equals(g1)) { //identical genes have distance 0
               return 0;
       }
        String frag0 = g0.getDNA();
        String frag1 = g1.getDNA();


       int[] charOccurencesFrag0=countOccurrences(frag0,Constants.DNA_CHARS);
        int[] charOccurencesFrag1=countOccurrences(frag1,Constants.DNA_CHARS);
        int multSCOST = 0;
        int multDCOST = 0;
        int multDDCOST = 0;
       //look through all acceptable DNA chars
        for(int i=0;i<Constants.DNA_CHARS.length;i++){
                //if a char appears in one frag but not the other
                if ((charOccurencesFrag0[i]>0 && charOccurencesFrag1[i]==0) || (charOccurencesFrag0[i]==0 && charOccurencesFrag1[i]>0)){
                        multDDCOST += charOccurencesFrag0[i]+charOccurencesFrag1[i];
                }

                //if a char appears in both frags
                if (charOccurencesFrag0[i]!=0 && charOccurencesFrag1[i]!=0){
                        int diff = Math.abs(charOccurencesFrag0[i]-charOccurencesFrag1[i]);
                        multDCOST += diff;      
                        multSCOST += Math.min(charOccurencesFrag0[i],charOccurencesFrag1[i]);
                }
        }

        return Constants.SCOST * multSCOST + Constants.DDCOST * multDDCOST + Constants.DCOST * multDCOST;
}
MAK
the profiler does show that StringUtil.countMatches takes up a lot of time. How would rolling my own be faster?
Rosarch
I don't think coding your own countMatches is going to be much faster than the one in StringUtil, certainly not for the single char searches used here. What I did was to not search for chars one by one but compute the counts for all chars in one pass.
MAK
A: 

Since you only have four different "things" to count and search for, consider mapping them to the integers zero to three in your logic. This allows you to use plain arrays instead of maps. Additionally your many indexOf might simplify.

Thorbjørn Ravn Andersen
I don't want to tie myself down to only four "things"...
Rosarch
Optimization usually exploits your knowledge of the domain. Here you know that you have relatively few valid characters which allow you to assume that a mapping to an index would be cheap.
Thorbjørn Ravn Andersen
or, use an enum. if you want to add things, add a letter to the enum, and instead of using magic numbers to set array sizes and the like, go off enum.values().length.
Carl
A: 

While it is OK to want to generalize to different 'DNA characters', you don't want this to affect performance. The following code uses a precomputed mapping to remove the need for an inner loop in countOccurrences.

static char[] validDNAChars = ...
static int[] charMapping = new int[128];  // (assumes DNA chars are ASCII)
static {
    for (int i = 0; i < charMapping.length; i++) {
        charMapping[i] = -1;
    }
    for (int j = 0; j < validDNAChars; j++) {
        charMapping[validDNAChars.charAt(j)] = j;
    }
}

...

private static int[] countOccurrences(String x){
    int[] counts = new int[validDNAChars.length];
    int len = x.length();
    for (int i = 0; i < x.length(); i++) {
            int index = charMapping[x.getChar(i)];
            if (index >= 0) count[index]++;
    }
    return counts;
}

Another possible optimization would be to make x a char[] rather than a String.

But it is likely that the biggest gain can be achieved by looking at the way that simpleScore is used. The simpleScore method compares two gene fragments, but I strongly suspect that you are using it to compare a set of X fragments against a set of Y where X and Y are significantly large. Assuming this is the case, you can get a massive speed up by reorganizing your code so that countOccurrences is called exactly once for each distinct fragment. This will reduces the complexity from O(N**2) to O(N) where N is max(X, Y). That's a massive speedup for large values of N.

Stephen C