ansaurus

Question

Java: Calculating a metric of String simliarity

Answer 1

+4 A:

Well, a profiler would really be your best bet.

But, in my experience, for this kind of work a HashMap is much too slow. Especially for this domain, where there's a tiny vocabulary of possible characters, it would be better to use something like an array of ints, where the array is indexed by theChar - 'A' or something like that. (In other words, instead of occurrences.get(myChar), you'd say occurrences[myChar - 'A']).

Jonathan Feinberg 2009-10-17 19:09:37

yeah, originally I just did an array like `occurences[char]`. But if a char as a value around 100, and there's only four chars in Constants.DNA_Chars, then you have 96 unused indices of the array. Isn't that inefficient?

Rosarch 2009-10-17 19:30:27

If there are only four characters, then use an `enum` to represent the characters, and use (`enum Letter { A,T,C,G }`) and use the enum's `ordinal()` as an index into the array.

Jonathan Feinberg 2009-10-17 19:35:32

what benefit does using an `enum` offer over a `final char[]`?

Rosarch 2009-10-17 21:18:45

I don't see how you could use a `final char[]` to do what I'm suggesting. Perhaps you could sketch it out and provide the idea in a new answer?

Jonathan Feinberg 2009-10-17 21:20:05

Perhaps I'm not sure what you're suggesting. In the "solution" above, I just used int[] as others have suggested. You think your way is better?

Rosarch 2009-10-17 21:40:05

I'm responding to your first comment on this answer. You don't want to index by char value because that creates a sparse array. So you index by the ordinal() of an enum, which creates an array of 4, which isn't sparse.

Jonathan Feinberg 2009-10-17 22:01:23

Answer 2

+4 A:

Profile!
You are searching both String multiple times (every time you call indexOf). create two local bool variables and do the indexOf once only.
don't do: charOccurencesFrag0.get(c), create a local int that holds the same value and refer to that instead of retrieving the value from a map multiple times.

Bartosz Radaczyński 2009-10-17 19:11:47

I'm using Java Visual VM. Is there any way to use that to profile this?

Rosarch 2009-10-17 19:34:01

http://java.sun.com/javase/6/docs/technotes/guides/visualvm/profiler.html

Bartosz Radaczyński 2009-10-17 19:42:35

Answer 3

+2 A:

If i'm not mis-understanding your code, you are only using the charOccurencesFrag*x* for the current character (hence all the the charOccurencesFrag*x*.get(c)) , couldn't you use a simple value for this char (one for frag and one for frag1), and remove the HashMaps completly ?

also, couldn't you use this value instead of the indexOf call (it's searching the string again...) and check if one or both values are 0 ?

Guillaume

PATRY 2009-10-17 19:16:09

Answer 4

+3 A:

Getting rid of HashMaps and the StringUtils.countMatches calls should give a noticeable speedup. I tried rewriting your code without them. Hard to know if it works without the rest of your code and sample data, but I think you will get the general idea.

public static int[] countOccurrences(String x, char[] validDNAChars){
    int[] count=new int[validDNAChars.length];
    for(int i=0;i<x.length();i++){
     int index=-1;
     for(int j=0;j<validDNAChars.length && index<0;j++){
      if (x.charAt(i)==validDNAChars[j]){
       index=j;
      }
     }
     if (index>=0) count[index]++;
    }
    return count;
}

public static int simpleScore(Gene g0, Gene g1) {               
       if (g0.equals(g1)) { //identical genes have distance 0
               return 0;
       }
        String frag0 = g0.getDNA();
        String frag1 = g1.getDNA();


       int[] charOccurencesFrag0=countOccurrences(frag0,Constants.DNA_CHARS);
        int[] charOccurencesFrag1=countOccurrences(frag1,Constants.DNA_CHARS);
        int multSCOST = 0;
        int multDCOST = 0;
        int multDDCOST = 0;
       //look through all acceptable DNA chars
        for(int i=0;i<Constants.DNA_CHARS.length;i++){
                //if a char appears in one frag but not the other
                if ((charOccurencesFrag0[i]>0 && charOccurencesFrag1[i]==0) || (charOccurencesFrag0[i]==0 && charOccurencesFrag1[i]>0)){
                        multDDCOST += charOccurencesFrag0[i]+charOccurencesFrag1[i];
                }

                //if a char appears in both frags
                if (charOccurencesFrag0[i]!=0 && charOccurencesFrag1[i]!=0){
                        int diff = Math.abs(charOccurencesFrag0[i]-charOccurencesFrag1[i]);
                        multDCOST += diff;      
                        multSCOST += Math.min(charOccurencesFrag0[i],charOccurencesFrag1[i]);
                }
        }

        return Constants.SCOST * multSCOST + Constants.DDCOST * multDDCOST + Constants.DCOST * multDCOST;
}

MAK 2009-10-17 20:13:10

the profiler does show that StringUtil.countMatches takes up a lot of time. How would rolling my own be faster?

Rosarch 2009-10-17 21:12:00

I don't think coding your own countMatches is going to be much faster than the one in StringUtil, certainly not for the single char searches used here. What I did was to not search for chars one by one but compute the counts for all chars in one pass.

MAK 2009-10-18 15:29:44

Answer 5

A:

Since you only have four different "things" to count and search for, consider mapping them to the integers zero to three in your logic. This allows you to use plain arrays instead of maps. Additionally your many indexOf might simplify.

Thorbjørn Ravn Andersen 2009-10-17 20:44:12

I don't want to tie myself down to only four "things"...

Rosarch 2009-10-17 21:13:20

Optimization usually exploits your knowledge of the domain. Here you know that you have relatively few valid characters which allow you to assume that a mapping to an index would be cheap.

Thorbjørn Ravn Andersen 2009-10-17 21:47:48

or, use an enum. if you want to add things, add a letter to the enum, and instead of using magic numbers to set array sizes and the like, go off enum.values().length.

Carl 2009-10-17 22:13:46

Answer 6

A:

While it is OK to want to generalize to different 'DNA characters', you don't want this to affect performance. The following code uses a precomputed mapping to remove the need for an inner loop in countOccurrences.

static char[] validDNAChars = ...
static int[] charMapping = new int[128];  // (assumes DNA chars are ASCII)
static {
    for (int i = 0; i < charMapping.length; i++) {
        charMapping[i] = -1;
    }
    for (int j = 0; j < validDNAChars; j++) {
        charMapping[validDNAChars.charAt(j)] = j;
    }
}

...

private static int[] countOccurrences(String x){
    int[] counts = new int[validDNAChars.length];
    int len = x.length();
    for (int i = 0; i < x.length(); i++) {
            int index = charMapping[x.getChar(i)];
            if (index >= 0) count[index]++;
    }
    return counts;
}

Another possible optimization would be to make x a char[] rather than a String.

But it is likely that the biggest gain can be achieved by looking at the way that simpleScore is used. The simpleScore method compares two gene fragments, but I strongly suspect that you are using it to compare a set of X fragments against a set of Y where X and Y are significantly large. Assuming this is the case, you can get a massive speed up by reorganizing your code so that countOccurrences is called exactly once for each distinct fragment. This will reduces the complexity from O(N**2) to O(N) where N is max(X, Y). That's a massive speedup for large values of N.

Stephen C 2009-10-18 09:58:03

ansaurus

tags:

views:

answers:

Java: Calculating a metric of String simliarity

related questions