views:

178

answers:

3

Given 2 strings, I want to find the first match of at least four characters.

This is the code I currently have to do it. It works correctly, but I'm thinking there may be a better way to do this. Are there any screaming inefficiencies or bad practices in what I'm doing? Are there common libraries, like Apache Commons, that I should be taking advantage of but I'm not?

Don't worry about the Gene class - it just contains the String in question. Also - GeneMatch() signifies no match exists, whereas the GeneMatch constructor with arguments signifies a match has been found.

Constants.MIN_MATCH == 4, in this case.

public static GeneMatch findMatch(Gene g0, Gene g1) {

 String g0DNA = g0.getDNA();
 String g1DNA = g1.getDNA();

 if (g0DNA.equals("") || g1DNA.equals("")) { //there won't be a match if one is empty
  return new GeneMatch();
 }

 int g0Left = -1;
 int g0Right = -1;
 int g1Left = -1;
 int g1Right = -1;

 String window;

 for (int inx = 0; inx <= g0DNA.length() - Constants.MIN_MATCH; inx++) {
  window = g0DNA.substring(inx, inx + Constants.MIN_MATCH);

  if (g1DNA.indexOf(window) != -1) {

   g0Left = inx;
   g0Right = inx + Constants.MIN_MATCH;

   g1Left = g1DNA.indexOf(window);
   g1Right = g1Left + Constants.MIN_MATCH;

   /* grow the match to the right
    * while the two right indices are less than the lengths of their respective strings, and the 
    * characters at the indices match, increment each index
    */
   while (g0Right < g0DNA.length() && g1Right < g1DNA.length() && g0DNA.charAt(g0Right) == g1DNA.charAt(g1Right)) {
    g0Right++;
    g1Right++;
   }
   break; //we've already found a match, no need to continue sliding the window
  }
 }

 //now that the indices are found, convert to Genes
 if (g0Left == -1 || g0Right == -1 || g1Left == -1 || g1Right == -1) { //no match found
  return new GeneMatch();
 }

 Gene gL0 = new Gene(g0DNA.substring(0, g0Left));
 Gene gL1 = new Gene(g1DNA.substring(0, g1Left));

 Gene g0match = new Gene(g0DNA.substring(g0Left, g0Right));
 Gene g1match = new Gene(g1DNA.substring(g1Left, g1Right));

 Gene gR0 = new Gene(g0DNA.substring(g0Right));
 Gene gR1 = new Gene(g1DNA.substring(g1Right));

 //sanity check
 assert g0DNA.equals(gL0.getDNA() + g0match.getDNA() + gR0.getDNA()) : "g0 didn't add up";
 assert g1DNA.equals(gL1.getDNA() + g1match.getDNA() + gR1.getDNA()) : "g1 didn't add up";

 return new GeneMatch(gL0, gR0, g0match, g1match, gL1, gR1);

}
+1  A: 

Looks quite okay to me. Just two minor things:

  1. reuse the result of g1DNA.indexOf(window) instead of calling it twice (g1Left = g1DNA.indexOf(window);)

  2. you don't have to check all 4 vars for being == -1 as you all set them at once anyway.

sfussenegger
A: 

Looks good to me. One might go ahead and micro optimize in terms of assignments, but this is the job of the JIT compiler. If you feel the algorithm is too slow, try to profile it.

Mirko Jahn
+2  A: 

Current approach

  1. Double g1DNA.indexOf(window) call - first call result can be stored and reused later;
  2. Unnecessary string objects construction during *window = g0DNA.substring(inx, inx + Constants.MIN_MATCH)*;
  3. Unnecessary gL0, gL1, gR0, gR1 construction in case assertion is off;
  4. if (g0DNA.equals("") || g1DNA.equals("")) check can be improved in order to check that the strings has at least four symbols each;
  5. It always better to call equals() on constant, i.e. use "".equals(arg). It allows to avoid possible NPE if arg is null. It doesn't have much impact here, just a good coding policy to apply;
  6. There is String.isEmpty() method that can be used to replace "".equals(arg);
  7. No null check is performed for the DNA strings;

Improvements

  1. It's better to loop the shortest string, i.e. you should check dna1 and dna2 length and perform outer loop against the one with shorter length. That allows to minimize iterations number;
  2. You can avoid creating new string objects and operate in terms of characters. Moreover, you can modify the algorithm in order to work with any java.lang.CharSequence implementation;
  3. You can remember unmatched sequences, i.e. keep set of char sequences that were checked and proved to be unmatched in order to minimize the time of outer loop iteration. For example you iterate over the string that contains many 'b' chars. You check that the second string doesn't contain that char during first 'b' processing. You can remember that and stop subsequent 'b' processings eagerly;
  4. When you use String.indexOf() the search is performed from start of the string. That may be problem if the string to be search on is rather long. It may be worth to create a characters index for it. I.e. before finding the match you can iterate all target string characters and build mappings like 'character' -> 'set of indexes of their occurrence within the string'. That allows to perform the loop body check much faster in case of long strings;

General consideration There is no 'the one best algorithm' because 'the best' selection depends on input data profile and algorithm usage policy. I.e. if the algorithm is executed rarely and its performance impact is insignificant there is no point in spending a lot of time to its optimization and much better to write a simple code that is easy to maintain. If input strings are rather short there is no point in building characters index etc. In general just try to avoid preliminary optimization whenever possible and carefully consider all input data during choosing resulting algorithm if you really have a bottleneck there.

denis.zhdanov
thanks for the detailed response. Can you specify what you mean by point (4.)? Some of the strings I'm dealing with do get rather long, and this algorithm does get called many times.
Rosarch
and for (3.), what do you think would be the fastest algorithm to store unmatched sequences in? HashSet? LinkedList? ArrayList?
Rosarch
and by "algorithm", I mean "data structure". oops.
Rosarch
Aboit the forth point - when you call String.indexOf(), the string characters are consequently iterated from the zero index in order to find the match. Consider the example when dna2 string starts with 'b' char and dna1 string is a very long string where first 'b' char occurence is located rather far from the start. If you just use String.indexOf() you perform lot of unnecessary comparisons then.
denis.zhdanov
About (3) - let's consider what do we want to achieve. I see the following targets - the data structure should be able to answer if particular char sequence is contained in it and response time should be as less as possible; stored data should be as lightweight as possible; we want to avoid creating new objects whenever possible. So, I see that we want to have a HashSet with objects of our custom class. That class should serve as a wrapper on a char sequence, i.e. we want to store target char sequence and offset and length to use with it.
denis.zhdanov
That implementation is very similar to standard java.lang.String (it also tries to reus the same char[] as much as possible), the only difference is that we'd like to have that class mutable in order to be able to reuse single object of that class during target string chars matching. I.e. the algorithm is the following:
denis.zhdanov
1. We create object of that class before main loop (provide it with reference to the target String/CharSequence object);2. On every iteration we update 'offset' and 'length' properties for that object and check if it is contained at the HashSet;3. If the object is contained at the HashSet we proceed to the next iteration;4. If the object is not contained at the HashSet we check if another string contains the current chars. If it doesn't contain we create new object of our custom class, define target offset and length and store it at HashSet;
denis.zhdanov