ansaurus

Question

Java: Finding matches between Strings

Answer 1

+1 A:

Looks quite okay to me. Just two minor things:

Reuse the result of g1DNA.indexOf(window) instead of calling it twice (g1Left = g1DNA.indexOf(window);).
You don't have to check all four variables against -1, since you set them all at once anyway.

sfussenegger 2009-10-16 07:08:12

Answer 2

A:

Looks good to me. One could micro-optimize the assignments, but that is the JIT compiler's job. If you feel the algorithm is too slow, profile it.

Mirko Jahn 2009-10-16 07:09:45

Answer 3

+2 A:

Current approach

g1DNA.indexOf(window) is called twice; the result of the first call can be stored and reused;
Unnecessary String objects are constructed by *window = g0DNA.substring(inx, inx + Constants.MIN_MATCH)*;
gL0, gL1, gR0 and gR1 are constructed unnecessarily when assertions are off;
The if (g0DNA.equals("") || g1DNA.equals("")) check can be improved to verify that each string has at least four symbols;
It is always better to call equals() on the constant, i.e. use "".equals(arg). That avoids a possible NPE if arg is null. It doesn't have much impact here; it is just a good coding habit;
There is a String.isEmpty() method that can replace "".equals(arg) when arg is known to be non-null;
No null check is performed for the DNA strings;
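A minimal sketch of how several of these points could be applied together. The question's original code is not shown here, so the class name, the method shape, and the value of MIN_MATCH are assumptions made for illustration; only g0DNA, g1DNA, window, inx and g1Left are names taken from the discussion.

```java
final class MatchSketch {
    /** Hypothetical stand-in for Constants.MIN_MATCH; the real value is not shown. */
    static final int MIN_MATCH = 4;

    /**
     * Returns {start in g0DNA, start in g1DNA} for the first window of
     * MIN_MATCH characters of g0DNA that also occurs in g1DNA, or null.
     */
    static int[] firstMatch(String g0DNA, String g1DNA) {
        // Null check plus length check replaces the equals("") test.
        if (g0DNA == null || g1DNA == null
                || g0DNA.length() < MIN_MATCH || g1DNA.length() < MIN_MATCH) {
            return null;
        }
        for (int inx = 0; inx + MIN_MATCH <= g0DNA.length(); inx++) {
            String window = g0DNA.substring(inx, inx + MIN_MATCH);
            // Single indexOf call; the result is stored and reused.
            int g1Left = g1DNA.indexOf(window);
            if (g1Left != -1) {
                return new int[] {inx, g1Left};
            }
        }
        return null;
    }
}
```

This version still builds one substring per window; it only addresses the double call and the input checks.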

Improvements

It's better to loop over the shorter string, i.e. compare the lengths of dna1 and dna2 and run the outer loop over the shorter one. That minimizes the number of iterations;
You can avoid creating new String objects and work in terms of characters. Moreover, you can modify the algorithm to work with any java.lang.CharSequence implementation;
You can remember unmatched sequences, i.e. keep a set of char sequences that were checked and proved to be unmatched, in order to cut the time spent per outer-loop iteration. For example, suppose you iterate over a string that contains many 'b' chars. While processing the first 'b' you find that the second string doesn't contain that char. You can remember that and skip every subsequent 'b' early;
When you use String.indexOf(), the search starts from the beginning of the string. That may be a problem if the string being searched is rather long. It may be worth building a character index for it, i.e. before looking for matches you iterate over all characters of the target string and build mappings like 'character' -> 'set of indexes where it occurs in the string'. That makes the loop-body check much faster for long strings;
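The first two improvements can be sketched as follows. This is an illustrative sketch, not the code from the question: the class and method names are hypothetical, minMatch is assumed to be positive, and both inputs are assumed to be non-null. It compares characters in place instead of building substrings, accepts any CharSequence, and runs the outer loop over the shorter input.

```java
final class CharSequenceMatch {
    /** True if a[aStart, aStart+len) and b[bStart, bStart+len) hold the same chars. */
    static boolean regionEquals(CharSequence a, int aStart,
                                CharSequence b, int bStart, int len) {
        for (int i = 0; i < len; i++) {
            if (a.charAt(aStart + i) != b.charAt(bStart + i)) {
                return false;
            }
        }
        return true;
    }

    /** First index in haystack where needle[start, start+len) occurs, or -1. */
    static int indexOfWindow(CharSequence haystack, CharSequence needle,
                             int start, int len) {
        for (int i = 0; i + len <= haystack.length(); i++) {
            if (regionEquals(haystack, i, needle, start, len)) {
                return i;
            }
        }
        return -1;
    }

    /** Returns {index in dna1, index in dna2} of a common window, or null. */
    static int[] firstMatch(CharSequence dna1, CharSequence dna2, int minMatch) {
        boolean firstIsShorter = dna1.length() <= dna2.length();
        CharSequence shorter = firstIsShorter ? dna1 : dna2;
        CharSequence longer = firstIsShorter ? dna2 : dna1;
        // Outer loop over the shorter input; no String objects are created.
        for (int i = 0; i + minMatch <= shorter.length(); i++) {
            int j = indexOfWindow(longer, shorter, i, minMatch);
            if (j != -1) {
                return firstIsShorter ? new int[] {i, j} : new int[] {j, i};
            }
        }
        return null;
    }
}
```

Because the parameters are CharSequence, a call such as CharSequenceMatch.firstMatch("AACGTT", new StringBuilder("TTCGTTA"), 4) works without converting the StringBuilder to a String.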

General consideration: there is no single 'best algorithm', because the best choice depends on the input data profile and on how the algorithm is used. I.e. if the algorithm is executed rarely and its performance impact is insignificant, there is no point in spending a lot of time optimizing it; it is much better to write simple code that is easy to maintain. If the input strings are rather short, there is no point in building a character index, etc. In general, try to avoid premature optimization whenever possible, and carefully consider all input data when choosing the final algorithm if you really have a bottleneck there.

denis.zhdanov 2009-10-16 07:18:01

thanks for the detailed response. Can you specify what you mean by point (4.)? Some of the strings I'm dealing with do get rather long, and this algorithm does get called many times.

Rosarch 2009-10-16 19:55:05

and for (3.), what do you think would be the fastest algorithm to store unmatched sequences in? HashSet? LinkedList? ArrayList?

Rosarch 2009-10-16 19:56:55

and by "algorithm", I mean "data structure". oops.

Rosarch 2009-10-17 01:09:45

About the fourth point: when you call String.indexOf(), the string's characters are iterated sequentially from index zero in order to find the match. Consider the case where the dna2 string starts with a 'b' char and dna1 is a very long string whose first 'b' occurrence is located rather far from the start. If you just use String.indexOf(), you perform a lot of unnecessary comparisons.
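A rough sketch of such a character index, under the assumption that it is built once per long string and reused for every window. The class and method names are hypothetical, and len is assumed to be at least 1. How much this helps depends on the alphabet size and the input, so it is worth profiling before adopting it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CharIndexSketch {
    /** Maps each character to the ascending list of positions where it occurs. */
    static Map<Character, List<Integer>> buildIndex(CharSequence text) {
        Map<Character, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < text.length(); i++) {
            index.computeIfAbsent(text.charAt(i), k -> new ArrayList<>()).add(i);
        }
        return index;
    }

    /**
     * Finds source[start, start+len) inside text, trying only the positions
     * where the window's first character is known to occur. Returns -1 if absent.
     */
    static int indexOfWindow(CharSequence text, Map<Character, List<Integer>> index,
                             CharSequence source, int start, int len) {
        List<Integer> candidates = index.get(source.charAt(start));
        if (candidates == null) {
            return -1; // the first character never occurs in text
        }
        for (int pos : candidates) {
            if (pos + len > text.length()) {
                break; // positions are ascending, so later ones cannot fit either
            }
            boolean same = true;
            for (int i = 1; i < len && same; i++) {
                same = text.charAt(pos + i) == source.charAt(start + i);
            }
            if (same) {
                return pos;
            }
        }
        return -1;
    }
}
```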

denis.zhdanov 2009-10-17 13:06:59

About (3): let's consider what we want to achieve. I see the following goals: the data structure should be able to answer whether a particular char sequence is contained in it, and the response time should be as short as possible; the stored data should be as lightweight as possible; and we want to avoid creating new objects whenever possible. So I think we want a HashSet holding objects of a custom class. That class should serve as a wrapper around a char sequence, i.e. it stores the target char sequence plus the offset and length to use with it.

denis.zhdanov 2009-10-17 13:07:39

That implementation is very similar to the standard java.lang.String (it also tries to reuse the same char[] as much as possible); the only difference is that we'd like the class to be mutable, so that a single object of that class can be reused while matching the target string's chars. I.e. the algorithm is the following:

denis.zhdanov 2009-10-17 13:16:42

1. Create an object of that class before the main loop (give it a reference to the target String/CharSequence object);
2. On every iteration, update the 'offset' and 'length' properties of that object and check whether it is contained in the HashSet;
3. If the object is contained in the HashSet, proceed to the next iteration;
4. If the object is not contained in the HashSet, check whether the other string contains the current chars. If it doesn't, create a new object of our custom class, set its offset and length, and store it in the HashSet;
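The four steps above can be sketched as follows. All class and method names here are hypothetical, and the sketch assumes non-null inputs and a positive minMatch. One detail worth noting: only the reusable probe object is ever mutated. Objects stored in the HashSet are separate copies that are never changed afterwards, because changing an object inside a HashSet would invalidate its hash bucket.

```java
import java.util.HashSet;
import java.util.Set;

/** Mutable view of source[offset, offset+length); equality is by characters. */
final class CharWindow {
    private final CharSequence source;
    private int offset;
    private int length;

    CharWindow(CharSequence source, int offset, int length) {
        this.source = source;
        this.offset = offset;
        this.length = length;
    }

    void moveTo(int newOffset, int newLength) {
        offset = newOffset;
        length = newLength;
    }

    /** Snapshot that shares the source but is never mutated once stored. */
    CharWindow copy() {
        return new CharWindow(source, offset, length);
    }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CharWindow)) return false;
        CharWindow other = (CharWindow) o;
        if (length != other.length) return false;
        for (int i = 0; i < length; i++) {
            if (source.charAt(offset + i) != other.source.charAt(other.offset + i)) {
                return false;
            }
        }
        return true;
    }

    @Override public int hashCode() {
        int h = 0;
        for (int i = 0; i < length; i++) {
            h = 31 * h + source.charAt(offset + i);
        }
        return h;
    }
}

final class UnmatchedCache {
    /** Returns {index in dna1, index in dna2} of a common window, or null. */
    static int[] firstMatch(String dna1, String dna2, int minMatch) {
        Set<CharWindow> unmatched = new HashSet<>();
        CharWindow probe = new CharWindow(dna1, 0, minMatch);      // step 1
        for (int i = 0; i + minMatch <= dna1.length(); i++) {
            probe.moveTo(i, minMatch);                             // step 2
            if (unmatched.contains(probe)) {
                continue;                                          // step 3
            }
            for (int j = 0; j + minMatch <= dna2.length(); j++) {  // step 4
                if (dna2.regionMatches(j, dna1, i, minMatch)) {
                    return new int[] {i, j};
                }
            }
            unmatched.add(probe.copy()); // remember the failed window
        }
        return null;
    }
}
```

Each stored CharWindow holds only a reference to the source plus two ints, which keeps the cached data lightweight, and no new object is created for windows that match or that are already in the set.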

denis.zhdanov 2009-10-17 13:19:53
