views:

243

answers:

4

I am comparing substrings in two large text files. Very simple: I tokenize into two token containers and compare with 2 for loops. Performance is disastrous! Does anybody have any advice or an idea on how to improve performance?

int tokenFileA = 1; // running token position in file A

for (int s = 0; s < txtA.TokenContainer.size(); s++) {
    String strTxtA = txtA.getSubStr(s);
    int strLengthA = txtA.getNumToken(s);

    // Only consider substrings of at least the configured minimum length.
    if (strLengthA >= dp.getMinStrLength()) {
        int tokenFileB = 1; // running token position in file B

        for (int t = 0; t < txtB.TokenContainer.size(); t++) {
            String strTxtB = txtB.getSubStr(t);
            int strLengthB = txtB.getNumToken(t);

            if (strTxtA.equalsIgnoreCase(strTxtB)) {
                try {
                    // Record the start/end token positions of the common
                    // substring in both files.
                    SubStrTemp subStrTemp = new SubStrTemp(
                        txtA.ID, txtB.ID, tokenFileA, tokenFileB,
                        (tokenFileA + strLengthA - 1),
                        (tokenFileB + strLengthB - 1));

                    if (!subStrContainer.contains(subStrTemp)) {
                        subStrContainer.addElement(subStrTemp);
                    }
                } catch (Exception ex) {
                    logger.error("error", ex);
                }
            }
            tokenFileB += strLengthB;
        }
        tokenFileA += strLengthA;
    }
}

In general, my code reads two large strings with the Java tokenizer into containers A and B and then compares the substrings. The positions of substrings that exist in both strings should be stored in a Vector. But performance is awful, and I don't really know how to solve this with a HashMap.

+2  A: 

You are doing a join with nested loops? Yes, that is O(n^2). What about doing a hash join instead? That is, create a map from (lowercased) strText to t and do lookups with this map rather than iterating over the token container?
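Roughly like this (just a sketch against the methods in your snippet; txtA, txtB, getSubStr and TokenContainer come from the question, the rest is plain java.util):

// Index file B once: lowercased token text -> first token index in B.
Map<String, Integer> indexB = new HashMap<String, Integer>();
for (int t = 0; t < txtB.TokenContainer.size(); t++) {
    String key = txtB.getSubStr(t).toLowerCase();
    if (!indexB.containsKey(key)) {
        indexB.put(key, t);
    }
}

// One pass over file A: a single O(1) lookup per token instead of a scan of B.
for (int s = 0; s < txtA.TokenContainer.size(); s++) {
    Integer t = indexB.get(txtA.getSubStr(s).toLowerCase());
    if (t != null) {
        // token s of A matches token t of B
    }
}

If the same token can occur several times in B, the value type can become a List<Integer> instead of a single index, as in the comments further down.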

meriton
Hello meriton, thank you for helping as well. Yes, I did it that way, but I don't want to anymore. Performance was also OK with small strings; another reason for the nested loops was that I could store the strPositions (of identical substrings) (almost) sorted in a Vector.
jackdaniels
Yes, I get it: there is no way around hashing and mapping. I have to learn it. :-( Can you please tell me how to do a hash join in Java? I didn't find any Java-related example of a hash join on Google. And if I do a hash join, how can I store the subStr positions? Those are necessary to keep.
jackdaniels
Please also tell me why it is necessary to lowercase the String. And how can I create a "map from (lowercased) strText to t"? I didn't really understand that sentence. Thanks in advance.
jackdaniels
lowercase is so that your program will consider "Token" and "token" to be the same word.
James
Yes, I know what lowercasing does. But what kind of benefit do I get from using it in a (Hash)Map-related context? (... "Integer" != "integer" ...)
jackdaniels
A: 

Put the tokens of fileA into a trie data structure. Then when tokenising fileB you can check quite quickly if these tokens are in the trie. A few code comments would help.
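A bare-bones trie could look something like this (illustration only; it uses java.util.Map/HashMap for the child links):

class TrieNode {
    Map<Character, TrieNode> children = new HashMap<Character, TrieNode>();
    boolean isWord;
}

class Trie {
    private final TrieNode root = new TrieNode();

    // Insert one token of fileA.
    void insert(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            TrieNode next = node.children.get(c);
            if (next == null) {
                next = new TrieNode();
                node.children.put(c, next);
            }
            node = next;
        }
        node.isWord = true;
    }

    // Check a token of fileB in O(length of the token).
    boolean contains(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.isWord;
    }
}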

James
Thanks James, which data structure would you suggest to use?
jackdaniels
More comments, with pleasure. I am reading the txt files with the help of the Java tokenizer into a string and then trying to search for substrings of DocA in DocB. I do this in 2 cases: in the 1st case the substring length is constant, in the 2nd case the substring length varies; for that reason I added "if (strLengthA >= dp.getMinStrLength())" to reduce the iterations for very short substrings.
jackdaniels
A Trie: http://en.wikipedia.org/wiki/Trie.
James
+6  A: 

Your main problem is that you go through all of txtB for each token in txtA.

You should store information on the tokens from txtA (in a HashMap, for instance) and then, in a second loop (not a nested one), compare the strings of txtB with the ones already in the Map.


Colin Hebert
Thank you Colin Hebert. "Nested" -> "for(){ for(){} }", "not nested" -> "for(){}, for(){}", right? I am really afraid of HashMap; I have never coded with it before. As far as I know, with a HashMap I have to use a HashSet, and there redundant tokens get removed!? OK, I don't need the duplicates, but I do need their positions. Can you please tell me whether I can store and retrieve token positions with a HashMap?
jackdaniels
It's exactly that for the nested/not nested. If you want to keep the positions, you can use a `HashMap<String, List<Integer>>`, so that for each word you have a list of its positions. Or better, instead of Integer, use your own structure with filename, position and other information.
Colin Hebert
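A minimal sketch of that structure, built against the method names from the question (the positions are tracked the same way tokenFileA is in the original loop):

// Each word of txtA maps to the list of token positions where it starts.
Map<String, List<Integer>> positionsA = new HashMap<String, List<Integer>>();
int tokenFileA = 1;
for (int s = 0; s < txtA.TokenContainer.size(); s++) {
    String word = txtA.getSubStr(s);
    List<Integer> positions = positionsA.get(word);
    if (positions == null) {
        positions = new ArrayList<Integer>();
        positionsA.put(word, positions);
    }
    positions.add(tokenFileA);          // where this word starts in A
    tokenFileA += txtA.getNumToken(s);  // advance by the token's length
}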
Huh... I think I implemented your suggestion, Colin, but somehow I am unable to get the HashMap parameters right. Can you please have a look? The code is here: <http://pastebin.com/wScB5RSy>
jackdaniels
You should try this : http://pastebin.com/ybaYBFj9
Colin Hebert
Wow! Your code is just beautiful!! How many years of programming experience are behind this style!? It seems to be the hash join that "meriton" suggested to me before. Right?
jackdaniels
It is kind of like the "HashJoin" of @meriton. But I kept your code, so it doesn't remove punctuation and doesn't compare with lower case words.
Colin Hebert
But now the HashMap is not sorted, and the positions are completely out of order... How can I retrieve the tokens in the right order? To get the start/end of a text, do I have to search for every token in the entire map? And then I have to combine successive tokens? How can I manage this without nested loops?
jackdaniels
Thank you!! Punctuation and other stuff I can manage with the StringTokenizer utils ;-). It compares the original words, and this is what I need. Or is there any other advantage to changing the string case? ...a performance or match-count improvement?
jackdaniels
You can replace the Sets by Lists if you want to preserve the order.
Colin Hebert
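One way to get the matches back in document order (a sketch only, reusing the positionsA map from the sketch above): walk file B's tokens front to back and look each one up, so the output follows B's order without any second nested scan.

for (int t = 0; t < txtB.TokenContainer.size(); t++) {
    String word = txtB.getSubStr(t);
    List<Integer> startsInA = positionsA.get(word);  // map built from txtA
    if (startsInA != null) {
        for (int startA : startsInA) {
            // word starts at position startA in A and is token t in B
        }
    }
}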
Thanks a lot, Colin!! It was a great example. I still have to use 2 StringTokenizer while-loops (O(n^2)?), but performance has improved dramatically.
jackdaniels
I don't think that a faster way exists, but I would be glad to see it if it does.
Colin Hebert
A: 

As said, this is an issue of complexity: your algorithm runs in O(n^2) instead of the O(n) you would get using a hash.

For second-order improvements, try to make fewer function calls; for example, you can get the size once:

int sizeB = txtB.TokenContainer.size();

Depending on the size, you may also read the container once into an array of strings to save the getStr... calls.
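For instance (a sketch only, hoisting the loop-invariant calls out of the loops of the original code; txtA, txtB and getSubStr are the names from the question):

int sizeA = txtA.TokenContainer.size();
int sizeB = txtB.TokenContainer.size();

// Read B's strings once instead of calling getSubStr(t) again for every s.
String[] strsB = new String[sizeB];
for (int t = 0; t < sizeB; t++) {
    strsB[t] = txtB.getSubStr(t);
}

for (int s = 0; s < sizeA; s++) {
    String strTxtA = txtA.getSubStr(s);
    for (int t = 0; t < sizeB; t++) {
        if (strTxtA.equalsIgnoreCase(strsB[t])) {
            // handle the match as before
        }
    }
}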

Roni

roni
Thanks Roni, I was not sure whether the function calls would cost any performance. But of course, especially "txtB.TokenContainer.size();": the program calls it on every single iteration, which is absolutely unnecessary.
jackdaniels