views:

243

answers:

4

I am comparing substrings in two large text files. Very simple: I tokenize into two token containers and compare with 2 for loops. Performance is disastrous! Does anybody have any advice or an idea on how to improve performance?

int tokenFileA = 1; // running token position in file A

for (int s = 0; s < txtA.TokenContainer.size(); s++) {
    String strTxtA = txtA.getSubStr(s);
    int strLengthA = txtA.getNumToken(s);

    // Only consider substrings of at least the configured minimum length.
    if (strLengthA >= dp.getMinStrLength()) {
        int tokenFileB = 1; // running token position in file B

        for (int t = 0; t < txtB.TokenContainer.size(); t++) {
            String strTxtB = txtB.getSubStr(t);
            int strLengthB = txtB.getNumToken(t);

            if (strTxtA.equalsIgnoreCase(strTxtB)) {
                try {
                    // Record the start/end token positions of the common
                    // substring in both files.
                    SubStrTemp subStrTemp = new SubStrTemp(
                        txtA.ID, txtB.ID, tokenFileA, tokenFileB,
                        (tokenFileA + strLengthA - 1),
                        (tokenFileB + strLengthB - 1));

                    if (!subStrContainer.contains(subStrTemp)) {
                        subStrContainer.addElement(subStrTemp);
                    }
                } catch (Exception ex) {
                    logger.error("error", ex);
                }
            }
            tokenFileB += strLengthB;
        }
        tokenFileA += strLengthA;
    }
}

In general, my code reads two large strings with the Java tokenizer into containers A and B and then compares the substrings. The positions of substrings that exist in both strings should be stored in a Vector. But performance is awful, and I don't really know how to solve this with a HashMap.

+2  A: 

You are doing a join with nested loops? Yes, that is O(n^2). What about doing a hash join instead? That is, create a map from (lowercased) strText to t and do lookups with this map rather than iterating over the token container?
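Roughly like this (just a sketch against the methods in your snippet; txtA, txtB, getSubStr and TokenContainer come from the question, the rest is plain java.util):

// Index file B once: lowercased token text -> first token index in B.
Map<String, Integer> indexB = new HashMap<String, Integer>();
for (int t = 0; t < txtB.TokenContainer.size(); t++) {
    String key = txtB.getSubStr(t).toLowerCase();
    if (!indexB.containsKey(key)) {
        indexB.put(key, t);
    }
}

// One pass over file A: a single O(1) lookup per token instead of a scan of B.
for (int s = 0; s < txtA.TokenContainer.size(); s++) {
    Integer t = indexB.get(txtA.getSubStr(s).toLowerCase());
    if (t != null) {
        // token s of A matches token t of B
    }
}

If the same token can occur several times in B, the value type can become a List<Integer> instead of a single index, as in the comments further down.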

meriton
Hello meriton, thank you for helping as well. Yes, I did it that way, but I don't want to anymore. Performance was also OK with small strings; another reason for the nested loops was that I could store the strPositions (of identical substrings) (almost) sorted in a Vector.
jackdaniels
Yes, I get it: there is no way around hashing and mapping. I have to learn it. :-( Can you please tell me how to do a hash join in Java? I didn't find any Java-related example of a hash join on Google. And if I do a hash join, how can I store the subStr positions? Those are necessary to keep.
jackdaniels
Please also tell me why it is necessary to lowercase the String. And how can I create a "map from (lowercased) strText to t"? I didn't really understand that sentence. Thanks in advance.
jackdaniels
lowercase is so that your program will consider "Token" and "token" to be the same word.
James
Yes, I know what lowercasing does. But what kind of benefit do I get from using it in a (Hash)Map-related context? (... "Integer" != "integer" ...)
jackdaniels
A: 

Put the tokens of fileA into a trie data structure. Then when tokenising fileB you can check quite quickly if these tokens are in the trie. A few code comments would help.
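A bare-bones trie could look something like this (illustration only; it uses java.util.Map/HashMap for the child links):

class TrieNode {
    Map<Character, TrieNode> children = new HashMap<Character, TrieNode>();
    boolean isWord;
}

class Trie {
    private final TrieNode root = new TrieNode();

    // Insert one token of fileA.
    void insert(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            TrieNode next = node.children.get(c);
            if (next == null) {
                next = new TrieNode();
                node.children.put(c, next);
            }
            node = next;
        }
        node.isWord = true;
    }

    // Check a token of fileB in O(length of the token).
    boolean contains(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.isWord;
    }
}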

James
Thanks James, which data structure would you suggest to use?
jackdaniels
More comments, with pleasure. I am reading the txt files with the help of the Java tokenizer into a string and then trying to search for substrings of DocA in DocB. I do this in 2 cases: in the 1st case the substring length is constant, in the 2nd case the substring length varies; for that reason I added "if (strLengthA >= dp.getMinStrLength())" to reduce the iterations for very short substrings.
jackdaniels
A Trie: http://en.wikipedia.org/wiki/Trie.
James
+6  A: 

Your main problem is that you go through all of txtB for each token in txtA.

You should store information on the tokens from txtA (in a HashMap, for instance) and then, in a second loop (not a nested one), compare the strings of txtB with the ones already in the Map.


Colin Hebert
Thank you Colin Hebert. "Nested" -> "for(){ for(){} }", "not nested" -> "for(){}, for(){}", right? I am really afraid of HashMap; I have never coded with it before. As far as I know, with a HashMap I have to use a HashSet, and there redundant tokens get removed!? OK, I don't need the duplicates, but I do need their positions. Can you please tell me whether I can store and retrieve token positions with a HashMap?
jackdaniels
It's exactly that for the nested/not nested. If you want to keep the positions, you can use a `HashMap<String, List<Integer>>`, so that for each word you have a list of its positions. Or better, instead of Integer, use your own structure with filename, position and other information.
Colin Hebert
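A minimal sketch of that structure, built against the method names from the question (the positions are tracked the same way tokenFileA is in the original loop):

// Each word of txtA maps to the list of token positions where it starts.
Map<String, List<Integer>> positionsA = new HashMap<String, List<Integer>>();
int tokenFileA = 1;
for (int s = 0; s < txtA.TokenContainer.size(); s++) {
    String word = txtA.getSubStr(s);
    List<Integer> positions = positionsA.get(word);
    if (positions == null) {
        positions = new ArrayList<Integer>();
        positionsA.put(word, positions);
    }
    positions.add(tokenFileA);          // where this word starts in A
    tokenFileA += txtA.getNumToken(s);  // advance by the token's length
}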
Huh... I think I implemented your suggestion, Colin, but somehow I am unable to get the HashMap parameters right. Can you please have a look? The code is here: <http://pastebin.com/wScB5RSy>
jackdaniels
You should try this : http://pastebin.com/ybaYBFj9
Colin Hebert
Wow! Your code is just beautiful!! How many years of programming experience are behind this style!? It seems to be the hash join that "meriton" suggested to me before. Right?
jackdaniels
It is kind of like the "HashJoin" of @meriton. But I kept your code, so it doesn't remove punctuation and doesn't compare with lower case words.
Colin Hebert
But now the HashMap is not sorted, and the positions are completely out of order... How can I retrieve the tokens in the right order? To get the start/end of a text, do I have to search for every token in the entire map? And then I have to combine successive tokens? How can I manage this without nested loops?
jackdaniels
Thank you!! Punctuation and other stuff I can manage with the StringTokenizer utils ;-). It compares the original words, and this is what I need. Or is there any other advantage to changing the string case? ...a performance or match-count improvement?
jackdaniels
You can replace the Sets by Lists if you want to preserve the order.
Colin Hebert
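One way to get the matches back in document order (a sketch only, reusing the positionsA map from the sketch above): walk file B's tokens front to back and look each one up, so the output follows B's order without any second nested scan.

for (int t = 0; t < txtB.TokenContainer.size(); t++) {
    String word = txtB.getSubStr(t);
    List<Integer> startsInA = positionsA.get(word);  // map built from txtA
    if (startsInA != null) {
        for (int startA : startsInA) {
            // word starts at position startA in A and is token t in B
        }
    }
}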
Thanks a lot, Colin!! It was a great example. I still have to use 2 StringTokenizer while-loops (O(n^2)?), but performance has improved dramatically.
jackdaniels
I don't think that a faster way exists, but I would be glad to see it if it does.
Colin Hebert
A: 

As said, this is an issue of complexity: your algorithm runs in O(n^2) instead of the O(n) you would get using a hash.

For second-order improvements, try to make fewer function calls; for example, you can get the size once:

int sizeB = txtB.TokenContainer.size();

Depending on the size, you may also read the container once into an array of strings to save the getStr... calls.
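For instance (a sketch only, hoisting the loop-invariant calls out of the loops of the original code; txtA, txtB and getSubStr are the names from the question):

int sizeA = txtA.TokenContainer.size();
int sizeB = txtB.TokenContainer.size();

// Read B's strings once instead of calling getSubStr(t) again for every s.
String[] strsB = new String[sizeB];
for (int t = 0; t < sizeB; t++) {
    strsB[t] = txtB.getSubStr(t);
}

for (int s = 0; s < sizeA; s++) {
    String strTxtA = txtA.getSubStr(s);
    for (int t = 0; t < sizeB; t++) {
        if (strTxtA.equalsIgnoreCase(strsB[t])) {
            // handle the match as before
        }
    }
}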

Roni

roni
Thanks Roni, I was not sure whether the function calls would cost any performance. But of course, especially "txtB.TokenContainer.size();": the program calls it on every single iteration, which is absolutely unnecessary.
jackdaniels