views:

86

answers:

4

Hi folks,

I have a huge array list which contains 1000 entries out of which one of the entry is "world". And, I have a word "big world". I want to get the word "big world" matched with "world" in the arraylist.

What is the most cost effective way of doing it? I cannot use .contains method of array list, and If I traverse all the 1000 entries and match them by pattern its going to be very costly in terms of performance. I am using Java for this.

Could you please let me know what is the best way for this?

Cheers, J

+1  A: 

Assuming you dont know the content of the arraylist elements. you will have to traverse the whole arraylist.

Traversing the arraylist would cost you O(n).

Sorting the arraylist wouldnt help you because you are talking about a searching a string in a set of strings. and still sorting would be more expensive. O(nlogn)

A: 

If you have to search the list repeatedly, it may make sense to use the sort() and binarySearch() methods of Collections.

Addendum: As noted by @user177883, the cost of an O(n log n) sort must be weighed against the benefit of subsequent O(log n) searches.

The word "heart" matches the [word] "ear".

As an exact match is insufficient, this approach would be inadequate.

trashgod
sorting would be more expensive than searching.
I could do that, but if you see the binarySearch is written to return an exact match. Though I may write a customized Comparator, it may be tough to identify a loose match.
Abhishek
how do you know if the user want to stop when user finds the value. what is the user wants to find all occurences of the string. then you will have to employ many binary search. each time you find an occurence, you remove it from the adt and you do another binary search, and the worst case you might end up doing n binary searches. your worst case complexity would be 2nlogn. which is very in efficient compared to sequential search.
An inexpensive binary search would make checking each word in the candidate string for an exact match feasible. As @Moron notes, it may be useful to clarify your match criteria.
trashgod
+1  A: 

You can split up every single element of the ArrayList into words and stop as soon as you find one of them.

I suppose by your profile you develop in Java, with Lucene you would easily do something like that

public class NodesAnalyzer extends Analyzer {   
    public TokenStream tokenStream(String fieldName, Reader reader) {

        Tokenizer tokenizer = new StandardTokenizer(reader)
        TokenFilter lowerCaseFilter = new LowerCaseFilter(tokenizer)
        TokenFilter stopFilter = new StopFilter(lowerCaseFilter, Data.stopWords.collect{ it.text } as String[])
        SnowballFilter snowballFilter = new SnowballFilter(stopFilter, new org.tartarus.snowball.ext.ItalianStemmer())

        return snowballFilter
    }   
}

    Analyzer analyzer = new NodesAnalyzer()

    TokenStream ts = analyzer.tokenStream(null, new StringReader(str)); 
    Token token = ts.next()

    while (token != null) {
       String cur = token.term()
       token = ts.next();
    }

Note: this is Groovy code that I copied from a personal project so you will have to translate things like Data.stopWords.collect{ it.text } as String[] to use with plain Java

Jack
Lucene would be a good fit for this, especially if it grows beyond 1000 words.
bwawok
Hi Jack, Does the sample code use Lucene?
Abhishek
the one I pasted? yes..
Jack
A: 

I have assumed that given an input string you need to find a string in that set, which is a substring of the given input string (not the other way round).

You can try to make your matching fast by inserting all the strings in a Trie and getting rid of the ArrayList. You first append a $ to each string, and then insert into the trie.

Now given an input string, you start at first letter on input and walk down the trie till you are able to hit a node with $ (indicating that some word of 1000 is a prefix of your given word). If not, you move to second letter of input and walk down the trie again until you find a match.

This should be worst case O(N^2), where N is the length of the input string, but in practice this will be much faster.

Moron
@jadaaih: Any questions regarding this answer?
Moron