
Hello Stack Overflow people. I'd like some suggestions regarding the following problem. I am using Java.

I have an array #1 with a number of Strings. For example, two of the strings might be: "An apple fell on Newton's head" and "Apples grow on trees".

On the other side, I have another array #2 with terms like (Fruits => Apple, Orange, Peach; Items => Pen, Book; ...). I'd call this array my "dictionary".

By comparing items from one array to the other, I need to determine which "category" from #2 each item in #1 falls into. E.g. both strings from #1 would fall under "Fruits".

My most important consideration is speed. I need to do those operations fast. A structure allowing constant time retrieval would be good.

I considered a HashSet with the contains() method, but it doesn't allow substring matching. I also tried a regex like (apple|orange|peach|...etc) with the case-insensitive flag on, but I read that it will not be fast once the number of terms grows (at least 200 are expected). Finally, I searched and am considering an ArrayList with indexOf(), but I don't know about its performance. I also need to know which of the terms actually matched; in this case, it would be "Apple".

Please provide your views, ideas and suggestions on this problem.

I looked at the Aho-Corasick algorithm, but the keywords/terms are very likely to change often, so I don't think I can use it. Also, I'm no expert in text mining or maths, so please elaborate on complex concepts.

Thank you, Stack Overflow people, for your time! :)

+2  A: 

Would a suffix tree or similar data structure work for your application? It offers O(m) string lookup, where m is the length of the search string, after an O(n²) initial setup (or better with some trickery), and, with some extra effort, you can associate arbitrary data, such as a reference to a category, with complete words in your dictionary. If you don't want to code it yourself, I believe the BioJava library includes an implementation.

You can also add strings to a suffix tree after the initial setup, although the cost will still be around O(n²). That's probably not a big deal if you're adding short words.
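As a rough illustration only (not a full suffix tree, and the class below is made up for this sketch): a small character trie over the dictionary words, with the category stored at the end of each word, shows how category data can hang off complete words. Lookup here is O(text length × longest dictionary word), not the O(m) a real suffix tree gives you.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch only: a character trie over the dictionary words,
    // with a category stored at the end of each word. Not a suffix tree;
    // every starting offset of the text is tried against the trie.
    class DictionaryTrie {
        private static class Node {
            Map<Character, Node> children = new HashMap<Character, Node>();
            String category; // non-null only at the end of a dictionary word
        }

        private final Node root = new Node();

        void add(String word, String category) {
            Node node = root;
            for (char c : word.toLowerCase().toCharArray()) {
                Node next = node.children.get(c);
                if (next == null) {
                    next = new Node();
                    node.children.put(c, next);
                }
                node = next;
            }
            node.category = category;
        }

        // Returns the category of the first dictionary word found anywhere
        // in the text, or null if nothing matches.
        String findCategory(String text) {
            String lower = text.toLowerCase();
            for (int start = 0; start < lower.length(); start++) {
                Node node = root;
                for (int i = start; i < lower.length(); i++) {
                    node = node.children.get(lower.charAt(i));
                    if (node == null) break;
                    if (node.category != null) return node.category;
                }
            }
            return null;
        }
    }

With add("apple", "Fruits"), findCategory("An apple fell on Newton's head") would return "Fruits".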

MattK
Do note that suffix trees are *linear* (in both space and time) structures.
ariels
You're right--that'll teach me to answer questions first thing in the morning. Of course, the search is linear in the length of the search string, not the length of the strings contained in the tree, which is still pretty efficient. Anyway, edited the answer to reflect that.
MattK
You may want to consider using Knuth-Morris-Pratt with the Trie, but that may or may not give you a speed increase (and if it does you may or may not care).
Ken Bloom
+3  A: 

If you use a multimap from Google Collections, they have a function to invert the map (so you can start with a map like {"Fruits" => [Apple]} and produce a map like {"Apple" => ["Fruits"]}). Then you can look up a word and find the list of categories for it in one call to the map.

I would expect to split the strings myself and look up the words in the map one at a time, so that I could do stemming (adjusting for different word endings) and stopword filtering. Using the map should give good lookup times, plus it's easy to try out.
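For concreteness, a minimal sketch of that approach using Guava (the successor to Google Collections); the dictionary entries and the whitespace splitting are just placeholders, and stemming/stopword filtering is left out:

    import com.google.common.collect.HashMultimap;
    import com.google.common.collect.Multimap;
    import com.google.common.collect.Multimaps;

    public class CategoryLookup {
        public static void main(String[] args) {
            // Category -> terms, as in the "dictionary" array #2.
            Multimap<String, String> categories = HashMultimap.create();
            categories.put("Fruits", "apple");
            categories.put("Fruits", "orange");
            categories.put("Fruits", "peach");
            categories.put("Items", "pen");
            categories.put("Items", "book");

            // Invert it so each term maps to its categories.
            Multimap<String, String> termToCategories =
                    Multimaps.invertFrom(categories, HashMultimap.<String, String>create());

            // Split a title into words and look each one up.
            String title = "An apple fell on Newton's head";
            for (String word : title.toLowerCase().split("\\W+")) {
                if (termToCategories.containsKey(word)) {
                    System.out.println(word + " -> " + termToCategories.get(word));
                }
            }
        }
    }

This prints apple -> [Fruits]; a stemmer would be applied to each word just before the lookup.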

Nathan Hughes
Stemming... Now that is something interesting which I had missed. If I can stem (is that what it's called?) the title "Apples grow on trees" to "Apple grow on tree" and tokenize that, I no longer need substring matching. HashSet's contains() method would give me what I need. Thanks Nathan Hughes. :D +1 for the stemming tip!
Inf.S
A: 

If you have only 200 terms to look for, regexps might actually work for you. Of course the regular expression is large, but if you compile it once and just reuse the compiled Pattern, the lookup time is probably linear in the combined length of all the strings in array #1, and I don't see how you can hope to do better than that.

So the algorithm would be: join the words of array #2 that you want to look for into the regular expression, compile it, and then find the matches in array #1.

(Regular expressions are compiled into a state machine; that is, on each character of the string it just does a table lookup for the next state. If the regular expression is complicated you might get backtracking that increases the time, but your regular expression has a very simple structure.)
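For example (the terms here are placeholders), the compile-once approach might look like the following sketch, which also reports which term actually matched:

    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexCategoryMatch {
        public static void main(String[] args) {
            List<String> fruitTerms = Arrays.asList("apple", "orange", "peach");

            // Build (apple|orange|peach) once; Pattern.quote guards against
            // regex metacharacters in the terms.
            StringBuilder alternation = new StringBuilder();
            for (String term : fruitTerms) {
                if (alternation.length() > 0) alternation.append('|');
                alternation.append(Pattern.quote(term));
            }
            Pattern fruits = Pattern.compile(alternation.toString(), Pattern.CASE_INSENSITIVE);

            for (String title : new String[] {"An apple fell on Newton's head", "Apples grow on trees"}) {
                Matcher m = fruits.matcher(title);
                if (m.find()) {
                    // m.group() is the text that actually matched, e.g. "apple".
                    System.out.println(title + " -> Fruits (matched: " + m.group() + ")");
                }
            }
        }
    }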

hstoerr
My regex is indeed simple. Just (apple|oranges|peach|...etc) for all keywords, and one regex per category. I was in doubt about its performance though. I did compile the pattern for reuse.
Inf.S
I don't completely understand what you want to do. But if you want to search all the strings in array #1 for anything that occurs in array #2, I'd probably make just ONE giant regexp with everything that occurs there, and search for that. Otherwise you have as many searches as categories. Everything found I'd then look up in a HashMap that maps the words to their categories. To see whether this is feasible, you can concatenate as many random words as you expect into such a giant regexp and check the time for searches.
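Extending the earlier sketch along those lines (again, placeholder data only): one combined pattern for every keyword, plus a HashMap from keyword back to its category:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class GiantRegexLookup {
        public static void main(String[] args) {
            // Keyword -> category, covering all categories at once.
            Map<String, String> categoryOf = new HashMap<String, String>();
            categoryOf.put("apple", "Fruits");
            categoryOf.put("orange", "Fruits");
            categoryOf.put("peach", "Fruits");
            categoryOf.put("pen", "Items");
            categoryOf.put("book", "Items");

            // ONE giant (apple|orange|peach|pen|book), compiled once.
            StringBuilder alternation = new StringBuilder();
            for (String keyword : categoryOf.keySet()) {
                if (alternation.length() > 0) alternation.append('|');
                alternation.append(Pattern.quote(keyword));
            }
            Pattern all = Pattern.compile(alternation.toString(), Pattern.CASE_INSENSITIVE);

            Matcher m = all.matcher("An apple fell on Newton's head, next to a pen");
            while (m.find()) {
                String keyword = m.group().toLowerCase();
                System.out.println(keyword + " -> " + categoryOf.get(keyword));
            }
        }
    }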
hstoerr