Hi all!

I am doing string matching on a large amount of data.

EDIT: I am matching words contained in a big list against some ontology text files. I take each file from the ontology and search for a match between the third string of each line in the file and any word from the list.

I made a mistake in overlooking the fact that what I need is not pure exact matching (the results are poor); I need a looser matching function that also returns results when one string is contained inside another string.

I did this with a radix trie; it was very fast and works nicely, but now I guess my work is useless because a trie returns only exact matches. :/

  • Are the algorithms that do this called string searching algorithms?
  • Can somebody suggest some Java implementations that they have experience with?

The algorithm should be fast, but speed is not the very top priority; I would accept a compromise between speed and complexity.

I am very grateful for all advice/examples/explanations/links!

Thank you!

A: 

I'm not entirely sure I understood the question correctly, but it sounds like regular expressions would do the job:

http://java.sun.com/developer/technicalArticles/releases/1.4regex/
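
To make the suggestion concrete, here is a minimal sketch of a "contains" check using java.util.regex. The class name RegexContains and the method containsWord are made up for illustration, and the sketch assumes the word should be matched literally and case-sensitively. For a purely literal word like this, String.contains gives the same answer, so this only shows the shape of the regex approach.

```java
import java.util.regex.Pattern;

final class RegexContains {
    // Hypothetical helper: true if `word` occurs anywhere inside `field`.
    static boolean containsWord(String field, String word) {
        // Pattern.quote escapes regex metacharacters so the word is taken literally.
        Pattern pattern = Pattern.compile(Pattern.quote(word));
        // find() looks for the pattern anywhere in the input, unlike matches().
        return pattern.matcher(field).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("cardiomyopathy", "myo"));   // prints true
        System.out.println(containsWord("cardiomyopathy", "neuro")); // prints false
    }
}
```

If the same word is checked against many lines, compiling the Pattern once and reusing it avoids recompiling on every call.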

Xzhsh
@Xzhsh I modified my explanation; I did not explain it clearly, sorry!
Julia
A: 

Regular expressions are definitely your best bet. They can be a little bit messy to write, but they're the only way to get looser matching without an incomprehensible series of if/else or switch statements.

Plus, they'll be a lot faster than the alternative.

thebackhand
@thebackhand I modified my explanation; I did not explain it clearly, sorry!
Julia
-1: Why are regexes 'best'? Why are the alternatives if/else or switch statements? What other alternatives did you consider before claiming that the alternatives are slower? I would say the performance of regexes will be quite bad! You have to compile them, and then there is possible backtracking during matching, etc.
Moron
Well, the way the question was originally phrased (pre-edit), that's the way I read it - obviously, it no longer applies!
thebackhand
+3  A: 

You might find Suffix Trees useful (they are similar in concept to Tries).

Prepend ^ to each string and append $, then build a suffix tree over all of these strings joined together. Space usage will be O(n), where n is the total length of the strings, and will probably be worse than what you had for the trie.

If you now need to search for a string s, you can easily do so in O(|s|) time, just like with a trie, and the match you get will be a substring match (basically, you will be matching a prefix of some suffix of some string).
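
The ^/$ idea can be illustrated with a small sketch. Note that this is NOT a real suffix tree: it is a naive suffix trie that inserts every suffix character by character, so it uses quadratic space per word rather than O(n), and it is only meant to show how one structure can answer both "contains" and "exact match" queries. The class and method names are made up for illustration, and it assumes the words themselves never contain ^ or $.

```java
import java.util.HashMap;
import java.util.Map;

final class NaiveSuffixTrie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
    }

    private final Node root = new Node();

    // Wraps the word as ^word$ and inserts every suffix of the wrapped string.
    void add(String word) {
        String wrapped = "^" + word + "$";
        for (int start = 0; start < wrapped.length(); start++) {
            Node node = root;
            for (int i = start; i < wrapped.length(); i++) {
                node = node.next.computeIfAbsent(wrapped.charAt(i), c -> new Node());
            }
        }
    }

    // Follows `query` from the root; success means the query is a prefix of
    // some stored suffix, i.e. a substring of some wrapped word.
    private boolean walk(String query) {
        Node node = root;
        for (int i = 0; i < query.length(); i++) {
            node = node.next.get(query.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return true;
    }

    // "Contains": search for the string as-is.
    boolean containsSubstring(String s) {
        return walk(s);
    }

    // "Exact match": search for the string wrapped in the same sentinels.
    boolean containsExact(String s) {
        return walk("^" + s + "$");
    }

    public static void main(String[] args) {
        NaiveSuffixTrie trie = new NaiveSuffixTrie();
        trie.add("cardiomyopathy");
        trie.add("heart");
        System.out.println(trie.containsSubstring("myo")); // prints true
        System.out.println(trie.containsExact("myo"));     // prints false
        System.out.println(trie.containsExact("heart"));   // prints true
    }
}
```

For a large word list, a proper suffix tree implementation (for example one built with Ukkonen's algorithm) would replace the naive insertion loop while keeping the same query logic.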

Sorry, I don't have a reference to a Java implementation handy.

Found a useful stackoverflow answer: http://stackoverflow.com/questions/969448/generalized-suffix-tree-java-implementation

Which has: http://illya-keeplearning.blogspot.com/2009/04/suffix-trees-java-ukkonens-algorithm.html

Which in turn has: Source Code: http://illya.yolasite.com/resources/suffix-tree.zip

Moron
@Moron: I think this might be exactly what I need. If I understand correctly, I can do both "match" and "contains" with the same tree?
Julia
@Julia: Yes exactly. If you want exact match, prepend your search string with ^ and append with $ and do the match. If you want contains, use the search string as-is.
Moron
@Moron: <sigh> It seems this would be perfect. There must be some Java library!!
Julia
@Julia: Check out the links I added to this answer.
Moron
@Moron: Thank you very much!
Julia
+1  A: 

You can use the Boyer–Moore (BM) algorithm to search the text files for a single pattern, and repeat it for each of the patterns in your list.

A better option is to use a multi-pattern search algorithm such as the Aho–Corasick string matching algorithm.
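
As a rough illustration of the multi-pattern approach, here is a compact Aho–Corasick sketch. It is a minimal version under stated assumptions, not a tuned library: the class name AhoCorasickSketch and the method findAll are made up for illustration, patterns are assumed to be non-empty, and matching is case-sensitive. It builds a trie of all patterns, adds failure links with a breadth-first pass, and then scans the text once, reporting every pattern that occurs as a substring.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.TreeSet;

final class AhoCorasickSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>(); // patterns ending at this state
        Node fail;                                      // longest proper suffix that is also a trie path
    }

    private final Node root = new Node();

    AhoCorasickSketch(List<String> patterns) {
        // Step 1: insert every pattern into a plain trie.
        for (String pattern : patterns) {
            Node node = root;
            for (char c : pattern.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(pattern);
        }
        // Step 2: breadth-first pass to set failure links and merge outputs.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> entry : node.next.entrySet()) {
                char c = entry.getKey();
                Node child = entry.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // Patterns that end at the failure state also end here.
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Returns every pattern that occurs somewhere inside `text`.
    Set<String> findAll(String text) {
        Set<String> found = new TreeSet<>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node nextNode = node.next.get(c);
            node = (nextNode != null) ? nextNode : root;
            found.addAll(node.outputs);
        }
        return found;
    }

    public static void main(String[] args) {
        AhoCorasickSketch matcher = new AhoCorasickSketch(List.of("he", "she", "his", "hers"));
        System.out.println(matcher.findAll("ushers")); // prints [he, hers, she]
    }
}
```

The automaton is built once from the whole word list, and findAll can then be called on any string, so the cost per input is proportional to its length plus the number of matches rather than to the number of patterns.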

Wajdy Essam
@Wajdy Essam: http://johannburkard.de/software/stringsearch/ ? You say searching in text files, but I do not need to match anywhere in the text file, only against the third string of each line. Can that be specified? (Sorry for the details; I am afraid of rushing into something like I did with the radix trie.)
Julia
The BM algorithm matches any string, regardless of where the strings come from (text in a file, a cell in a database, etc.).
Wajdy Essam