views:

73

answers:

4

Let's assume text is a String and contains a text. tags is an Array of Strings.

text = <<-EOS
Lorem ipsum dolor sit amet, consectetur adipisicing elit, 
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris 
nisi ut aliquip ex ea commodo consequat. 
Duis aute irure dolor in reprehenderit in voluptate velit 
esse cillum dolore eu fugiat nulla pariatur. 
Excepteur sint occaecat cupidatat non proident, 
sunt in culpa qui officia deserunt mollit anim id est laborum.
EOS

tags = [ "some", "in", "dolor", "laborum", "missing" ]

The algorithm should return all tags which are contained at least once in text. In the example above

[ "in", "dolor", "laborum" ]

The resulting array is not required to be sorted. Also, I don't actually need to know the number of occurrences of each tag in text.

I came with a few solutions, but none of them really convinced me. Any suggestion?

+1  A: 

What you are doing is a very mini version of a search engine. Your data is small enough you could just plow through it, split on spaces, for each string you want to find. As your text grows to 100s of pages long, this becomes not so good.

There is some crazy stuff you can do to make this faster. If you look at the source code for Lucene (http://lucene.apache.org/java/docs/index.html), you would likley get some hints.. as that is basically what base mode for lucene is for (find matches of text X in giant text Y). Internally I am not 100% sure what it does, but I feel like it is something along the lines of scanning the entire giant text, and building giant hashtables of word occurance and locations. So it would prescan and build a list of every word that can occur... and then you can ask it really quickly if "dolor" is in the text.

bwawok
+2  A: 
text.gsub!(/[[:punct:]]/,"").split
p tags.select{|x| x if text.include?(x)}
ghostdog74
Simone Carletti
A: 
hsh = {}
text.gsub(/[[:punct:]]/,"").split.each {|t| hsh[t]=true}
tags.select{|x| hsh.has_key?(x)}

I'm not sure how fast hashing is.

Tass
A: 

This is a well studied problem: multiple string pattern matching, of which many good solutions exist in the literature. Aho-Corasick provides a worst case optimal forward matching algorithm (ie execution complexity of O(|P|+|T|), where |P| is the sum of the length of all the strings you want to match) and |T| is the length of the text you match against). The set-backwards oracle matching (SBOM) algorithm is an example of a good backwards matching algorithm that has O(|P|X|T|) worst case complexity, but performs better than Aho-Corasick on average.

fritzv