ansaurus

Question

Given a String and an array of Strings, how do I efficiently calculate the occurrences of the array's elements in the String?

Answer 1

+1 A:

What you are doing is a miniature version of a search engine. Your data is small enough that you could just plow through it: split the text on spaces and scan the words once for each string you want to find. As your text grows to hundreds of pages, this stops scaling well.
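The plow-through approach could look roughly like the sketch below. The method name naive_counts is made up for illustration, and it assumes the tags are single words separated by spaces in the text (punctuation is not stripped here).

```ruby
# Brute-force sketch: one full pass over the words for every tag.
# naive_counts is a hypothetical helper name, not taken from the question.
def naive_counts(text, tags)
  words = text.split(" ")
  tags.each_with_object({}) do |tag, counts|
    # Array#count walks the whole word list, so the total cost grows with
    # (number of tags) * (number of words).
    counts[tag] = words.count(tag)
  end
end

result = naive_counts("lorem ipsum dolor sit amet dolor", %w[dolor amet missing])
puts result["dolor"]   # 2
puts result["missing"] # 0
```

This is fine for a paragraph of text and a handful of tags, which is the situation described above.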

There are techniques that make this much faster. If you look at the source code for Lucene (http://lucene.apache.org/java/docs/index.html), you would likely get some hints, as that is basically what Lucene's basic mode is for (finding matches of text X in a giant text Y). I am not 100% sure what it does internally, but I believe it is something along the lines of scanning the entire text once and building large hash tables of word occurrences and their locations. So it prescans and builds a list of every word that occurs, and then you can ask it very quickly whether "dolor" is in the text.
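The prescan idea can be sketched in a few lines of Ruby. This is only an illustration of a word-to-positions hash table, not Lucene's actual implementation; build_index is a hypothetical helper, and lowercasing plus splitting on alphanumeric runs are assumptions made for the example.

```ruby
# Sketch of a tiny inverted index: each word maps to the word offsets
# at which it occurs. build_index is an illustrative name only.
def build_index(text)
  index = Hash.new { |hash, word| hash[word] = [] }
  # Lowercase the text and pull out runs of letters/digits, so punctuation
  # is ignored (an assumption of this sketch).
  text.downcase.scan(/[[:alnum:]]+/).each_with_index do |word, position|
    index[word] << position
  end
  index
end

index = build_index("Lorem ipsum dolor sit amet, dolor sit.")

puts index.key?("dolor")            # true
puts index.fetch("dolor", []).size  # 2
p index.fetch("dolor", [])          # [2, 5]
```

Building the index costs one pass over the text; after that, each query is a single hash lookup regardless of how long the text is. fetch with a default is used for queries so that asking about a missing word does not add an empty entry to the index.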

bwawok 2010-10-08 12:22:41

Answer 2

+2 A:

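# Use gsub rather than gsub!: gsub! returns nil when nothing was replaced.
clean = text.gsub(/[[:punct:]]/, "")
# Substring test, so a tag also matches inside a longer word.
p tags.select { |x| clean.include?(x) }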

ghostdog74 2010-10-08 12:25:25

Simone Carletti 2010-10-09 15:10:43

Answer 3

A:

# Index every word of the text once; each tag then costs a single hash lookup.
hsh = {}
text.gsub(/[[:punct:]]/, "").split.each { |t| hsh[t] = true }
tags.select { |x| hsh.key?(x) }

I'm not sure how fast hashing is.
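For what it's worth, Ruby's Hash gives constant-time lookups on average, so this approach is one pass over the text plus one lookup per tag. If the goal is the number of occurrences rather than just presence, the same idea works with a counting hash. A rough sketch, assuming tags are single words (count_tags is a made-up name):

```ruby
# Sketch: tally every word once, then read off the count for each tag.
# count_tags is a hypothetical helper name.
def count_tags(text, tags)
  counts = Hash.new(0) # missing words read as 0 without being stored
  text.gsub(/[[:punct:]]/, "").split.each { |word| counts[word] += 1 }
  tags.each_with_object({}) { |tag, result| result[tag] = counts[tag] }
end

result = count_tags("lorem ipsum, dolor sit amet. dolor!", %w[dolor ipsum missing])
puts result["dolor"]   # 2
puts result["ipsum"]   # 1
puts result["missing"] # 0
```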

Tass 2010-10-08 12:57:30

Answer 4

A:

This is a well-studied problem, multiple string pattern matching, for which many good solutions exist in the literature. Aho-Corasick provides a worst-case optimal forward matching algorithm, i.e. an execution complexity of O(|P|+|T|) plus the number of matches reported, where |P| is the sum of the lengths of all the strings you want to match and |T| is the length of the text you match against. The set-backwards oracle matching (SBOM) algorithm is an example of a good backwards matching algorithm that has O(|P|×|T|) worst-case complexity, but performs better than Aho-Corasick on average.
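To make the Aho-Corasick idea concrete, here is a compact Ruby sketch: a trie of the patterns, failure links computed breadth-first, and a single left-to-right scan of the text. The class and method names are illustrative, and the code favors readability over the tight bounds quoted above (for instance, output lists are copied during construction).

```ruby
# Sketch of an Aho-Corasick matcher; names are illustrative, not from a library.
class AhoCorasick
  def initialize(patterns)
    @goto   = [{}]  # @goto[state] maps a character to the next trie state
    @fail   = [0]   # @fail[state] is the state for the longest proper suffix in the trie
    @output = [[]]  # @output[state] lists every pattern that ends at this state
    # Empty and duplicate patterns are dropped (an assumption of this sketch).
    patterns.reject(&:empty?).uniq.each { |pattern| insert(pattern) }
    build_failure_links
  end

  # Returns a hash of pattern => number of occurrences, overlaps included.
  def count(text)
    counts = Hash.new(0)
    state = 0
    text.each_char do |char|
      # Follow failure links until some state can consume this character.
      state = @fail[state] while state != 0 && !@goto[state].key?(char)
      state = @goto[state].fetch(char, 0)
      @output[state].each { |pattern| counts[pattern] += 1 }
    end
    counts
  end

  private

  # Add one pattern to the trie, creating states as needed.
  def insert(pattern)
    state = 0
    pattern.each_char do |char|
      unless @goto[state].key?(char)
        @goto << {}
        @fail << 0
        @output << []
        @goto[state][char] = @goto.size - 1
      end
      state = @goto[state][char]
    end
    @output[state] << pattern
  end

  # Breadth-first pass: a state's failure link is known before its children are visited.
  def build_failure_links
    queue = @goto[0].values # depth-1 states keep the root as their failure link
    until queue.empty?
      state = queue.shift
      @goto[state].each do |char, child|
        fallback = @fail[state]
        fallback = @fail[fallback] while fallback != 0 && !@goto[fallback].key?(char)
        @fail[child] = @goto[fallback].fetch(char, 0)
        # Inherit the patterns that end at the failure state (they are suffixes of this one).
        @output[child] += @output[@fail[child]]
        queue << child
      end
    end
  end
end

matcher = AhoCorasick.new(%w[he she his hers])
p matcher.count("ushers")
```

For the text "ushers" this reports "she", "he" and "hers" once each, and "his" not at all. Unlike the word-splitting answers, matching here is on substrings, so a pattern is also found inside longer words.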

fritzv 2010-10-09 18:47:11
