Here's a response to the movement of the goalposts ("I probably need the regex because I'll need word delimiters in the near future"):
This method parses the text once to obtain a list of all the "words". Each word is looked up in a dictionary of the target words, and if it is a target word it is counted. The time taken is O(P) + O(T) where P is the size of the paragraph and T is the number of target words. All other solutions to date (including the currently accepted solution) except my Aho-Corasick solution are O(PT).
import re

def counts_all(targets, paragraph, word_regex=r"\w+"):
    # Tally only the target words; every other word is skipped.
    tally = dict((target, 0) for target in targets)
    for word in re.findall(word_regex, paragraph):
        if word in tally:
            tally[word] += 1
    return [tally[target] for target in targets]
def counts_iter(targets, paragraph, word_regex=r"\w+"):
    tally = dict((target, 0) for target in targets)
    for matchobj in re.finditer(word_regex, paragraph):
        word = matchobj.group()
        if word in tally:
            tally[word] += 1
    return [tally[target] for target in targets]
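For example, with some made-up inputs (the targets and paragraph here are illustrative only):

targets = ["cat", "dog", "fish"]
paragraph = "the cat chased the dog; the dog ignored the cat"
print(counts_all(targets, paragraph))   # -> [2, 2, 0]
print(counts_iter(targets, paragraph))  # -> [2, 2, 0]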
The finditer version is a strawman -- it's much slower than the findall version.
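A rough way to check that for yourself (a sketch, assuming counts_all and counts_iter are defined in the running script; absolute numbers will vary by machine and Python version):

import timeit

setup = """
from __main__ import counts_all, counts_iter
paragraph = "the quick brown fox jumps over the lazy dog " * 1000
targets = ["fox", "dog", "cat"]
"""
print(timeit.timeit("counts_all(targets, paragraph)", setup=setup, number=100))
print(timeit.timeit("counts_iter(targets, paragraph)", setup=setup, number=100))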
Here's the currently accepted solution expressed in a standardised form and augmented with word delimiters:
def currently_accepted_solution_augmented(targets, paragraph):
    def tester(s):
        def f(x):
            return len(re.findall(r"\b" + x + r"\b", s))
        return f
    return list(map(tester(paragraph), targets))
which goes overboard on closures and could be reduced to:
# acknowledgement:
# this is structurally the same as one of hughdbrown's benchmark functions
def currently_accepted_solution_augmented_without_extra_closure(targets, paragraph):
    def tester(x):
        return len(re.findall(r"\b" + x + r"\b", paragraph))
    return list(map(tester, targets))
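Both closure variants agree with counts_all on the same made-up inputs:

print(currently_accepted_solution_augmented(targets, paragraph))                        # [2, 2, 0]
print(currently_accepted_solution_augmented_without_extra_closure(targets, paragraph))  # [2, 2, 0]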
All variations on the currently accepted solution are O(PT). Unlike the currently accepted solution, the regex search with word delimiters is not equivalent to a simple paragraph.find(target). Because the re engine doesn't use the "fast search" in this case, adding the word delimiters changes it from slow to very slow.
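One way to observe the effect (a timing sketch with made-up data; the exact ratio depends on the input):

import re
import timeit

setup = 'import re; paragraph = "xyzzy plugh " * 100000'
# plain literal pattern: the re engine can use its fast substring scan
print(timeit.timeit('re.findall("xyzzy", paragraph)', setup=setup, number=10))
# word-delimited pattern: the leading \b rules out that fast path
print(timeit.timeit(r're.findall(r"\bxyzzy\b", paragraph)', setup=setup, number=10))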