I'm trying to get the number of times each item in a list appears in a string in Python:

import re

paragraph = "I eat bananas and a banana"

def tester(x): return len(re.findall(x, paragraph))

map(tester, ['banana', 'loganberry', 'passion fruit'])

This returns [2, 0, 0].

What I'd like to do, however, is extend this so I can feed the paragraph value into map(). Right now, the tester() function has paragraph hardcoded. Does anybody have a way to do this (perhaps by making an n-length list of paragraph values)? Any other ideas?

Keep in mind that each of the array values will have a weight at some point in the future - hence the need to keep the values in a list rather than crunching them all together.

UPDATE: The paragraph will often be 20K and the list will often have 200+ members. My thinking is that map operates in parallel, so it would be much more efficient than any serial method.

+8  A: 

A closure would be a quick solution:

import re

paragraph = "I eat bananas and a banana"

def tester(s):
    def f(x):
        return len(re.findall(x, s))
    return f

print map(tester(paragraph), ['banana', 'loganberry', 'passion fruit'])
oggy
Thumbs up to that. It also allows for increasing functionality if things change subtly. Of course, if he does much more than that, he should probably rethink so he doesn't make a bowl of spaghetti like I would :)
JohnMetta
This works (like the others) but I'm pretty sure this will be the most efficient for larger sets and paragraphs. And, as mettadore says, reasonably extensible.
Adam Nelson
Update: this method is quite fast - 230 ms for a 102K-character paragraph and 600 list members on a previous-generation Mac mini with Python 2.5.4.
Adam Nelson
Now try it with `len(re.findall(x,s))` replaced by `s.count(x)`
John Machin
Will do, although I probably need the regex because I'll need word delimiters in the near future.
Adam Nelson
+2  A: 

I know you didn't ask for list comprehension, but here it is anyway:

import re

paragraph = "I eat bananas and a banana"
words = ['banana', 'loganberry', 'passion fruit']
[len(re.findall(word, paragraph)) for word in words]

This returns [2, 0, 0] as well.

Dan Lorenc
A list comprehension is completely dual to map.
TokenMacGuy
I believe this will go over the paragraph serially. I probably should have made it clear, but this needs to be very efficient (which is why I was thinking map would be best - I could be very wrong though).
Adam Nelson
@TokenMacGuy: "completely dual" is interesting terminology. The map function has functionality that list comprehensions don't have, and vice versa. About all you can say is that `map(callable, iterable)` is functionally equivalent to `[callable(x) for x in iterable]`
John Machin
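For example, the equivalence John Machin describes (a trivial illustration, not from the thread):

words = ['banana', 'loganberry', 'passion fruit']
assert map(len, words) == [len(w) for w in words]   # both give [6, 10, 13]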
+3  A: 
targets = ['banana', 'loganberry', 'passion fruit']
paragraph = "I eat bananas and a banana"

print [paragraph.count(target) for target in targets]

No idea why you would use map() here.

hughdbrown
Good option but I'm concerned about speed with this. The paragraph will often be 20K and the list will often have 200+ members.
Adam Nelson
I ran this with paragraph length = 26000, targets = 100 each of banana, loganberry, passionfruit. I tried map versus list comprehension and len(regex.findall()) versus string.count(). Regex was about 4-5 times slower than string.count. map was 25% faster than list comprehension. If you are interested in speed, don't use the regex.
hughdbrown
Good advice. In the end I think I'll be forced to do the regex because I will need to include word delimiters in the near future. I was trying to abstract away unnecessary issues from the question. Thanks though - great answers all around.
Adam Nelson
And the solution with the top votes and your vote as best solution -- slow. About the same speed as the solution you wrote to start this off. About 5 times slower than a solution that uses string.count, either with map or list comprehension. See benchmark code: http://github.com/hughdbrown/StackOverflow/blob/5c50553a59071b403787350c1d3153bebfb35918/1168777.py
hughdbrown
@Adam Nelson: If you are searching for words not arbitrary strings, you should say so up front. You have abstracted away a NECESSARY issue. Another: the large number of query words/strings was not mentioned initially. Your new "words" requirement may be met by parsing the text into words once and doing a dict lookup on each text word -- quite different to changing your current "best answer" to something which would require an re.compile() for each query word.
John Machin
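For reference, a sketch of how the regex-versus-count comparison hughdbrown describes might be timed (the setup below is illustrative, not his actual benchmark script):

import re
import timeit

paragraph = "I eat bananas and a banana " * 1000      # roughly 26K characters
targets = ['banana', 'loganberry', 'passion fruit'] * 100

def with_regex():
    return [len(re.findall(t, paragraph)) for t in targets]

def with_count():
    return [paragraph.count(t) for t in targets]

print timeit.timeit(with_regex, number=10)   # expect this one to be several times slower
print timeit.timeit(with_count, number=10)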
A: 

Here's my version.

import re

paragraph = "I eat bananas and a banana"

def tester(paragraph, x): return len(re.findall(x, paragraph))

print (lambda paragraph: map(
    lambda x: tester(paragraph, x), ['banana', 'loganberry', 'passion fruit']
))(paragraph)
TokenMacGuy
+2  A: 

This is basically just going out of your way to avoid a list comprehension, but if you like functional-style programming, then you'll like functools.partial.

>>> import re
>>> from functools import partial
>>> def counter(text, paragraph):
...     return len(re.findall(text, paragraph))
...
>>> tester = partial(counter, paragraph="I eat bananas and a banana")
>>> map(tester, ['banana', 'loganberry', 'passion fruit'])
[2, 0, 0]
Coady
+1  A: 

For Q query words of average length L bytes on large texts of size T bytes, you need something that's NOT O(QLT). You need a DFA-style approach which can give you O(T) ... after setup costs. If your query set is rather static, then the setup cost can be ignored.

E.g. http://en.wikipedia.org/wiki/Aho-Corasick_algorithm
which points to a C-extension for Python:
http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

John Machin
It looks like the maintained implementation of this in Python is pytst. Thanks very much for the lead: http://github.com/nlehuen/pytst/tree/master
Adam Nelson
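For reference, a minimal sketch of the Aho-Corasick approach using the pyahocorasick package (an assumption on my part - the C extension linked above and pytst both expose different APIs):

import ahocorasick   # pip install pyahocorasick (assumed)

targets = ['banana', 'loganberry', 'passion fruit']
paragraph = "I eat bananas and a banana"

# Build the automaton once; this is the setup cost mentioned above.
automaton = ahocorasick.Automaton()
for target in targets:
    automaton.add_word(target, target)
automaton.make_automaton()

# A single O(T) pass over the text counts all targets at once.
tally = dict((target, 0) for target in targets)
for end_index, found in automaton.iter(paragraph):
    tally[found] += 1

print [tally[target] for target in targets]   # [2, 0, 0]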
+1  A: 

Here's a response to the movement of the goalposts ("I probably need the regex because I'll need word delimiters in the near future"):

This method parses the text once to obtain a list of all the "words". Each word is looked up in a dictionary of the target words, and if it is a target word it is counted. The time taken is O(P) + O(T) where P is the size of the paragraph and T is the number of target words. All other solutions to date (including the currently accepted solution) except my Aho-Corasick solution are O(PT).

import re

def counts_all(targets, paragraph, word_regex=r"\w+"):
    tally = dict((target, 0) for target in targets)
    for word in re.findall(word_regex, paragraph):
        if word in tally:
            tally[word] += 1
    return [tally[target] for target in targets]

def counts_iter(targets, paragraph, word_regex=r"\w+"):
    tally = dict((target, 0) for target in targets)
    for matchobj in re.finditer(word_regex, paragraph):
        word = matchobj.group()
        if word in tally:
            tally[word] += 1
    return [tally[target] for target in targets]

The finditer version is a strawman -- it's much slower than the findall version.

Here's the currently accepted solution expressed in a standardised form and augmented with word delimiters:

def currently_accepted_solution_augmented(targets, paragraph):
    def tester(s): 
        def f(x):
            return len(re.findall(r"\b" + x + r"\b", s))
        return f
    return map(tester(paragraph), targets)

which goes overboard on closures and could be reduced to:

# acknowledgement:
# this is structurally the same as one of hughdbrown's benchmark functions
def currently_accepted_solution_augmented_without_extra_closure(targets, paragraph):
    def tester(x):
        return len(re.findall(r"\b" + x + r"\b", paragraph))
    return map(tester, targets)

All variations on the currently accepted solution are O(PT). Unlike the currently accepted solution, the regex search with word delimiters is not equivalent to a simple paragraph.find(target). Because the re engine doesn't use the "fast search" in this case, adding the word delimiters changes it from slow to very slow.

John Machin
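To see the effect described above, a quick timing sketch (illustrative only):

import re
import timeit

paragraph = "I eat bananas and a banana " * 1000

# A plain literal pattern lets the re engine use its fast substring scan.
plain = timeit.timeit(lambda: re.findall("banana", paragraph), number=100)

# A pattern starting with \b defeats that optimisation.
delimited = timeit.timeit(lambda: re.findall(r"\bbanana\b", paragraph), number=100)

print plain, delimited   # expect the delimited version to be markedly slower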
I think it's weird to switch 'accepted answers', but since StackOverflow lets me do it, I presume they want the best answers to percolate to the top even when the question has been modified and an answer has previously been chosen. Anyway, at this point, this is the best answer (by far - with that timer script on GitHub especially). Cheers.
Adam Nelson
Ummm - the timer script on GitHub was the work of hughdbrown, not me.
John Machin
Oops, I meant to attribute the GitHub script to @hughdbrown - nonetheless, this is the most appropriate answer.
Adam Nelson