I'm trying to get the number of times each item in a list appears in a string in Python:

import re

paragraph = "I eat bananas and a banana"

def tester(x): return len(re.findall(x, paragraph))

map(tester, ['banana', 'loganberry', 'passion fruit'])

This returns [2, 0, 0].

What I'd like to do, however, is extend this so I can feed the paragraph value into map(). Right now, the tester() function has paragraph hardcoded. Does anybody have a way to do this (perhaps by making an n-length list of paragraph values)? Any other ideas?

Keep in mind that each of the array values will have a weight at some point in the future - hence the need to keep the values in a list rather than crunching them all together.

UPDATE: The paragraph will often be 20K and the list will often have 200+ members. My thinking is that map operates in parallel, so it would be much more efficient than any serial method.

+8  A: 

A closure would be a quick solution:

import re

paragraph = "I eat bananas and a banana"

def tester(s):
    def f(x):
        return len(re.findall(x, s))
    return f

print map(tester(paragraph), ['banana', 'loganberry', 'passion fruit'])
oggy
Thumbs up to that. It also allows for increasing functionality if things change subtly. Of course, if he does much more than that, he should probably rethink so he doesn't make a bowl of spaghetti like I would :)
JohnMetta
This works (like the others) but I'm pretty sure this will be the most efficient for larger sets and paragraphs. And, as mettadore says, reasonably extensible.
Adam Nelson
Update: this method is quite fast - 230 ms for a 102K-character paragraph and 600 list members on a previous-generation Mac mini with Python 2.5.4.
Adam Nelson
Now try it with `len(re.findall(x,s))` replaced by `s.count(x)`
John Machin
Will do, although I probably need the regex because I'll need word delimiters in the near future.
Adam Nelson
+2  A: 

I know you didn't ask for list comprehension, but here it is anyway:

import re

paragraph = "I eat bananas and a banana"
words = ['banana', 'loganberry', 'passion fruit']
[len(re.findall(word, paragraph)) for word in words]

This returns [2, 0, 0] as well.

Dan Lorenc
A list comprehension is completely dual to map.
TokenMacGuy
I believe this will go over the paragraph serially. I probably should have made it clear, but this needs to be very efficient (which is why I was thinking map would be best - I could be very wrong though).
Adam Nelson
@TokenMacGuy: "completely dual" is interesting terminology. The map function has functionality that list comprehensions don't have, and vice versa. About all you can say is that `map(callable, iterable)` is functionally equivalent to `[callable(x) for x in iterable]`
John Machin
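For example, the equivalence John Machin describes (a trivial illustration, not from the thread):

words = ['banana', 'loganberry', 'passion fruit']
assert map(len, words) == [len(w) for w in words]   # both give [6, 10, 13]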
+3  A: 
targets = ['banana', 'loganberry', 'passion fruit']
paragraph = "I eat bananas and a banana"

print [paragraph.count(target) for target in targets]

No idea why you would use map() here.

hughdbrown
Good option but I'm concerned about speed with this. The paragraph will often be 20K and the list will often have 200+ members.
Adam Nelson
I ran this with paragraph length = 26000, targets = 100 each of banana, loganberry, passionfruit. I tried map versus list comprehension and len(regex.findall()) versus string.count(). Regex was about 4-5 times slower than string.count. map was 25% faster than list comprehension. If you are interested in speed, don't use the regex.
hughdbrown
Good advice. In the end I think I'll be forced to do the regex because I will need to include word delimiters in the near future. I was trying to abstract away unnecessary issues from the question. Thanks though - great answers all around.
Adam Nelson
And the solution with the top votes and your vote as best solution -- slow. About the same speed as the solution you wrote to start this off. About 5 times slower than a solution that uses string.count, either with map or list comprehension. See benchmark code: http://github.com/hughdbrown/StackOverflow/blob/5c50553a59071b403787350c1d3153bebfb35918/1168777.py
hughdbrown
@Adam Nelson: If you are searching for words not arbitrary strings, you should say so up front. You have abstracted away a NECESSARY issue. Another: the large number of query words/strings was not mentioned initially. Your new "words" requirement may be met by parsing the text into words once and doing a dict lookup on each text word -- quite different to changing your current "best answer" to something which would require an re.compile() for each query word.
John Machin
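For reference, a sketch of how the regex-versus-count comparison hughdbrown describes might be timed (the setup below is illustrative, not his actual benchmark script):

import re
import timeit

paragraph = "I eat bananas and a banana " * 1000      # roughly 26K characters
targets = ['banana', 'loganberry', 'passion fruit'] * 100

def with_regex():
    return [len(re.findall(t, paragraph)) for t in targets]

def with_count():
    return [paragraph.count(t) for t in targets]

print timeit.timeit(with_regex, number=10)   # expect this one to be several times slower
print timeit.timeit(with_count, number=10)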
A: 

Here's my version.

import re

paragraph = "I eat bananas and a banana"

def tester(paragraph, x): return len(re.findall(x, paragraph))

print (lambda paragraph: map(
    lambda x: tester(paragraph, x), ['banana', 'loganberry', 'passion fruit']
))(paragraph)
TokenMacGuy
+2  A: 

This is basically just going out of your way to avoid a list comprehension, but if you like functional-style programming, then you'll like functools.partial.

>>> import re
>>> from functools import partial
>>> def counter(text, paragraph):
...     return len(re.findall(text, paragraph))
...
>>> tester = partial(counter, paragraph="I eat bananas and a banana")
>>> map(tester, ['banana', 'loganberry', 'passion fruit'])
[2, 0, 0]
Coady
+1  A: 

For Q query words of average length L bytes on large texts of size T bytes, you need something that's NOT O(QLT). You need a DFA-style approach which can give you O(T) ... after setup costs. If your query set is rather static, then the setup cost can be ignored.

E.g. http://en.wikipedia.org/wiki/Aho-Corasick_algorithm
which points to a C-extension for Python:
http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

John Machin
It looks like the maintained implementation of this in Python is pytst. Thanks very much for the lead: http://github.com/nlehuen/pytst/tree/master
Adam Nelson
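For reference, a minimal sketch of the Aho-Corasick approach using the pyahocorasick package (an assumption on my part - the C extension linked above and pytst both expose different APIs):

import ahocorasick   # pip install pyahocorasick (assumed)

targets = ['banana', 'loganberry', 'passion fruit']
paragraph = "I eat bananas and a banana"

# Build the automaton once; this is the setup cost mentioned above.
automaton = ahocorasick.Automaton()
for target in targets:
    automaton.add_word(target, target)
automaton.make_automaton()

# A single O(T) pass over the text counts all targets at once.
tally = dict((target, 0) for target in targets)
for end_index, found in automaton.iter(paragraph):
    tally[found] += 1

print [tally[target] for target in targets]   # [2, 0, 0]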
+1  A: 

Here's a response to the movement of the goalposts ("I probably need the regex because I'll need word delimiters in the near future"):

This method parses the text once to obtain a list of all the "words". Each word is looked up in a dictionary of the target words, and if it is a target word it is counted. The time taken is O(P) + O(T) where P is the size of the paragraph and T is the number of target words. All other solutions to date (including the currently accepted solution) except my Aho-Corasick solution are O(PT).

import re

def counts_all(targets, paragraph, word_regex=r"\w+"):
    tally = dict((target, 0) for target in targets)
    for word in re.findall(word_regex, paragraph):
        if word in tally:
            tally[word] += 1
    return [tally[target] for target in targets]

def counts_iter(targets, paragraph, word_regex=r"\w+"):
    tally = dict((target, 0) for target in targets)
    for matchobj in re.finditer(word_regex, paragraph):
        word = matchobj.group()
        if word in tally:
            tally[word] += 1
    return [tally[target] for target in targets]

The finditer version is a strawman -- it's much slower than the findall version.

Here's the currently accepted solution expressed in a standardised form and augmented with word delimiters:

def currently_accepted_solution_augmented(targets, paragraph):
    def tester(s): 
        def f(x):
            return len(re.findall(r"\b" + x + r"\b", s))
        return f
    return map(tester(paragraph), targets)

which goes overboard on closures and could be reduced to:

# acknowledgement:
# this is structurally the same as one of hughdbrown's benchmark functions
def currently_accepted_solution_augmented_without_extra_closure(targets, paragraph):
    def tester(x):
        return len(re.findall(r"\b" + x + r"\b", paragraph))
    return map(tester, targets)

All variations on the currently accepted solution are O(PT). Unlike the currently accepted solution, the regex search with word delimiters is not equivalent to a simple paragraph.find(target). Because the re engine doesn't use the "fast search" in this case, adding the word delimiters changes it from slow to very slow.

John Machin
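To see the effect described above, a quick timing sketch (illustrative only):

import re
import timeit

paragraph = "I eat bananas and a banana " * 1000

# A plain literal pattern lets the re engine use its fast substring scan.
plain = timeit.timeit(lambda: re.findall("banana", paragraph), number=100)

# A pattern starting with \b defeats that optimisation.
delimited = timeit.timeit(lambda: re.findall(r"\bbanana\b", paragraph), number=100)

print plain, delimited   # expect the delimited version to be markedly slower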
I think it's weird to switch 'accepted answers', but since StackOverflow lets me do it, I presume they want the best answers to percolate to the top even when the question has been modified and an answer has previously been chosen. Anyway, at this point, this is the best answer (by far - with that timer script on GitHub especially). Cheers.
Adam Nelson
Ummm - the timer script on GitHub was the work of hughdbrown, not me.
John Machin
Oops, I meant to attribute the GitHub script to @hughdbrown - nonetheless, this is the most appropriate answer.
Adam Nelson