views:

1145

answers:

7

This seems like it should be pretty trivial, but I am new at Python and want to do it the most Pythonic way.

I want to find the n'th occurrence of a substring in a string.

There's got to be something equivalent to what I WANT to do which is

mystring.find("substring", 2nd)

How can you achieve this in Python?

+1  A: 

I'd probably do something like this, using the find function that takes an index parameter:

def find_nth(s, x, n):
    i = -1
    for _ in range(n):
        i = s.find(x, i + len(x))
        if i == -1:
            break
    return i

print find_nth('bananabanana', 'an', 3)

It's not particularly Pythonic I guess, but it's simple. You could do it using recursion instead:

def find_nth(s, x, n, i = 0):
    i = s.find(x, i)
    if n == 1 or i == -1:
        return i 
    else:
        return find_nth(s, x, n - 1, i + len(x))

print find_nth('bananabanana', 'an', 3)

It's a functional way to solve it, but I don't know if that makes it more Pythonic.

Mark Byers
`for _ in xrange(n):` can be used instead of `while n: ... n-=1`
J.F. Sebastian
@J.F. Sebastian: Yeah, I guess that's a little more Pythonic. I'll update.
Mark Byers
BTW: xrange is no longer needed in Python 3: http://diveintopython3.org/porting-code-to-python-3-with-2to3.html#xrange
Mark Byers
`return find_nth(s, x, n - 1, i + 1)` should be `return find_nth(s, x, n - 1, i + len(x))`. Not a big deal, but saves some computation time.
Dan Loewenherz
@dlo: Actually that can give different results in some cases: find_nth('aaaa','aa',2). Mine gives 1, yours gives 2. I guess yours is actually what the poster wants. I'll update my code. Thanks for the comment.
Mark Byers
+7  A: 

Mark's iterative approach would be the usual way, I think.

Here's an alternative with string-splitting, which can often be useful for finding-related processes:

def findnth(haystack, needle, n):
    parts= haystack.split(needle, n+1)
    if len(parts)<=n+1:
        return -1
    return len(haystack)-len(parts[-1])-len(needle)

And here's a quick (and somewhat dirty, in that you have to choose some chaff that can't match the needle) one-liner:

'foo bar bar bar'.replace('bar', 'XXX', 1).find('bar')
bobince
+1, excellent idea!
Alex Martelli
The first suggestion is going to be very inefficient for large strings when the match you're interested is near the beginning. It always looks at the whole string. It's clever but I wouldn't recommend this to someone who is new to Python and just wants to learn a good way to do it.
Mark Byers
Thanks, I like your one liner. I don't think it's the most instantly readable thing in the world, but it's not much worse then most others below
prestomation
+1 for the one-liner, this should help me right now. I had been thinking of doing the equivalent of `.rfind('XXX')`, but that would fall apart if `'XXX'` appears later in the input anyway.
Nikhil Chelliah
+5  A: 

Understanding that regex is not always the best solution, I'd probably use one here:

>>> import re
>>> s = "ababdfegtduab"
>>> [m.start() for m in re.finditer(r"ab",s)]
[0, 2, 11]
>>> [m.start() for m in re.finditer(r"ab",s)][2] #index 2 is third occurrence 
11
Mark Peters
The risk here of course is that the string to search for will contain special characters that will cause the regex to do something you didn't want. Using re.escape should solve this.
Mark Byers
This is clever, but is it really Pythonic? Seems like overkill for just finding the nth occurrence of a substring, and it's not exactly easy to read. Also, like you say, you have to import all of re for this
tgamblin
+1  A: 

Here is another approach using re.finditer.
The difference is that this only looks into the haystack as far as necessary

from re import finditer
from itertools import dropwhile
needle='an'
haystack='bananabanana'
n=2
next(dropwhile(lambda x: x[0]<n, enumerate(re.finditer(needle,haystack))))[1].start()
gnibbler
+5  A: 

Here's a more Pythonic version of the straightforward iterative solution:

def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

Example:

>>> find_nth("foofoofoofoo", "foofoo", 2)
6

If you want to find the nth overlapping occurrence of needle, you can increment by 1 instead of len(needle), like this:

def find_nth_overlapping(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+1)
        n -= 1
    return start

Example:

>>> find_nth_overlapping("foofoofoofoo", "foofoo", 2)
3

This is easier to read than Mark's version, and it doesn't require the extra memory of the splitting version or importing regular expression module. It also adheres to a few of the rules in the Zen of python, unlike the various re approaches:

  1. Simple is better than complex.
  2. Flat is better than nested.
  3. Readability counts.
tgamblin
+1  A: 
>>> s="abcdefabcdefababcdef"
>>> j=0
>>> for n,i in enumerate(s):
...   if s[n:n+2] =="ab":
...     print n,i
...     j=j+1
...     if j==2: print "2nd occurence at index position: ",n
...
0 a
6 a
2nd occurence at index position:  6
12 a
14 a
A: 

Here's another re + itertools version that should work when searching for either a str or a RegexpObject. I will freely admit that this is likely over-engineered, but for some reason it entertained me.

import itertools
import re

def find_nth(haystack, needle, n = 1):
    """
    Find the starting index of the nth occurrence of ``needle`` in \
    ``haystack``.

    If ``needle`` is a ``str``, this will perform an exact substring
    match; if it is a ``RegexpObject``, this will perform a regex
    search.

    If ``needle`` doesn't appear in ``haystack``, return ``-1``. If
    ``needle`` doesn't appear in ``haystack`` ``n`` times,
    return ``-1``.

    Arguments
    ---------
    * ``needle`` the substring (or a ``RegexpObject``) to find
    * ``haystack`` is a ``str``
    * an ``int`` indicating which occurrence to find; defaults to ``1``

    >>> find_nth("foo", "o", 1)
    1
    >>> find_nth("foo", "o", 2)
    2
    >>> find_nth("foo", "o", 3)
    -1
    >>> find_nth("foo", "b")
    -1
    >>> import re
    >>> either_o = re.compile("[oO]")
    >>> find_nth("foo", either_o, 1)
    1
    >>> find_nth("FOO", either_o, 1)
    1
    """
    if (hasattr(needle, 'finditer')):
        matches = needle.finditer(haystack)
    else:
        matches = re.finditer(re.escape(needle), haystack)
    start_here = itertools.dropwhile(lambda x: x[0] < n, enumerate(matches, 1))
    try:
        return next(start_here)[1].start()
    except StopIteration:
        return -1
Hank Gay