ansaurus

Question

Finding content between two words without RegEx, BeautifulSoup, lxml, etc.

Answer 1

A:

Well, this is what it would be in PHP. No doubt there's a much sexier Pythonic way.

function FindBetweenText($before, $after, $text) {
    $before_pos = strpos($text, $before);
    if ($before_pos === false)
        return null;
    // Content starts just past the opening marker.
    $start = $before_pos + strlen($before);
    // Search for the closing marker only after the opening one.
    $after_pos = strpos($text, $after, $start);
    if ($after_pos === false)
        return null;
    return substr($text, $start, $after_pos - $start);
}
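
For comparison, here is a minimal Python sketch of the same idea using `str.partition`, which splits a string around the first occurrence of a separator. The name `find_between_text` and the sample string are made up for this illustration, and the function returns only the text strictly between the two markers.

```python
def find_between_text(before, after, text):
    """Return the text between the first `before` and the next `after`, or None."""
    # partition() returns (head, separator, tail); the separator slot is ''
    # when the separator was not found.
    _, found_before, rest = text.partition(before)
    if not found_before:
        return None
    middle, found_after, _ = rest.partition(after)
    if not found_after:
        return None
    return middle


print(find_between_text("<b>", "</b>", "say <b>hello</b> now"))  # hello
```

Because the second `partition` runs on the text after the opening marker, a closing marker that appears earlier in the string is ignored.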

chaos 2009-07-12 15:04:39

Answer 2

+2 A:

If you are sure your markers are unique, do something like this:

s="""
<html>
<body>
<div>StartYYYY "Extract HTML", ENDYYYY

</body>

Some Java Scripts code STARTXXXX "Extract JS Code" ENDXXXX.

</html>
"""

def FindBetweenText(startMarker, endMarker, text):
    startPos = text.find(startMarker)
    if startPos < 0: return
    endPos = text.find(endMarker)
    if endPos < 0: return

    return text[startPos+len(startMarker):endPos]

print(FindBetweenText('STARTXXXX', 'ENDXXXX', s))

Anurag Uniyal 2009-07-12 15:18:04

text = "ppppYYYYyaddaXXXXblahYYYYqqqq"; FindBetweenText("XXXX", "YYYY", text) ... this produces '' but maybe the OP would prefer 'blah'

John Machin 2009-07-12 15:33:47

Yes, and there could be more complicated cases where one marker is embedded in another; here I assume, as the OP said, "unique markers".

Anurag Uniyal 2009-07-13 02:55:44
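
The case raised in the comments above can be reproduced with a small sketch. Here `find_between_anywhere` mirrors the answer's logic (the end marker is searched from the start of the text), while `find_between_after` is a hypothetical variant, not taken from the thread, that begins the end-marker search just past the start marker.

```python
def find_between_anywhere(start_marker, end_marker, text):
    """End marker is searched from the beginning of the text."""
    start_pos = text.find(start_marker)
    if start_pos < 0:
        return None
    end_pos = text.find(end_marker)
    if end_pos < 0:
        return None
    return text[start_pos + len(start_marker):end_pos]


def find_between_after(start_marker, end_marker, text):
    """End marker is searched only after the start marker."""
    start_pos = text.find(start_marker)
    if start_pos < 0:
        return None
    content_start = start_pos + len(start_marker)
    end_pos = text.find(end_marker, content_start)
    if end_pos < 0:
        return None
    return text[content_start:end_pos]


text = "ppppYYYYyaddaXXXXblahYYYYqqqq"
print(repr(find_between_anywhere("XXXX", "YYYY", text)))  # ''
print(repr(find_between_after("XXXX", "YYYY", text)))     # 'blah'
```

With truly unique markers that appear in order, both functions return the same result; they differ only when the end marker also occurs before the start marker.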

Can you extend your example to take the date out of this blob of text? """<div id=bold>California, US</div><div id=bold>June 12, 2009</div><div id=bold>Status: Active</div>""" Note that location, date and status are variables and can be different, so for example """US</div>""" or """2009</div>""" cannot be used as end tags. Thanks.

VN44CA 2009-07-13 04:35:38

You tell me the unique tags in the HTML and I will tell you the date.

Anurag Uniyal 2009-07-13 06:44:12
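
One possible way to approach the date question above, offered only as a sketch: since `<div id=bold>` and `</div>` are not unique in that blob, collect every bracketed value and pick one by position. This assumes the fields always appear in the order location, date, status, which the thread does not confirm; `find_all_between` is a name invented for this illustration.

```python
def find_all_between(prefix, suffix, page):
    """Return every substring found between `prefix` and the next `suffix`."""
    results = []
    pos = 0
    while True:
        prefix_pos = page.find(prefix, pos)
        if prefix_pos == -1:
            break
        start = prefix_pos + len(prefix)
        end = page.find(suffix, start)
        if end == -1:
            break
        results.append(page[start:end])
        pos = end + len(suffix)
    return results


blob = ("<div id=bold>California, US</div>"
        "<div id=bold>June 12, 2009</div>"
        "<div id=bold>Status: Active</div>")

fields = find_all_between("<div id=bold>", "</div>", blob)
# Assumption: exactly three fields, in the order location, date, status.
location, date, status = fields
print(date)  # June 12, 2009
```

If the page layout changes (an extra div, a different order), the positional pick breaks, which is why a unique marker around the date would be more robust.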

Answer 3

A:

[Slightly tested]

def bracketed_find_first(prefix, suffix, page, start=0):
    prefixpos = page.find(prefix, start)
    if prefixpos == -1: return None # NOT ""
    startpos = prefixpos + len(prefix)
    endpos = page.find(suffix, startpos) # DRY
    if endpos == -1: return None # NOT ""
    return page[startpos:endpos]

Note: the above returns only the first occurrence. Here is a generator which yields each occurrence.

def bracketed_finditer(prefix, suffix, page, start_at=0):
    while True:
        prefixpos = page.find(prefix, start_at)
        if prefixpos == -1: return # StopIteration
        startpos = prefixpos + len(prefix)
        endpos = page.find(suffix, startpos)
        if endpos == -1: return
        yield page[startpos:endpos]
        start_at = endpos + len(suffix)
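
A short usage sketch for the generator follows. The function body is repeated so the example runs on its own, and the sample string is invented for illustration.

```python
def bracketed_finditer(prefix, suffix, page, start_at=0):
    while True:
        prefixpos = page.find(prefix, start_at)
        if prefixpos == -1:
            return
        startpos = prefixpos + len(prefix)
        endpos = page.find(suffix, startpos)
        if endpos == -1:
            return
        yield page[startpos:endpos]
        start_at = endpos + len(suffix)


page = "a [one] b [two] c [three"

# Collect every complete occurrence; the unterminated "[three" is skipped.
print(list(bracketed_finditer("[", "]", page)))        # ['one', 'two']

# Take only the first occurrence, with None as the fallback.
print(next(bracketed_finditer("[", "]", page), None))  # one
```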

John Machin 2009-07-12 15:18:05

I tested your first version and it works well on a single occurrence.

VN44CA 2010-09-22 16:44:08

@VN44CA: I'm astonished that you accepted an answer that was not only later than mine but also recursive. Any particular reason?

John Machin 2010-09-22 19:31:02

Answer 4

A:

Here's my attempt; it is tested. Although it is recursive, there should be no unnecessary string duplication, though a generator might be more efficient.

def bracketed_find(s, start, end, startat=0):
    startloc = s.find(start, startat)
    if startloc == -1:
        return []
    endloc = s.find(end, startloc + len(start))
    if endloc == -1:
        return [s[startloc + len(start):]]
    return [s[startloc + len(start):endloc]] + bracketed_find(s, start, end, endloc + len(end))

And here is a generator version:

def bracketed_find(s, start, end, startat=0):
    startloc = s.find(start, startat)
    if startloc == -1:
        return
    endloc = s.find(end, startloc + len(start))
    if endloc == -1:
        yield s[startloc + len(start):]
        return
    else:
        yield s[startloc + len(start):endloc]

    for found in bracketed_find(s, start, end, endloc + len(end)):
        yield found
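
Python's default recursion limit is about 1000 frames, so a recursive version may fail on text with a very large number of matches. As a sketch only, the same behaviour (including yielding the unterminated tail when the end marker is missing) can be written as a loop. The name `bracketed_find_iter` and the sample inputs are made up for this illustration.

```python
def bracketed_find_iter(s, start, end, startat=0):
    """Yield each substring between `start` and `end`, without recursion."""
    while True:
        startloc = s.find(start, startat)
        if startloc == -1:
            return
        content_start = startloc + len(start)
        endloc = s.find(end, content_start)
        if endloc == -1:
            # No closing marker: yield the rest of the string, as the
            # recursive version does, then stop.
            yield s[content_start:]
            return
        yield s[content_start:endloc]
        startat = endloc + len(end)


print(list(bracketed_find_iter("[a][b", "[", "]")))         # ['a', 'b']
print(len(list(bracketed_find_iter("[v]" * 5000, "[", "]"))))  # 5000
```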

polyglot 2009-07-12 15:29:56
