How to find out the content between two words or two sets of random characters?

The scraped page is not guaranteed to be HTML only, and the important data can be inside a JavaScript block, so I can't strip out the JavaScript.

Consider this:

<html>
<body>
<div>StartYYYY "Extract HTML", ENDYYYY

</body>

Some Java Scripts code STARTXXXX "Extract JS Code" ENDXXXX.

</html>

As you can see, the HTML markup may not be complete. I can fetch the page, and then, without worrying about the markup, I want to extract the content between the markers: "Extract HTML" and "Extract JS Code".

What I am looking for, in Python, is something like this:

data = FindBetweenText(UniqueTextBeforeContent, UniqueTextAfterContent, page)

Here page is the downloaded page and data would hold the text I am looking for. I would rather stay away from regex, as some of the cases can be too complex for it.

A: 

Well, this is what it would be in PHP. No doubt there's a much sexier Pythonic way.

function FindBetweenText($before, $after, $text) {
    $before_pos = strpos($text, $before);
    if ($before_pos === false)
        return null;
    // The content starts right after the opening marker, not at it.
    $start = $before_pos + strlen($before);
    // Look for the closing marker only after the opening one.
    $after_pos = strpos($text, $after, $start);
    if ($after_pos === false)
        return null;
    return substr($text, $start, $after_pos - $start);
}
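
For comparison, a minimal Python sketch of the same idea using str.partition instead of index arithmetic. The name find_between_text is hypothetical (not from the question), and the sketch assumes Python 3 and non-empty markers (str.partition raises ValueError on an empty separator).

```python
def find_between_text(before, after, text):
    """Return the text between the first `before` and the next `after`, or None."""
    # partition() returns (head, separator, tail); the separator slot is
    # an empty string when the marker does not occur in the text.
    _, found_before, rest = text.partition(before)
    if not found_before:
        return None
    content, found_after, _ = rest.partition(after)
    if not found_after:
        return None
    return content


page = '</body>\nSome Java Scripts code STARTXXXX "Extract JS Code" ENDXXXX.\n</html>'
print(repr(find_between_text("STARTXXXX", "ENDXXXX", page)))
print(find_between_text("NOSUCHMARKER", "ENDXXXX", page))
```

Because the second partition runs only on the text after the opening marker, an earlier stray copy of the closing marker is ignored.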
chaos
+2  A: 

If you are sure your markers are unique, do something like this:

s="""
<html>
<body>
<div>StartYYYY "Extract HTML", ENDYYYY

</body>

Some Java Scripts code STARTXXXX "Extract JS Code" ENDXXXX.

</html>
"""

def FindBetweenText(startMarker, endMarker, text):
    startPos = text.find(startMarker)
    if startPos < 0: return
    # Searched from the start of the text, so this relies on unique markers.
    endPos = text.find(endMarker)
    if endPos < 0: return

    return text[startPos+len(startMarker):endPos]

print(FindBetweenText('STARTXXXX', 'ENDXXXX', s))
Anurag Uniyal
text = "ppppYYYYyaddaXXXXblahYYYYqqqq"; FindBetweenText("XXXX", "YYYY", text) ... this produces '' but maybe the OP would prefer 'blah'
John Machin
Yes, and there could be more complicated cases of a marker embedded in a marker; here I assume, as the OP said, "unique markers".
Anurag Uniyal
Can you extend your example to take the date out of this blob of text? """<div id=bold>California, US</div><div id=bold>June 12, 2009</div><div id=bold>Status: Active</div>""" Note that location, date and status are variables and can be different, so something like """US</div>""" or """2009</div>""" cannot be used as an end tag. Thanks.
VN44CA
You tell me the unique tags in the HTML and I will tell you the date.
Anurag Uniyal
A: 

[Slightly tested]

def bracketed_find_first(prefix, suffix, page, start=0):
    prefixpos = page.find(prefix, start)
    if prefixpos == -1: return None # NOT ""
    startpos = prefixpos + len(prefix)
    endpos = page.find(suffix, startpos) # DRY
    if endpos == -1: return None # NOT ""
    return page[startpos:endpos]
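
A quick usage sketch, assuming Python 3. The function is repeated only so the snippet runs on its own; the test string is the one from the comment thread above, where the closing marker also appears before the opening one.

```python
def bracketed_find_first(prefix, suffix, page, start=0):
    prefixpos = page.find(prefix, start)
    if prefixpos == -1:
        return None
    startpos = prefixpos + len(prefix)
    # The suffix is searched only after the prefix, so an earlier
    # stray suffix does not produce an empty result.
    endpos = page.find(suffix, startpos)
    if endpos == -1:
        return None
    return page[startpos:endpos]


text = "ppppYYYYyaddaXXXXblahYYYYqqqq"
print(bracketed_find_first("XXXX", "YYYY", text))   # blah
print(bracketed_find_first("XXXX", "ZZZZ", text))   # None
```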

Note: the above returns only the first occurrence. Here is a generator which yields each occurrence.

def bracketed_finditer(prefix, suffix, page, start_at=0):
    while True:
        prefixpos = page.find(prefix, start_at)
        if prefixpos == -1: return # StopIteration
        startpos = prefixpos + len(prefix)
        endpos = page.find(suffix, startpos)
        if endpos == -1: return
        yield page[startpos:endpos]
        start_at = endpos + len(suffix)
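
As a sketch of how the generator could handle the repeated-tag blob from the comment thread above (Python 3 assumed, function repeated so the snippet is self-contained): since every field uses the same opening and closing tags, the fields come out in order and the date can be picked by position. That the date is always the second field is an assumption, not something the question states.

```python
def bracketed_finditer(prefix, suffix, page, start_at=0):
    while True:
        prefixpos = page.find(prefix, start_at)
        if prefixpos == -1:
            return
        startpos = prefixpos + len(prefix)
        endpos = page.find(suffix, startpos)
        if endpos == -1:
            return
        yield page[startpos:endpos]
        # Resume scanning after the suffix that was just consumed.
        start_at = endpos + len(suffix)


blob = ("<div id=bold>California, US</div>"
        "<div id=bold>June 12, 2009</div>"
        "<div id=bold>Status: Active</div>")

fields = list(bracketed_finditer("<div id=bold>", "</div>", blob))
print(fields)      # ['California, US', 'June 12, 2009', 'Status: Active']
print(fields[1])   # assumes the date is always the second field
```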
John Machin
I tested the first version you have and it works well on a single occurrence.
VN44CA
@VN44CA: I'm astonished that you accepted an answer that was not only later than mine but also recursive. Any particular reason?
John Machin
A: 

Here's my attempt; this is tested. While recursive, it should involve no unnecessary string duplication, although a generator might be more efficient.

def bracketed_find(s, start, end, startat=0):
    startloc = s.find(start, startat)
    if startloc == -1:
        return []
    endloc = s.find(end, startloc + len(start))
    if endloc == -1:
        return [s[startloc + len(start):]]
    return [s[startloc + len(start):endloc]] + bracketed_find(s, start, end, endloc + len(end))

And here is a generator version:

def bracketed_find(s, start, end, startat=0):
    startloc = s.find(start, startat)
    if startloc == -1:
        return
    endloc = s.find(end, startloc + len(start))
    if endloc == -1:
        yield s[startloc + len(start):]
        return
    else:
        yield s[startloc + len(start):endloc]

    for found in bracketed_find(s, start, end, endloc + len(end)):
        yield found
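
Both versions above recurse once per match, so a page with more matches than Python's recursion limit (1000 by default in CPython) could raise RecursionError. A loop-based sketch with the same behavior, including yielding the unterminated tail when the end marker is missing, might look like this. The name bracketed_find_loop is hypothetical, and Python 3 is assumed.

```python
def bracketed_find_loop(s, start, end, startat=0):
    """Yield each chunk between `start` and `end`, without recursion."""
    while True:
        startloc = s.find(start, startat)
        if startloc == -1:
            return
        contentloc = startloc + len(start)
        endloc = s.find(end, contentloc)
        if endloc == -1:
            # No closing marker: yield the rest of the string, as above.
            yield s[contentloc:]
            return
        yield s[contentloc:endloc]
        startat = endloc + len(end)


print(list(bracketed_find_loop("a[1]b[2]c[3", "[", "]")))   # ['1', '2', '3']

many = "[x]" * 5000
print(len(list(bracketed_find_loop(many, "[", "]"))))       # 5000
```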
polyglot