ansaurus

Question

How to parse through a script tag using Python and BeautifulSoup

Answer 1

+1 A:

You can't do it with BeautifulSoup alone. BeautifulSoup parses the HTML as it arrives at the browser (before any rewriting or DOM manipulation), and it does not parse (let alone execute) JavaScript, so tags that only exist inside document.write strings never show up in the parse tree.
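
To illustrate the point, here is a minimal sketch that uses the standard library's html.parser as a stand-in for BeautifulSoup (an assumption made only to keep the example dependency-free). The TagCollector class and the sample page are hypothetical. The <frame> inside the script body is never reported as a tag; it only appears as raw script text.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag seen and the raw text found inside <script>."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.script_text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies are delivered as plain text, not parsed as markup.
        if self._in_script:
            self.script_text.append(data)


# Hypothetical page: the frame exists only inside a JavaScript string.
page = """<html><head><script language="javascript">
document.write('<frame name="nav" src="/nav/index_nav.html">');
</script></head></html>"""

collector = TagCollector()
collector.feed(page)
collector.close()

seen_tags = collector.tags
script_source = "".join(collector.script_text)

print(seen_tags)
print(script_source.strip())
```

The script source is still available as text, so it can be handed to another tool for extraction.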

You might want to use a simple regular expression in this special case.
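
A rough sketch of what such a regular expression could look like, assuming the script text has already been pulled out as a string and that the frame is written as a single single-quoted literal. The names WRITE_CALL, ATTRIBUTE and js_source are made up for this example, and calls that build the string with + are deliberately skipped.

```python
import re

# Sample script text; in practice this would be the contents of the <script> tag.
js_source = """
document.write('<frame name="nav" src="/nav/index_nav.html" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" border = "no" noresize>');
document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html" noresize>');
document.write('</frameset>');
"""

# document.write('<frame ...>') where the whole argument is one quoted string.
WRITE_CALL = re.compile(r"document\.write\(\s*'(<frame\b[^']*)'\s*\)")

# name="value" pairs inside the tag (spaces around = are tolerated).
ATTRIBUTE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')

frames = [
    dict(ATTRIBUTE.findall(match.group(1)))
    for match in WRITE_CALL.finditer(js_source)
]

for frame in frames:
    print(frame["name"], frame["src"])
```

Only the first call matches here: the second one concatenates a variable into the string, and the third does not write a <frame> tag.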

Triptych 2009-12-10 19:14:25

ok thanks i will try that

qaAutomation 2009-12-10 19:26:56

Answer 2

+1 A:

Pyparsing might help you bridge this mix of JS and HTML. The parser below looks for document.write statements whose argument is either a quoted string or a string expression made of several quoted strings and identifiers added together. It quasi-evaluates the string expression, parses the result for an embedded <frame> tag, and returns the frame attributes as a pyparsing ParseResults object. That object gives you access to the named attributes either as object attributes or as dict keys (your preference).

jssrc = """
<script language="javascript">
.
.
.
document.write('<frame name="nav" src="/nav/index_nav.html" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" border = "no" noresize>'); 
if (anchor != "") 
{  document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html?' + anchor + '" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>'); } 
else 
{  document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>'); } 
document.write('</frameset>');
    // end hiding -->
    </script>"""

from pyparsing import *

# define some basic punctuation, and quoted string
LPAR,RPAR,PLUS = map(Suppress,"()+")
qs = QuotedString("'")

# use pyparsing helper to define an expression for opening <frame> 
# tags, which includes support for attributes also
frameTag = makeHTMLTags("frame")[0]

# some of our document.write statements contain not a string literal,
# but an expression of strings and vars added together; define
# an identifier expression, and add a parse action that converts
# a var name to a likely value
ident = Word(alphas).setParseAction(lambda toks: evalvars[toks[0]])
evalvars = { 'cusip' : "CUSIP", 'anchor' : "ANCHOR" }

# now define the string expression itself, as a quoted string,
# optionally followed by identifiers and quoted strings added
# together; identifiers will get translated to their defined values
# as they are parsed; the first parse action on stringExpr concatenates
# all the tokens; then the second parse action actually parses the
# body of the string as a <frame> tag and returns the results of parsing
# the tag and its attributes; if the parse fails (that is, if the
# string contains something that is not a <frame> tag), the second
# parse action will throw an exception, which will cause the stringExpr
# expression to fail
stringExpr = qs + ZeroOrMore( PLUS + (ident | qs))
stringExpr.setParseAction(lambda toks : ''.join(toks))
stringExpr.addParseAction(lambda toks: 
    frameTag.parseString(toks[0],parseAll=True))

# finally, define the overall document.write(...) expression
docWrite = "document.write" + LPAR + stringExpr + RPAR

# scan through the source looking for document.write commands containing
# <frame> tags using scanString; print the original source fragment, 
# then access some of the attributes extracted from the <frame> tag
# in the quoted string, using either object-attribute notation or 
# dict index notation
for dw,locstart,locend in docWrite.scanString(jssrc):
    print(jssrc[locstart:locend])
    print(dw.name)
    print(dw["src"])
    print()

Prints:

document.write('<frame name="nav" src="/nav/index_nav.html" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" border = "no" noresize>')
nav
/nav/index_nav.html

document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html?' + anchor + '" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>')
body
http://content.members.fidelity.com/mfl/summary/0,,CUSIP,00.html?ANCHOR

document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>')
body
http://content.members.fidelity.com/mfl/summary/0,,CUSIP,00.html
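
If installing pyparsing is not an option, the same quasi-evaluation idea can be approximated with only the standard library. This is a sketch under stated assumptions rather than an equivalent of the parser above: it swaps in re and html.parser, the names PLACEHOLDERS, evaluate, FrameAttributes and frames_written are invented for the example, and a document.write call whose string is not a <frame> tag simply yields no result instead of failing the match.

```python
import re
from html.parser import HTMLParser

# Stand-in values for JavaScript variables, as in the evalvars dict above.
PLACEHOLDERS = {"cusip": "CUSIP", "anchor": "ANCHOR"}

# A term is a single-quoted string or an identifier.
TERM = r"(?:'[^']*'|[A-Za-z_]\w*)"
# document.write( 'string' [+ term]... )
WRITE_CALL = re.compile(
    r"document\.write\(\s*('[^']*'(?:\s*\+\s*" + TERM + r")*)\s*\)"
)
TOKEN = re.compile(r"'([^']*)'|([A-Za-z_]\w*)")


def evaluate(expr):
    """Concatenate string literals, replacing identifiers with placeholders."""
    parts = []
    for literal, ident in TOKEN.findall(expr):
        parts.append(PLACEHOLDERS.get(ident, ident) if ident else literal)
    return "".join(parts)


class FrameAttributes(HTMLParser):
    """Collect the attributes of every <frame> start tag as a dict."""

    def __init__(self):
        super().__init__()
        self.frames = []

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            self.frames.append(dict(attrs))


def frames_written(js_source):
    found = []
    for match in WRITE_CALL.finditer(js_source):
        parser = FrameAttributes()
        parser.feed(evaluate(match.group(1)))
        parser.close()
        found.extend(parser.frames)
    return found


js_source = """
document.write('<frame name="nav" src="/nav/index_nav.html" noresize>');
if (anchor != "")
{  document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html?' + anchor + '" scrolling="auto" noresize>'); }
document.write('</frameset>');
"""

frames = frames_written(js_source)
for frame in frames:
    print(frame["name"], frame["src"])
```

Valueless attributes such as noresize come back from html.parser with a value of None.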

Paul McGuire 2009-12-10 21:10:57
