I want to fetch an HTML page and parse it using only JavaScript, so that nothing touches the server.
Assuming I can get the HTML response (I have solved the cross-domain issues), how can I use jQuery on the complete HTML document?
Here is an example (here is a full gist with a remote example):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>Parent Page wanting to Parse Children</title>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>
<meta name="keywords" content="parent, html, parsing">
</head>
<body>
<script type="text/javascript">
$(document).ready(function() {
  // data looks like this:
  var html = "";
  html += '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">';
  html += '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">';
  html += ' <head> ';
  html += ' <title>Sub Page to Parse</title> ';
  html += ' <script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"><\/script>';
  html += ' <meta name="keywords" content="parent, html, parsing"> ';
  html += ' </head> ';
  html += ' <body> ';
  html += ' <script type="text/javascript"> ';
  html += ' alert("im javascript"); ';
  html += ' setTimeout(function() { ';
  html += ' $("body").css("background-color", "#ffaaaa") ';
  html += ' }, 400); ';
  html += ' <\/script> ';
  html += ' <div id="child_div"></div> ';
  html += ' </body> ';
  html += '</html>';

  // this works fine:
  // $("#parent_div").append(html);
  // $("#child_div")
  //   .width(100)
  //   .height(100)
  //   .css("background-color", "yellow")
  //   .append("<p>child text</p>");

  // ... but that's not what I am trying to do...
  // reason being: I don't want to add this sub-html page to the DOM...
  // I just want to scrape it for data...

  // I want to do this, but I am getting null for every case:
  var meta = $(html).find("meta");
  alert(meta.html());
  var title = $(html).find("title");
  alert(title.html());
});
</script>
<div id="parent_div"></div>
</body>
</html>
The problem is that var child_body = $(data).find("body"); doesn't give me anything. I'm not sure how I should go about traversing this complete HTML document using jQuery. I have tried removing the <!DOCTYPE...> tag, but that doesn't make much difference.
Is something like this possible?
I have been experimenting with John Resig's JavaScript HTML Parser, but it doesn't quite cover this yet.
Is there an XPath JavaScript library that would be more suitable?
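For reference, here is a minimal sketch of the kind of scraping described above, done at the string level so that nothing is added to the DOM. It uses regular expressions rather than a real HTML parser, so it assumes simple markup with double-quoted attributes. The names `scrapeHead` and `sample` are hypothetical and introduced only for illustration; they are not jQuery APIs.

```javascript
// Hypothetical helper (not part of jQuery): pulls the <title> text and the
// <meta name="..." content="..."> pairs out of a raw HTML string.
// Assumption: attributes are double-quoted and the markup is simple.
function scrapeHead(html) {
  var result = { title: null, meta: {} };

  // First <title>...</title> in the string, if any, with whitespace trimmed.
  var titleMatch = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  if (titleMatch) {
    result.title = titleMatch[1].replace(/^\s+|\s+$/g, "");
  }

  // Walk every <meta ...> tag; name and content may appear in either order.
  var metaTag = /<meta\b[^>]*>/gi;
  var tag;
  while ((tag = metaTag.exec(html)) !== null) {
    var name = /\bname\s*=\s*"([^"]*)"/i.exec(tag[0]);
    var content = /\bcontent\s*=\s*"([^"]*)"/i.exec(tag[0]);
    if (name && content) {
      result.meta[name[1]] = content[1];
    }
  }
  return result;
}

// A shortened version of the sub page from the example above.
var sample = '<html><head><title>Sub Page to Parse</title>' +
  '<meta name="keywords" content="parent, html, parsing"></head>' +
  '<body><div id="child_div"></div></body></html>';

var scraped = scrapeHead(sample);
```

Because the string is never turned into DOM nodes, the script embedded in the sub page (the alert and the setTimeout call) is never executed. The trade-off is that this only works for the specific tags it looks for and would break on unusual markup, which a proper parser would handle.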