I want to be able to grab an HTML page and parse it using only JavaScript, so that nothing touches the server.

Assuming I can get the HTML response (I have solved the cross-domain issues), how can I use jQuery on the complete HTML document?

Example is like this (here is a full gist with a remote example):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <title>Parent Page wanting to Parse Children</title>
    <script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>
    <meta name="keywords" content="parent, html, parsing">
  </head>
  <body>
    <script type="text/javascript">
      $(document).ready(function() {
        //  data looks like this:
        var html = ""
        html += '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
        html += '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">'
        html += '  <head>                                                        '
        html += '    <title>Sub Page to Parse</title>                            '
        html += '    <script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"><\/script>'
        html += '    <meta name="keywords" content="parent, html, parsing">      '
        html += '  </head>                                                       '
        html += '  <body>                                                        '
        html += '    <script type="text/javascript">                             '
        html += '      alert("im javascript");                                   '
        html += '      setTimeout(function() {                                   '
        html += '        $("body").css("background-color", "#ffaaaa")            '
        html += '      }, 400);                                                  '
        html += '    <\/script>                                                  '
        html += '    <div id="child_div"></div>                                  '
        html += '  </body>                                                       '
        html += '</html>'

// this works fine:
//        $("#parent_div").append(html);
//        $("#child_div")
//          .width(100)
//          .height(100)
//          .css("background-color", "yellow")
//          .append("<p>child text</p>");
// ... but that's not what I am trying to do...

// reason being: i don't want to add this sub-html page to the dom...
// I just want to scrape it for data...

// I want to do this, but I am getting null for every case:
        var meta = $(html).find("meta");
        alert(meta.html());
        var title = $(html).find("title");
        alert(title.html());

      });
    </script>
    <div id="parent_div"></div>
  </body>
</html>

The problem is, var child_body = $(data).find("body"); doesn't give me anything. I'm not sure how I should be going about traversing this complete html document using jQuery. I have tried to remove the <!DOCTYPE...> tag, but that doesn't do much.

Is something like this possible?

I have been messing around with John Resig's Javascript HTML Parser but that doesn't quite cover it yet.

Is there an XPath javascript library that would be more suitable?

A: 

Try to capture the data with .html(data) first.

(haven't tried, it's just a thought)

Sylverdrag
A: 

Given that you have some HTML markup as a JavaScript string, you can hand it to jQuery and parse it.

var tagSoup = '<html><head>.and so on..</html>';

var tag$ = $(tagSoup);

var someValue = tag$.find('#someId').val();
AndrewDotHay
A: 

How about using a DocumentFragment? You'll still likely need to futz with the text you get, but you've at least offloaded the parsing to the browser, which hopefully knows what it's doing. It's also not in the page's DOM.

lawnsea
He's already doing this, `$(html)` creates a document fragment :)
Nick Craver
Didn't know that. Thanks for the info!
lawnsea
+3  A: 

The problem isn't in jQuery exactly, but in the differences between browsers' .innerHTML implementations. Different browsers handle this in different ways: in Opera your example works fine, in Firefox it can work with some adjustment, in IE8 it half-works with adjustment, and Chrome strips everything.

It all comes down to how they handle the .innerHTML call, which is what jQuery uses internally when creating document fragments.

Here's a quick test page using the exact HTML you have, and the results from a few browsers:


Chrome 6 (runs the alert(), strips almost everything):

<div id="child_div"></div>
  • Results:
    • Entire <head> and contents stripped, nothing to get

IE8 (Runs the alert(), it retains the <meta>, but as a top level element, test it in IE here):

<META name=keywords content="parent, html, parsing">
<DIV id=child_div></DIV>
  • Results:
    • $(html).filter("meta").attr("name"): "keywords"
    • <title> was stripped

Firefox 3.6 (Runs the alert(), retains <head> contents but again as top-level elements, test it here):

<title>Sub Page to Parse</title>
<meta name="keywords" content="parent, html, parsing">
<div id="child_div"></div> 
  • Results:
    • $(html).filter("meta").attr("name"): "keywords"
    • $(html).filter("title").html() : "Sub Page to Parse"

Opera 10.6 (Runs the alert(), strips only scripts, test it here):

<head> 
  <title>Sub Page to Parse</title>
  <meta name="keywords" content="parent, html, parsing"> 
</head> 
<div id="child_div"></div>
  • Results:
    • $(html).find("meta").attr("name") : "keywords"
    • $(html).find("title").html() : "Sub Page to Parse"

So the problem isn't jQuery per se, but what the different browsers do in their .innerHTML implementations, each stripping out whatever it wants. This makes parsing anything in the <head> particularly unreliable: even when it is retained, it may or may not be a top-level element (which is why .filter() is needed in IE8 and Firefox while .find() works in Opera), so for example $(html).length will vary.

I would say you have two options here, neither of which seems too appealing:

  • Make the request via a server-side call that extracts the info you want
  • Parse the HTML yourself, but you won't get much benefit from jQuery in that department
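
For the second option, here is a rough sketch of what "parse it yourself" could look like for the two values the question is after. The helper names scrapeHead and getAttr are made up for illustration and are not part of jQuery. The sketch assumes quoted attribute values and no ">" inside them, so treat it as a starting point rather than a real HTML parser:

```javascript
// Hypothetical helper: pulls the <title> text and <meta name/content> pairs
// out of an HTML string using regular expressions only (no DOM, no jQuery).
function scrapeHead(html) {
  var result = { title: null, meta: {} };

  // First <title>...</title>, with surrounding whitespace trimmed.
  var titleMatch = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  if (titleMatch) {
    result.title = titleMatch[1].replace(/^\s+|\s+$/g, "");
  }

  // Every <meta ...> tag that carries both a name and a content attribute.
  var metaTag = /<meta\b[^>]*>/gi;
  var tag;
  while ((tag = metaTag.exec(html)) !== null) {
    var name = getAttr(tag[0], "name");
    var content = getAttr(tag[0], "content");
    if (name !== null && content !== null) {
      result.meta[name] = content;
    }
  }
  return result;
}

// Hypothetical helper: reads one quoted attribute value from a single tag string.
// Returns null when the attribute is missing or unquoted.
function getAttr(tag, attr) {
  var re = new RegExp("\\s" + attr + "\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')", "i");
  var m = re.exec(tag);
  if (!m) {
    return null;
  }
  return m[1] !== undefined ? m[1] : m[2];
}

var page = '<html><head><title>Sub Page to Parse</title>' +
           '<meta name="keywords" content="parent, html, parsing"></head>' +
           '<body><div id="child_div"></div></body></html>';
var info = scrapeHead(page);
// info.title         -> "Sub Page to Parse"
// info.meta.keywords -> "parent, html, parsing"
```

Because this only looks at the string, none of the scripts in the fetched page run and nothing is added to the DOM, but it also gives you none of jQuery's selectors.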

Sorry the answer sucks, but unless you parse it yourself, the cross-browser issues look like a killer here and make jQuery just about useless for this.

Nick Craver