I'm downloading a web page (tag soup HTML) with XMLHttpRequest, and I want to turn the output into a DOM object that I can then run XPath queries on. How do I convert a string into a DOM object?

It appears that the general solution is to create a hidden iframe and throw the contents of the string into that. There has been talk of updating DOMParser to support text/html but as of Firefox 3.0.1 you still get an NS_ERROR_NOT_IMPLEMENTED if you try.

Is there any option besides using the hidden iframe trick? And if not, what is the best way to do the iframe trick so that your code works outside the context of any currently open tabs (so that closing tabs won't screw up code, etc)?

This is an example of why I'm looking for a solution other than the iframe hack: if I have to write all that code to get a robust solution, I'd rather keep looking for something else.
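For the DOMParser route mentioned in the question, a minimal feature-detecting sketch might look like the following. It assumes an engine whose DOMParser accepts 'text/html'; the thread only confirms that Firefox 3.0.1 does not. The helper name tryParseHtml and its Parser argument are illustrative, not part of any library.

```javascript
// Hypothetical helper: returns a parsed document, or null if the engine
// cannot parse 'text/html' through DOMParser.
// The constructor is passed in so the sketch does not assume a global DOMParser.
function tryParseHtml(html, Parser) {
    try {
        var doc = new Parser().parseFromString(html, 'text/html');
        return doc || null; // treat a missing result as "unsupported"
    } catch (e) {
        return null; // for example NS_ERROR_NOT_IMPLEMENTED
    }
}
```

In a browser this would be called as tryParseHtml(request.responseText, DOMParser), falling back to the iframe approach when it returns null.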

+1  A: 

Try this:

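var request = new XMLHttpRequest();

// Force the response to be parsed as XML, whatever Content-Type the server sends.
request.overrideMimeType('text/xml');
request.onreadystatechange = process;
request.open('GET', url);
request.send(null);

function process() {
    // readyState 4 means the request has completed.
    if (request.readyState === 4 && request.status === 200) {
        // responseXML holds the parsed document; query it here.
        var xml = request.responseXML;
    }
}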

Notice the overrideMimeType and responseXML.
A readyState of 4 means the request has completed.
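Because the XML parser is strict, it may help to check whether responseXML is usable before querying it. The sketch below assumes Firefox's behaviour of reporting malformed XML as a document whose root element is parsererror; other browsers may simply return null. The function name isParsedXml is illustrative.

```javascript
// Hypothetical helper: true only if xmlDoc looks like a successfully parsed
// XML document.
function isParsedXml(xmlDoc) {
    if (!xmlDoc || !xmlDoc.documentElement) {
        return false; // no document at all
    }
    // Assumption: a failed parse yields a <parsererror> root element.
    return xmlDoc.documentElement.nodeName !== 'parsererror';
}
```

Inside process(), this would be used as if (isParsedXml(request.responseXML)) { ... }.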

Steve Willard
This does not work if the response is not valid XML to begin with. If you tell Firefox to expect XML it will be strict about what it will parse.
thelsdj
A: 

Try creating a div:

var div = document.createElement('div');

Then set the div's innerHTML to the tag soup HTML. The browser should parse it into DOM nodes, which you can then query.

The innerHTML property takes a string that specifies a valid combination of text and elements. When the innerHTML property is set, the given string completely replaces the existing content of the object. If the string contains HTML tags, the string is parsed and formatted as it is placed into the document.
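Put together, the div approach could be sketched as below. The helper name parseFragment is illustrative, and the document is passed in as an argument only so the sketch does not depend on a global.

```javascript
// Hypothetical helper: lets the browser's HTML parser repair tag soup by
// assigning it to the innerHTML of a detached <div>.
// `doc` is any object with a DOM-style createElement, normally window.document.
function parseFragment(doc, html) {
    var container = doc.createElement('div');
    // Assigning innerHTML replaces any existing content with the parsed nodes.
    container.innerHTML = html;
    return container;
}
```

In a browser, parseFragment(document, request.responseText) returns a div whose children can be searched with getElementsByTagName or passed as the context node to document.evaluate.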

Steve Willard
The problem with this is that I need the entire HTML document, <head> and all, which this would throw away. Also, I'm trying not to use existing windows or tabs, because my code runs outside their context and I want it to survive a user randomly closing a window or tab (assuming Firefox is still running).
thelsdj
A: 

So you want to download a web page as an XML object using JavaScript, but you don't want to use a web page? Since you have no control over what the user will do (closing tabs, windows, and so on), you would need to do this in something like an OS X Dashboard widget or a separate application. A Firefox extension would also work, unless you have to worry about the user closing the browser.

Steve Willard
Yes, I am using a Firefox extension, but most of the iframe examples attach the iframe to an arbitrary browser window rather than to an object in the core process, so they are not resistant to the browser window or tab being closed.
thelsdj
+2  A: 

Ajaxian actually had a post today on inserting and retrieving HTML from an iframe. You can probably use the JS snippet posted there.

As for handling closing of a browser / tab, you can attach to the onbeforeunload (http://msdn.microsoft.com/en-us/library/ms536907(VS.85).aspx) event and do whatever you need to do.
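A minimal sketch of hooking that event, assuming a window object that supports addEventListener; the helper name onWindowClosing and the cleanup callback are illustrative:

```javascript
// Hypothetical helper: runs `cleanup` when `win` is about to unload,
// for example because the user is closing the window or tab.
function onWindowClosing(win, cleanup) {
    win.addEventListener('beforeunload', function () {
        cleanup();
    }, false);
}
```

For example, onWindowClosing(window, function () { request.abort(); }) would cancel a pending request before the hosting window goes away.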

Darren Kopp
A: 

Is there any option besides using the hidden iframe trick?

Unfortunately, no, not now. Otherwise the microsummary code you point to would use it instead.

And if not, what is the best way to do the iframe trick so that your code works outside the context of any currently open tabs (so that closing tabs won't screw up code, etc)?

The code you quoted uses the most recent browser window, so closing tabs won't affect parsing. Closing that browser window will abort your load, but you can deal with that (for example, detect that the load was aborted and restart it in another window), and it doesn't happen very often.
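The detect-and-restart idea could be sketched roughly as follows. Everything here is hypothetical: pickWindow, startLoad and their signatures are placeholders for whatever the extension uses to find a browser window and run the iframe load, not Mozilla APIs.

```javascript
// Hypothetical sketch of restarting an aborted load in another window.
// pickWindow() returns a currently open browser window, or null if none.
// startLoad(win, onDone, onAbort) parses in an iframe hosted by `win` and
// calls exactly one of the two callbacks.
function loadWithRestart(pickWindow, startLoad, maxAttempts, callback) {
    var attempts = 0;

    function attempt() {
        var win = pickWindow();
        if (!win || attempts >= maxAttempts) {
            callback(new Error('load aborted after ' + attempts + ' attempt(s)'), null);
            return;
        }
        attempts++;
        startLoad(win, function (doc) {
            callback(null, doc);
        }, attempt); // host window closed: try again in another window
    }

    attempt();
}
```

In an extension, pickWindow might wrap nsIWindowMediator's getMostRecentWindow('navigator:browser'); that pairing is an assumption about how the pieces would be wired together.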

You need a DOM window for the iframe to work properly, so there's no clean solution at the moment (if you're keen on using the Mozilla parser).

Nickolay