I am developing my first Firefox extension and for that I need to get the complete source code of the current page. How can I do that with XUL?

A: 

Maybe you can get it via DOM, using

var source = document.getElementsByTagName("html");

and fetch the source using DOMParser

https://developer.mozilla.org/En/DOMParser

Manuel Bitto
getElementsByTagName (note: elements)
N 1.1
+1  A: 

Hello everybody,

It really looks like there is no way to get "all the source code". You can use

document.documentElement.innerHTML

to get the innerHTML of the top-level element (usually html). If a PHP error message is printed before the document, like

<h3>fatal error</h3>
segfault

<html>
    <head>
        <title>bla</title>
        <script type="text/javascript">
            alert(document.documentElement.innerHTML);
        </script>
    </head>
    <body>
    </body>
</html>

the innerHTML would be

<head>
<title>bla</title></head><body><h3>fatal error</h3>
segfault    
        <script type="text/javascript">
            alert(document.documentElement.innerHTML);
        </script></body>

so the error message is still retained, just moved into the body.

edit: documentElement is described here: https://developer.mozilla.org/en/DOM/document.documentElement

henchman
This might be what I'm looking for. However, I don't understand the example code you posted. Is the second block supposed to be the text printed via `alert` in the first block? If so, why would the error message suddenly appear inside the `body` tag?
Franz
Yep, the second code block is the text that was alerted. That's probably Firefox's error correction. Just copy the first block into an empty HTML file and try it out :-)
henchman
This is not the complete source. As you noted, everything that's not between `<html>` and `</html>` doesn't get included. Lachlan's answer seems to be a much better solution.
MatrixFrog
+1  A: 

You can get the URL with var URL = document.location.href and then navigate to "view-source:" + URL.

Now you can fetch the whole source code (viewsource is the id of the body):

var code = document.getElementById('viewsource').innerHTML;

The problem is that the source code is formatted as HTML markup. So you have to run something like PHP's strip_tags() and htmlspecialchars_decode() on it to get the raw text back.

For example, line 1 should be the doctype and line 2 should look like:

&lt;<span class="start-tag">HTML</span>&gt;

So after strip_tags() it becomes:

&lt;HTML&gt;

And after htmlspecialchars_decode() we finally get expected result:

<HTML>

The source is not passed through the DOM parser, so you can view invalid HTML too.
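Since strip_tags() and htmlspecialchars_decode() are PHP functions and an extension runs JavaScript, here is a minimal sketch of the same two steps in plain JavaScript. The helper names stripTags and decodeEntities are made up for this illustration and are not part of any Firefox API. The tag regex is deliberately simple and assumes no ">" appears inside attribute values, which seems to hold for the highlighting spans shown above.

```javascript
// Hypothetical helpers for illustration only; not part of any Firefox API.

// Remove anything that looks like an element tag, e.g. <span class="start-tag">.
// Escaped markup such as &lt;HTML&gt; is left alone because it contains no "<".
function stripTags(markup) {
  return markup.replace(/<\/?[a-zA-Z][^>]*>/g, "");
}

// Decode the five basic named entities plus numeric character references.
// A single pass is used, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
function decodeEntities(text) {
  const named = { lt: "<", gt: ">", quot: '"', apos: "'", amp: "&" };
  return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown named entities are kept as they are.
    return Object.prototype.hasOwnProperty.call(named, body) ? named[body] : match;
  });
}

// The example line from above, as it appears in the view-source markup.
const highlighted = '&lt;<span class="start-tag">HTML</span>&gt;';
const escaped = stripTags(highlighted);
const raw = decodeEntities(escaped);
console.log(escaped);
console.log(raw);
```

In a real extension it is probably simpler to let the browser do both steps by reading textContent, as suggested in the comments.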

Sagi
Hmmm... sounds pretty good. Is the entire code wrapped in an element with ID `viewsource` or why are you doing it that way? And what do you mean by "formatted"? Are the entities escaped?
Franz
Think of it as normal HTML code. The body id is viewsource. I've added an example of how it looks. I hope you have some ideas for how to load this page (you can do it with a hidden iframe, for example).
Sagi
Or you could just use `.textContent` instead.
Eli Grey
@Eli: Huh? @Sagi: Ah, thanks for the explanation. I'll try this tonight.
Franz
Franz: You don't need all of that. Just use `document.getElementById('viewsource').textContent`
Eli Grey
@Eli Grey - Thanks. I verified and it works. However, comments are stripped.
Sagi
I'll post it as an answer then that you can choose.
Eli Grey
+2  A: 

You will need a XUL browser object to load the content into.

Load the "view-source:" version of your page into the browser object, in the same way as the "View Page Source" menu does. See function viewSource() in chrome://global/content/viewSource.js. That function can load from the cache or not.

Once the content is loaded, the original source is given by:

var source = browser.contentDocument.getElementById('viewsource').textContent;
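Putting those steps together, a rough sketch might look like the following. It assumes the XUL browser element offers loadURI() and fires a capturing "load" event once the page is ready; neither call is shown in this answer, so treat both as assumptions to check against viewSource.js. The function names toViewSourceUrl and fetchOriginalSource are invented for this example.

```javascript
// Hypothetical helper: build the view-source: URL for a page.
function toViewSourceUrl(url) {
  return "view-source:" + url;
}

// Hypothetical helper: load the view-source version of `url` into `browser`
// (assumed to be a XUL <browser> element) and hand the original source to
// `callback`. The callback receives null if no element with the id
// "viewsource" is found.
function fetchOriginalSource(browser, url, callback) {
  function onLoad() {
    // Only react once, then clean up the listener.
    browser.removeEventListener("load", onLoad, true);
    var body = browser.contentDocument.getElementById("viewsource");
    callback(body ? body.textContent : null);
  }
  // Assumption: load events from the content document are observed with capture.
  browser.addEventListener("load", onLoad, true);
  // Assumption: loadURI() starts navigation in the browser element.
  browser.loadURI(toViewSourceUrl(url));
}
```

Usage would be along the lines of fetchOriginalSource(browser, content.location.href, function (source) { ... }), where browser is whatever browser element your extension owns.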

Serialize a DOM Document
This method will not get the original source, but may be useful to some readers.

You can serialize the document object to a string. See Serializing DOM trees to strings in the MDC. You may need to use the alternate method of instantiation in your extension.

That article talks about XML documents, but it also works on any HTML DOMDocument.

var serializer = new XMLSerializer();
var source = serializer.serializeToString(document);

This even works in a web page or the firebug console.

Lachlan Roche
This looks pretty complete, too. What happens if the XHTML is broken due to some error, though?
Franz
The DOM parser will already have dealt with broken HTML, so the serializer will not see the broken source.
Lachlan Roche
That would probably be bad then? Does the `document` variable have the property `textContent`, too?
Franz
Your edit looks veeery interesting. If this works out, this should be it.
Franz
@Franz did this work out?
Lachlan Roche
Haven't yet had time to check. I will do so, though, before the bounty runs out. Don't worry ;)
Franz
I feel really stupid now. I can't get the browser class to work. How can I create that kind of object?
Franz
"view source" creates it via XUL, see 'chrome://global/content/viewSource.xul'
Lachlan Roche
I'm experimenting with this solution and it seems to be working perfectly so far! Thank you Lachlan! @Franz I would think that creating a new one (`document.createElement('browser')`) should work, but you can also just put it in your main overlay XUL: `<browser id="invisibleBrowser" collapsed="true"/> <!-- This is never actually shown. It's just used for getting the raw source of HTML pages -->` and then of course, in your js file: `var browser = document.getElementById('invisibleBrowser')`
MatrixFrog
A: 

The first part of Sagi's answer, but use document.getElementById('viewsource').textContent instead.

Eli Grey
A: 

More in line with Lachlan's answer, but there is a discussion of the internals here that gets quite in depth, going into the C++ code.

http://www.mail-archive.com/[email protected]/msg05391.html

and then follow the replies at the bottom.

Daniel