Hi,

I would like to get the dimensions (coordinates) of all the HTML elements of a webpage as rendered by a browser, that is, the positions they are drawn at. For example: (top-left, top-right, bottom-left, bottom-right).

I could not find this in lxml. Is there any library in Python that does this? I also looked at Mechanize::Mozilla in Perl, but that seems difficult to configure and set up.

I think the best way to do this for my requirement is to use a rendering engine, like WebKit or Gecko.

Are there any Perl/Python bindings available for these two rendering engines? Google searches for tutorials on how to "plug in" to the WebKit rendering engine are not very helpful.

+3  A: 

lxml isn't going to help you here. It parses markup; it isn't concerned with front-end rendering at all.

To accurately work out how something renders, you need to render it. For that you need to hook into a browser, spawn the page and run some JS on the page to find the DOM element and get its attributes.

It's totally possible but I think you should start by looking at how website screenshot factories work (as they'll share 90% of the code you need to get a browser launching and showing the right page).

You may want to still use lxml to inject your javascript into the page.

Oli
Thanks! I looked at WebKit (pywebkitgtk) for rendering, but it currently does not support getting the DOM - http://code.google.com/p/pywebkitgtk/issues/detail?id=13
Bart J
Manipulate the HTML before passing it to the browser. Add in a javascript block to AJAX the correct data back to you.
Oli
Actually, I am trying to find examples of using rendering engines (be it Gecko or WebKit), but a tutorial is almost impossible to find.
Bart J
A: 

The problem is that current browsers don't render things quite the same. If you're looking for the standards compliant way of doing things, you could probably write something in Python to render the page, but that's going to be a hell of a lot of work.

You could use the wxHTML control from wxWidgets to render each part of a page individually to get an idea of its size.

If you have a Mac you could try WebKit. That same article has some suggestions for solutions on other platforms too.

Jon Cage
+1  A: 

I agree with Oli, rendering the page in question and inspecting DOM via JavaScript is the most practical way IMHO.

You might find jQuery very useful here:

$(document).ready(function() {
    var elem = $("div#some_container_id h1");
    var elem_offset = elem.offset();
    /* elem_offset is a plain object with top and left, e.g.
       elem_offset = { top: 140, left: 25 }
    */
    var elem_height = elem.height();
    var elem_width = elem.width();
    /* bottom_right is then
       { x: elem_offset.left + elem_width,
         y: elem_offset.top + elem_height }
    */
});

Related documentation is here.
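
To get all four corners the question asks for, the same arithmetic can be wrapped in a small helper. This is only a sketch: boxCorners is a made-up name, not part of jQuery, and it assumes you pass it the { top, left } object that .offset() returns plus a width and height. Note that .width() and .height() measure the content box only; .outerWidth() and .outerHeight() include padding and border, which is probably closer to what is drawn on screen.

```javascript
// Hypothetical helper (not part of jQuery): derive the four corners of a box
// from its page offset ({ top, left }, the shape jQuery's .offset() returns)
// and its width and height in pixels.
function boxCorners(offset, width, height) {
    var right = offset.left + width;
    var bottom = offset.top + height;
    return {
        topLeft:     { x: offset.left, y: offset.top },
        topRight:    { x: right,       y: offset.top },
        bottomLeft:  { x: offset.left, y: bottom },
        bottomRight: { x: right,       y: bottom }
    };
}

// In a page with jQuery loaded, one might call:
//   var elem = $("div#some_container_id h1");
//   boxCorners(elem.offset(), elem.outerWidth(), elem.outerHeight());

// Stand-alone example with made-up numbers:
var corners = boxCorners({ top: 140, left: 25 }, 300, 40);
console.log(corners.bottomRight); // x is 25 + 300, y is 140 + 40
```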

muhuk
+1  A: 

Yes, Javascript is the way to go:

var allElements=document.getElementsByTagName("*"); will select all the elements in the page.

Then you can loop through this and extract the information you need from each element. Good documentation about getting the dimensions and positions of an element is here.

getElementsByTagName returns a live node list, not an array (so if your JS changes your HTML, those changes will be reflected in the list), so I'd be tempted to build the data into an AJAX post and send it to a server when it's done.
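
A rough sketch of that loop, in case it helps. snapshotBoxes is a hypothetical name of my own, and I'm assuming getBoundingClientRect() is acceptable for your purposes: it reports coordinates relative to the viewport, so you would add the scroll offsets if you need page coordinates. Array.from copies the live list into a plain array first, so later DOM changes don't affect what you collected.

```javascript
// Hypothetical helper: copy tag name and box geometry out of a (possibly live)
// collection of elements into a plain array of plain objects.
function snapshotBoxes(elements) {
    return Array.from(elements).map(function (el) {
        var r = el.getBoundingClientRect(); // viewport-relative rectangle
        return {
            tag: el.tagName,
            left: r.left,
            top: r.top,
            width: r.width,
            height: r.height
        };
    });
}

// In a browser, one might then do:
//   var boxes = snapshotBoxes(document.getElementsByTagName("*"));
//   var payload = JSON.stringify(boxes);
// and send payload in an AJAX POST to whatever endpoint collects the results.
```

Because the result is plain data, JSON.stringify works on it directly, which it would not do usefully on the DOM elements themselves.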

edeverett
A: 

You might consider looking at WWW::Selenium. With it (and Selenium RC) you can drive IE, Firefox, or Safari like a puppet from inside Perl.

Chas. Owens
The reason I am looking to plug directly into a browser's rendering engine is that I have to test at least a million URLs, and I don't think using Selenium, etc. would be very efficient for that. :)
Bart J
ya, it wouldn't :) fortunately you could adapt one of the pyjamas-desktop runtimes to simply create a GUI that did not _actually_ display on-screen (aka "headless" usage). if you really wanted to get serious about resources then creating a headless version of pythonwebkit (not starting up GTK *at all*) would be a good way to go. it would take me about 2 weeks to do the programming: contact me if you're prepared to contract me to do the work (i'm easy to find: google "luke leighton").
A: 

I was not able to find any easy solution (i.e. Java/Perl/Python :) to hook into WebKit/Gecko to solve the above rendering problem. The best I could find was the Lobo rendering engine, written in Java, which has a very clear API that does exactly what I want: access to both the DOM and the rendering attributes of HTML elements.

JRex is a Java wrapper to Gecko rendering engine.

Bart J
A: 

you have three main options:

1) http://www.gnu.org/software/pythonwebkit is webkit-based;

2) python-comtypes for accessing MSHTML (windows only)

3) hulahop (python-xpcom) which is xulrunner-based

you should get the pyjamas-desktop source code and look in the pyjd/ directory for "startup" code which will allow you to create a web browser application and begin, once the "page loaded" callback has been called by the engine, to manipulate the DOM.

you can perform node-walking, and can access the properties of the DOM elements that you require. you can look at the pyjamas/library/pyjamas/DOM.py module to see many of the things that you will need to be using in order to do what you want.

but if the three options above are not enough then you should read the page http://wiki.python.org/moin/WebBrowserProgramming for further options, many of which have been mentioned here by other people.

l.

I haven't tried pythonwebkit (which was released a few days back), but it sure does look promising.
Bart J