I have some JavaScript from a 3rd party vendor that is initiating an image request. I would like to figure out the URI of this image request.

I can load the page in my browser, and then monitor "Live HTTP Headers" or "Tamper Data" in order to figure out the image request URI, but I would prefer to create a command line process to do this.

My intuition is that it might be possible using Python + QtWebKit, but perhaps there is a better way.

To clarify: I might have this (overly simplified code).

<script>
var suffix = magicNumberFunctionIDontHaveAccessTo();
var url = "http://foobar.com/function?parameter=" + suffix;
var img = document.createElement('img');
img.src = url;  // the request fires as soon as src is assigned
document.body.appendChild(img);
</script>

Then once the page is loaded, I can figure out the URL by sniffing the packets. But I can't just figure it out from the source, because I can't predict the outcome of magicNumberFunction...().

Any help would be much appreciated!

Thank you.

A: 

Why can't you just read suffix, or url for that matter? Is the image loaded in an iframe or in your page?

If it is loaded in your page, then this dirty hack may work (substitute document.body with whichever element the script actually appends to):

var ac = document.body.appendChild;
var sources = [];

document.body.appendChild = function (child) {
    if (/^img$/i.test(child.tagName)) {
        sources.push(child.getAttribute('src'));
    }
    // Call the original through the body, or the browser will complain
    // about an unbound native method.
    return ac.call(document.body, child);
};
Paul
The idea is to load the page using a script from the command line; in that case, I wouldn't have the opportunity to modify the JavaScript.
Zachary Burt
OK. I misunderstood.
Paul
+3  A: 

I would pick any one of the many HTTP proxy servers written in Python -- probably one of the simplest ones at the very top of the list -- and tweak it to record all URLs requested (as well as proxy-serve them), e.g. by appending them to a text file -- without loss of generality, call that text file 'XXX.txt'.
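For illustration, here's a minimal sketch of such a logging proxy in Python 2 (the name XXX.txt follows the convention above; everything else is illustrative, and a real proxy would also need to handle POST, HTTPS CONNECT, and errors):

import BaseHTTPServer
import urllib2

class LoggingProxy(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        # In proxy mode the request line carries the absolute URI,
        # so self.path is the full URL the browser asked for.
        log = open('XXX.txt', 'a')
        log.write(self.path + '\n')
        log.close()
        # Fetch the resource on the browser's behalf and relay it back.
        resp = urllib2.urlopen(self.path)
        self.send_response(resp.getcode())
        self.send_header('Content-Type', resp.info().gettype())
        self.end_headers()
        self.wfile.write(resp.read())

BaseHTTPServer.HTTPServer(('localhost', 8080), LoggingProxy).serve_forever()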

Now all you need is a script (sketched below) that:

1. starts the proxy server in question;
2. starts Firefox (or whatever browser; others should work just as well) on your main desired URL, with the proxy in question set as your proxy (see e.g. this SO question for how);
3. waits a bit (e.g. until the proxy's XXX.txt file has not been altered for more than N seconds);
4. reads XXX.txt to extract only the URLs you care about and records them wherever you wish;
5. shuts down the proxy and Firefox processes.
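As a rough sketch of that driver (all names are illustrative; it assumes the proxy above was saved as logging_proxy.py, and note that Firefox in particular wants its proxy set via preferences rather than the environment -- see the linked question):

import os
import subprocess
import time

open('XXX.txt', 'a').close()                  # make sure the log file exists
proxy = subprocess.Popen(['python', 'logging_proxy.py'])

# Firefox must already be configured to use localhost:8080 as its
# HTTP proxy (see the linked question for doing that from a script).
browser = subprocess.Popen(['firefox', 'http://www.example.com/'])

# Wait until XXX.txt has been quiet for 30 seconds.
while time.time() - os.path.getmtime('XXX.txt') < 30:
    time.sleep(5)

for line in open('XXX.txt'):
    print line.strip()                        # keep only the URLs you care about

browser.terminate()
proxy.terminate()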

I think this will be much faster to put in place and make work correctly, for your specific requirements, than any more general solution based on QtWebKit, Selenium, or other "automation kits".

Alex Martelli
+4  A: 

The simplest thing to do might be to use something like HtmlUnit and skip a real browser entirely. Because it uses Rhino, it can evaluate JavaScript and could likely be used to extract that URL.

That said, if you can't get that working, try out Selenium RC and use the captureNetworkTraffic command (which requires that the Selenium instance be started with the option captureNetworkTraffic=true). This will launch Firefox with a proxy configured and then let you pull the request info back out as JSON/XML/plain text. Then you can parse that content and get what you want.

Try out the instant test tool that my company offers. If the data you're looking for is in our results (after you click View Details), you'll be able to get it from Selenium. I know, since I wrote the captureNetworkTraffic API for Selenium for my company, BrowserMob.
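For example, a hedged sketch with the Python RC client (it assumes a client version whose start() accepts browser options and exposes capture_network_traffic; the URL is a placeholder):

from selenium import selenium

sel = selenium("localhost", 4444, "*firefox", "http://www.example.com/")
sel.start("captureNetworkTraffic=true")        # must be set when the session starts
sel.open("/")
traffic = sel.capture_network_traffic("json")  # "xml" and "plain" also work
print traffic                                  # parse this to find the image request
sel.stop()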

Patrick Lightbody
A: 

Use the Firebug Firefox plugin. It will show you all requests in real time, and you can even debug the JS in your browser or step through it.

Elalfer
+1  A: 

Ultimately, I did it in Python, using Selenium RC. This solution requires the Python client files for Selenium RC, and you need to start the Java server first ("java -jar selenium-server.jar").

from selenium import selenium
import unittest
import lxml.html

class TestMyDomain(unittest.TestCase):
    def setUp(self):
        self.selenium = selenium("localhost", \
            4444, "*firefox", "http://www.MyDomain.com")
        self.selenium.start()

    def test_mydomain(self):

        htmldoc = open('site-list.html').read()
        url_list = [link for (element, attribute, link, pos)
                    in lxml.html.iterlinks(htmldoc)]
        omniture_urls = []
        for url in url_list:
            try:
                sel = self.selenium
                sel.open(url)
                sel.select_window("null")
                js_code = '''
                myDomainWindow = this.browserbot.getUserWindow();
                var ret = '';
                for (obj in myDomainWindow) {
                    /* This code grabs the Omniture tracking-pixel img */
                    if ((obj.substring(0, 4) == 's_i_') && (myDomainWindow[obj].src)) {
                        ret = myDomainWindow[obj].src;
                    }
                }
                ret;
                '''
                # Parse/process this however you want.
                omniture_url = sel.get_eval(js_code)
                omniture_urls.append(omniture_url)
            except Exception, e:
                print 'We ran into an error: %s' % (e,)

        # Assert whatever your test actually requires; here, just that at
        # least one tracking-pixel URL was collected.
        self.assertTrue(omniture_urls)


    def tearDown(self):
        self.selenium.stop()

if __name__ == "__main__":
    unittest.main()
Zachary Burt