I have to parse a page that uses JavaScript functions to display its content, so none of the data appears in its source code. But if I save the page to my system, the saved copy shows everything in its source. Please suggest a way to parse that page.

A: 

Un-possible! :p

Unless you have a JavaScript interpreter inside PHP that runs the scripts and builds the resulting HTML page.

Tor Valamo
A: 

See: How do you screen scrape ajax pages?

micahwittman
A: 

If your actual page content is generated by some Javascript code, you'll have to have that JS code interpreted to get the content -- and that's not quite that simple.

Using a JavaScript interpreter from PHP (like Spidermonkey, maybe via the Spidermonkey PECL extension) might be an idea... But it will probably not work if the JS code relies on any functionality that's provided by the browser -- and that's probably the case.


Another idea might be to launch a real browser so it renders the page and, when that's done, fetch the HTML displayed by the browser.

This could be done using, for instance, Selenium RC -- but as it requires launching an actual browser, it requires a machine with a graphical interface (i.e. not a "server"), and takes lots of time...

Still, if you don't have too many pages to scrape, that might be a solution -- and it's certainly the way that will get you a rendering that's as close as possible to a browser's... as you'll be using one ^^
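As a rough illustration of the "real browser" idea, here is a minimal sketch in Python. It assumes the Selenium WebDriver bindings (the successor to Selenium RC) and Firefox with its driver are installed; the function name `fetch_rendered_html` and the `make_driver` parameter are hypothetical names introduced only for this example.

```python
def fetch_rendered_html(url, make_driver=None):
    """Load `url` in a real browser and return the DOM as HTML after scripts have run."""
    if make_driver is None:
        # Assumption: Selenium and a Firefox driver are available on this machine.
        # Imported lazily so the function can be defined without Selenium installed.
        from selenium import webdriver
        make_driver = webdriver.Firefox

    driver = make_driver()
    try:
        driver.get(url)            # the browser fetches the page and executes its JS
        return driver.page_source  # serialized DOM, including JS-generated content
    finally:
        driver.quit()              # always close the browser, even on errors
```

If the page fills in its data asynchronously after loading, the DOM may still be incomplete when `page_source` is read, so some kind of wait would probably be needed before fetching it.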

Pascal MARTIN
@developer: your browser contains a JavaScript interpreter, which is why you can see the output of the interpreted JS code in the browser and in the page you save from it. The cURL library only deals with the raw server response (HTML, for example), not with the DOM as modified by the JavaScript delivered in the HTTP response. You have to programmatically analyze the JS code or run it through an interpreter and analyze the output.
micahwittman
@developer: if the content of your page is generated by some JS code, you need to interpret that JS code; otherwise, you won't get the content. curl will only get you the HTML+JS content that's generated by the server, and will absolutely not interpret the JS code -- which means the JS-generated content will not be there. If you can't use anything that'll interpret JS code, it will be quite hard to get the content generated by that JS code ^^
Pascal MARTIN
OK, but do you think wget will help me store this page in a variable or something? (This is what someone just told me.)
developer
curl, wget, or whatever browsing/downloading tool you want will get you the HTML code sent by the server. Since the JS code is executed on the client (inside the browser), using wget or curl will not change a thing.
Pascal MARTIN
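
To make the point from these comments concrete, here is a small sketch in Python using only the standard library. The `SERVER_HTML` string is a made-up stand-in for what curl or wget would download, and `ContentText` / `content_text` are hypothetical helpers written for this example: the container the script is supposed to fill is simply empty in the server response.

```python
from html.parser import HTMLParser

# Hypothetical server response: the visible data is only created by the script.
SERVER_HTML = """<html><body>
<div id="content"></div>
<script>
document.getElementById("content").innerHTML = "<p>Price: 42</p>";
</script>
</body></html>"""


class ContentText(HTMLParser):
    """Collect the text found inside <div id="content">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while we are inside the target div
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == "content":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def content_text(html):
    """Return the text inside the content div of `html`, without running any JS."""
    parser = ContentText()
    parser.feed(html)
    return "".join(parser.text).strip()


# The parser never executes the <script>, so the div stays empty.
print(repr(content_text(SERVER_HTML)))
```

Feeding the same helper the HTML saved from a browser, where the script has already run and the div is filled in, would return the text, which matches what the question describes.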