I am working on a new project that involves fetching a webpage (using PHP and cURL), parsing the HTML and JavaScript out of it, and then handling the data in the results.

Basically, I hit a brick wall when the site uses JavaScript to fetch its data via AJAX. In that case, the initial data does not appear in the fetched page unless the JavaScript is run in a browser.

Are there any PHP libraries for this? (I suspect not, but I could be wrong.)

I would really rather build this as a server-based solution; otherwise I am forced to build an application for this using the Mozilla and/or IE runtime libraries, which kind of defeats the purpose.
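To make the problem concrete, here is a minimal sketch (in Python rather than PHP, purely for illustration) of what a static fetch-and-parse sees. The page markup, the `results` id and the `loadResults` function are all hypothetical stand-ins, not taken from any real site.

```python
from html.parser import HTMLParser

# Hypothetical page: the results container stays empty until the script runs.
PAGE = """
<html><body>
<div id="results"></div>
<script>
  // In a browser this would fill #results via AJAX.
  loadResults("/data.json");
</script>
</body></html>
"""


class ResultsExtractor(HTMLParser):
    """Collects text inside <div id="results"> and the raw source of <script> tags.

    Simplification: nested <div> elements inside the container are not handled.
    """

    def __init__(self):
        super().__init__()
        self.in_results = False
        self.in_script = False
        self.results_text = []
        self.script_source = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "results":
            self.in_results = True
        elif tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_results = False
        elif tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_source.append(data)
        elif self.in_results:
            self.results_text.append(data)


def extract(page):
    """Return (text found in the results container, script source as inert text)."""
    parser = ResultsExtractor()
    parser.feed(page)
    parser.close()
    return "".join(parser.results_text).strip(), "".join(parser.script_source).strip()


results, script = extract(PAGE)
print(repr(results))            # the container holds no data
print("loadResults" in script)  # the script is visible only as text, never executed
```

The parser can see the script's source, but nothing ever runs it, so the container it is supposed to fill stays empty.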

A: 

You could take a look at Rhino. It runs on Java; I have never heard of a PHP port.

Are you obligated to run the actual JavaScript?

RageZ
+3  A: 

You can run a JavaScript engine such as Rhino on a server.

Here are a few alternatives:

  • Rhino (Java based)
  • V8 (Used by Chrome, C++)
  • SquirrelFish (C++)

While these can run JS, I'm not sure that what you're doing is the best approach. However, since you haven't specified the purpose of your program, I can't offer any suggestions in that regard.

Jani Hartikainen
Not sure about the others, but Rhino won't be able to run most client-side JavaScript on its own, because it doesn't implement the DOM.
Ben Dunlap
A: 

To be honest, you will have a hard time using just a JS engine, because you also have to recreate the environment a browser gives its scripting engine, such as the DOM and window objects. If you are running on a Windows server, you could fairly easily use the IE COM objects to load and execute the web page, access the DOM programmatically and pull the contents back out. If your server runs Linux and/or you want to use Mozilla, I unfortunately have no experience with that.

But really what are you trying to do?

tyranid
+3  A: 

You'll have to go one step further than Rhino if you want to execute real live web pages, because the JavaScript on those pages will expect to be able to use objects that are native to a browser environment. A server-side JavaScript engine like Rhino won't have those objects.

John Resig (creator of jQuery) started a project called Env.js a couple of years ago; it might be what you're looking for, but I suspect you'll have a tough time getting consistent results from a wide variety of web pages. Here's his initial blog post about it:

http://ejohn.org/blog/bringing-the-browser-to-the-server/

Some similar projects are named in that post's comments.

Ben Dunlap
+2  A: 

Previously asked here: headless internet browser?

At Mozilla we get this question a lot. There's no good answer. What you want is a software library that implements pretty much everything a browser needs to do (at least as far as networking, JavaScript, HTML parsing, and the DOM), but with no display.

The closest thing I know of is HTMLUnit (in Java).

Jason Orendorff
This I have already done with cURL, but without the JavaScript handling.
Talvi Watia
A: 

Haha, never mind. This will work:

PHP DOM LIBRARY

Of course, it would be nice to have an easy way to map each JavaScript function to the DOM element it represents. Part of jQuery may work for that. I'll have to see after some further tests.

Talvi Watia
You specifically said: "The webpage would be parsed for HTML and any javascript would need to be interpreted into a DOM model". This requires a JavaScript interpreter. If you wanted an XML parser there are numerous options in PHP including the DOM lib.
bucabay
Thanks for the vote down, but this project is beyond just XML parsing. I was looking for a possible solution that did not force me to use the Mozilla or IE DOM runtimes and build an application. Mapping JavaScript through the PHP DOM will allow me to deal with the AJAX now.
Talvi Watia
+6  A: 

You will need:

  • one JavaScript interpreter
  • one DOM Level 2 Core and HTML implementation
  • 500g of non-standard but commonly-used DOM extensions
  • a pinch of DOM Level 2 Style (which might mean also a CSS interpreter and layout engine)
  • yoghurt pots, round-ended scissors and sticky-back plastic

Once you have assembled your components (remember to get a grown-up to help you with the sandboxing), you'll find what you have is essentially indistinguishable from a web browser.

"Java is not part of the shell build on the server. V8/SquirrelFish is C++ code I would need to convert to PHP."

Porting a JS engine to PHP would be a huge task, and the resulting performance would likely be horrible. You can't even really get away with an approximate JavaScript implementation any more, since so many pages use hideously complex libraries like jQuery to do everything, and those require in-depth JS support.

I don't think you're going to be able to do this purely in PHP. You'll have to hook up Java/Rhino/HTMLUnit or a proper web browser like Mozilla. If your hosting environment doesn't give you the flexibility you need to compile and deploy that sort of thing, you'd have to move to a better hosting setup with a shell (preferably VPS).

If you can avoid this unpleasantness some other way, by special-casing known pages' AJAX access, do that.
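As a rough sketch of that special-casing (in Python for illustration; the same idea applies with cURL in PHP): find the URL the page's script requests, call it directly, and parse the response yourself. The endpoint, the `q` and `page` parameters, and the `items`/`title` response shape below are hypothetical; you would read the real ones off the target page's network traffic.

```python
import json
from urllib.parse import urlencode

# Hypothetical AJAX endpoint used by the target page's script.
AJAX_ENDPOINT = "https://example.com/ajax/results"


def build_ajax_url(query, page=1):
    """Build the URL the page's JavaScript would have requested (assumed parameters)."""
    return AJAX_ENDPOINT + "?" + urlencode({"q": query, "page": page})


def parse_results(body):
    """Extract titles from a JSON response body (assumed response shape)."""
    payload = json.loads(body)
    return [item["title"] for item in payload.get("items", [])]


# Stand-in for the body a real HTTP request to the endpoint would return.
SAMPLE_BODY = '{"items": [{"title": "First"}, {"title": "Second"}]}'

url = build_ajax_url("php curl")
titles = parse_results(SAMPLE_BODY)
print(url)
print(titles)
```

This only works for pages whose requests you have inspected by hand, which is exactly the trade-off: no JavaScript engine is needed, but each site has to be handled individually.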

bobince
the yoghurt pots, round-ended scissors and sticky-back plastic worked!
Talvi Watia
+1  A: 

I know you have said no Java, but for reference you might be interested in Qt Jambi. It includes an implementation of WebKit which you can run in headless mode.

Joel
+1  A: 

All these answers seem to presume that JavaScript emulation in PHP is not possible, but there is a near-fully-compliant open-source PHP JavaScript emulator here:

http://www.sitepoint.com/blogs/2006/01/19/j4p5-javascript-for-php5/

Combined with Env.js, you could get pretty close to a full server-side JS execution solution.

Nick Lockwood
not bad, I'll check it out.
Talvi Watia