I would like to get the HTML code from a page with PHP. So I do this:

$url = 'http://en.wikipedia.org/wiki/New_York_City';
$html = file_get_html($url);

The problem is that Wikipedia doesn't send the <script> tags in response to the PHP request, so the JavaScript is missing. I guess that's because Wikipedia sees that the "requester" doesn't have JavaScript enabled, so it leaves the <script> tags out.

How can I let Wikipedia know that my PHP is JavaScript enabled?

I heard about stream context, but I don't know how to set JavaScript enabled for it.

A: 

You could use an Iframe.

You could also use something like jQuery to grab the page (or certain parts of the page) onto your website.

mmundiff
Wow, some people really do believe jQuery is the answer to everything. He does specify that he is looking for a PHP solution.
Tom Castle
Also, you can't scrape another page with JavaScript alone... http://en.wikipedia.org/wiki/Same_origin_policy
Domenic
+1  A: 

It looks like the file_get_html() function is stripping away the <script> blocks, because I tried to request GET /wiki/Main_Page HTTP/1.1 from Fiddler without any request headers, and it did return the <script> blocks in the response.

Daniel Vassallo
It's doing the same with file_get_contents. Could it depend on the user-agent?
Davide
@DavidDev: I tried in Fiddler without the user-agent header, and I still received the `<script>` blocks. It could in theory serve different content according to the user-agent, but I doubt wikipedia is doing that. It would complicate their caching processes.
Daniel Vassallo
Hmm I have no idea. I'll try on another webserver. Anyway, thank you!
Davide
+2  A: 

This should work

$url = 'http://en.wikipedia.org/wiki/New_York_City';
$html = file_get_contents($url);

Tested it on my local PHP server.

Sergiy Byelozyorov
It doesn't work for me. I always get the HTML without any JavaScript. So maybe it depends on the server configuration? I'm testing it on this free hosting service: http://www.freewebhostingarea.com/phpinfo-default_variables.html <- phpinfo()
Davide
A: 

Thanks to symcbean, here's the solution.

I added:

ini_set('user_agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9');

And now it's sending the correct script block.
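
Since the question mentioned stream contexts: the same User-Agent can also be set for a single request instead of globally through ini_set. This is only a minimal sketch, not something tested against Wikipedia in this thread. The helper name make_browser_context and the 10-second timeout are my own assumptions for illustration; stream_context_create and the 'user_agent' HTTP context option are standard PHP.

```php
<?php
// Hypothetical helper (not from the thread): builds a stream context that
// sends the given User-Agent header for requests that use this context only.
function make_browser_context(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds; arbitrary value chosen for this sketch
        ],
    ]);
}

// Same User-Agent string as in the ini_set() call above.
$ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9';
$context = make_browser_context($ua);

// Usage (needs network access and allow_url_fopen enabled):
// $html = file_get_contents('http://en.wikipedia.org/wiki/New_York_City', false, $context);
// var_dump(stripos($html, '<script') !== false); // true if a <script> tag came back
```

Note that this does not make PHP "JavaScript enabled". It only changes the header the server sees, and the scripts in the returned HTML are never executed by PHP.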

;)

Davide