views: 646
answers: 5

I'm trying to scrape a YouTube page with Python that has a lot of AJAX in it.

I have to call the JavaScript each time to get the info, but I'm not really sure how to go about it. I'm using the urllib2 module to open URLs. Any help would be appreciated.

+3  A: 

YouTube (and everything else Google makes) has extensive APIs already in place to give you access to just about any data you could possibly want.

Take a look at the YouTube Data API for more information.

I use urllib to make the API requests and ElementTree to parse the returned XML.
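
For example, a minimal sketch of that approach, assuming the v2 GData feed endpoint (the search term and parameters here are just placeholders):

```python
# Sketch: query the YouTube Data API (GData v2) and parse the Atom XML
# response with ElementTree. Feed URL and parameters are illustrative.
import urllib
import urllib2
from xml.etree import ElementTree

params = urllib.urlencode({'q': 'python scraping', 'max-results': 5})
url = 'http://gdata.youtube.com/feeds/api/videos?' + params

xml_data = urllib2.urlopen(url).read()
root = ElementTree.fromstring(xml_data)

# Entries live in the Atom namespace; print each video's title.
ATOM = '{http://www.w3.org/2005/Atom}'
for entry in root.findall(ATOM + 'entry'):
    print(entry.find(ATOM + 'title').text)
```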

Gabriel Hurley
+2  A: 

The main problem is that you're violating the TOS (terms of service) of the YouTube site. YouTube engineers and lawyers will do their professional best to track you down and make an example of you if you persist. If you're happy with that prospect, then, on your head be it -- technically, your best bets are python-spidermonkey and Selenium. I wanted to put the technical hints on record in case anybody in the future has needs like the ones your question's title indicates, without the legal issues you clearly have if you continue in this particular endeavor.
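
For the record, the Selenium route looks roughly like this, a sketch using the current Python bindings (a real browser executes the page's JavaScript, so the AJAX-generated markup shows up in the page source):

```python
# Sketch: drive a real browser with Selenium so the page's JavaScript runs,
# then read the rendered HTML. Requires the selenium package and Firefox.
from selenium import webdriver

driver = webdriver.Firefox()
try:
    driver.get('http://www.youtube.com/results?search_query=python')
    html = driver.page_source  # AJAX-generated content is present here
    print(len(html))
finally:
    driver.quit()
```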

Alex Martelli
I recommend that people here try to be **programmers** and not **lawyers.** There must be other pages / sites for lawyers for sure. I understand that some of **you** might be on Mr. G's or other big brother's payroll. So if someone asks a technical question, please provide a technical answer first, then if you really have to, throw a short line of legal advice. **Let's keep this a relevant site**. Just a friendly hint to all of you, don't deviate from the question at hand.
VN44CA
Heh -- downvoting is anything **but** "a friendly hint" -- it's the most hostile thing you can _do_ on SO;-). You may realize how _ineffective_ it is against a leaderboard netizen: every day I max out (at 200 rep) from upvotes, so the downvote's -2 doesn't matter (like the upvotes after the first 20-or-so) to _me_ ... but it still costs _you_ 1 rep. Hostile **and** ineffective -- perfect complement for the utter stupidity of your rant (esp. when the only time _you_ ever got as many as two upvotes for an answer was exactly for expressing _your_ opinion on the legality of scraping!-).
Alex Martelli
I come here when I need help; I don't really count my votes or care about vote counts. Sorry to have offended you, though! Even though I stayed 100% professional and subjective! I'll buy you a beer next time I run into you at one of the upcoming events and we'll call it even. :) I wish Stack Overflow had private messages, though. Ciao
VN44CA
Since we're playing Internet lawyers, I wasn't aware that violating ToS was illegal (regardless of what Facebook has to say). So the worst they could do to you for scraping their site against their ToS would be banning, yes?
Ryan Ginstrom
A: 

As suggested, you should use the YouTube API to access the data made available legitimately.

Regarding the general question of scraping AJAX, you might want to consider the Scrapy framework. It provides extensive support for crawling and scraping web sites, and it can be combined with python-spidermonkey to follow JavaScript links.
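
For instance, a bare-bones spider just to show the shape of the framework, using the current Scrapy API (the start URL and CSS selector are placeholders):

```python
# Sketch of a minimal Scrapy spider; start URL and selector are placeholders.
import scrapy

class VideoSpider(scrapy.Spider):
    name = 'videos'
    start_urls = ['http://www.youtube.com/results?search_query=python']

    def parse(self, response):
        # Pull whatever fields you need out of the served HTML.
        for title in response.css('h3 a::text').getall():
            yield {'title': title}
```

Run it with something like `scrapy runspider video_spider.py -o videos.json`.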

ars
A: 

Here is how I would do it: install Firebug on Firefox, enable the Net panel in Firebug, and click the desired link on YouTube. Watch what happens and which pages are requested, and find the requests that are responsible for the AJAX part of the page. You can then use urllib or Mechanize to fetch that URL.

If you CAN pull the same content this way, then you have what you are looking for; just parse the content. If you CAN'T pull the content this way, that suggests the requested page is checking user login credentials, session info, or other header fields such as HTTP_REFERER, etc. Then you might want to look at something more extensive like Scrapy. I would suggest that you always follow the simple path first. Good luck and happy "responsible" scraping! :)
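
A sketch of that "fetch the AJAX URL directly" step using mechanize; the endpoint URL and Referer below are made-up placeholders, so substitute whatever Firebug's Net panel shows you:

```python
# Sketch: request the AJAX endpoint found in Firebug's Net panel directly,
# sending headers similar to what the browser sent. URLs are placeholders.
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [
    ('User-Agent', 'Mozilla/5.0'),
    ('Referer', 'http://www.youtube.com/watch?v=VIDEO_ID'),  # placeholder
]

response = br.open('http://www.youtube.com/some_ajax_endpoint')  # placeholder
content = response.read()
print(content[:200])  # then parse the JSON/HTML fragment as needed
```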

VN44CA
A: 

You could sniff the network traffic with something like Wireshark and then replay the HTTP calls via a scraping framework that is robust enough to deal with AJAX, such as Scrapy.
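
Roughly, replaying a captured call in Scrapy could look like the sketch below; the URL and header values are placeholders for whatever the capture shows:

```python
# Sketch: replay an HTTP call captured with Wireshark from inside a Scrapy
# spider, copying the relevant headers. URL and header values are placeholders.
import scrapy

class ReplaySpider(scrapy.Spider):
    name = 'replay'

    def start_requests(self):
        yield scrapy.Request(
            'http://www.youtube.com/some_ajax_endpoint',  # from the capture
            headers={
                'Referer': 'http://www.youtube.com/watch?v=VIDEO_ID',
                'X-Requested-With': 'XMLHttpRequest',
            },
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('Got %d bytes', len(response.body))
```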

Gaia