



I am working on screen scraping, and want to retrieve the source code of a particular page.

How can I achieve this with JavaScript? Please help me.

As a security measure, JavaScript running in a browser can't read files from other domains. There may be workarounds, but I'd consider a different language for this task.

If you absolutely need to use JavaScript, you could load the page source with an Ajax request.

Note that with JavaScript, you can only retrieve pages that are located on the same domain as the requesting page.
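To make "same domain" concrete: browsers generally compare the full origin (scheme, host, and port), not just the domain name. Here is a minimal sketch of that check using the standard URL class. The function name isSameOrigin and the example.com URLs are made up for illustration.

```javascript
// Return true when `target` has the same origin (scheme, host, and port)
// as `page`. Relative targets are resolved against `page` first.
function isSameOrigin(target, page) {
  var pageUrl = new URL(page);
  var targetUrl = new URL(target, pageUrl);
  return targetUrl.origin === pageUrl.origin;
}

// Hypothetical URLs, for illustration only.
var page = "http://www.example.com/index.html";
var sameSite = isSameOrigin("/about.html", page);                   // true
var otherHost = isSameOrigin("http://other.example.org/", page);    // false
var otherPort = isSameOrigin("http://www.example.com:8080/", page); // false
```

Only the first kind of request can be read directly from script; the other two need a proxy or the target server's cooperation.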

You could simply use XMLHttpRequest (Ajax) to hit the required URL; the HTML response from the URL will be available in the responseText property. If it's not the same domain, some older browsers show users an alert saying something like "This page is trying to access a different domain. Do you want to allow this?" Current browsers instead block access to the response unless the target server permits it through CORS.
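A minimal sketch of the XMLHttpRequest approach, assuming the page is served from the same origin. The name fetchPageSource and the optional XhrCtor argument are my own additions (the latter only so the function can be tried outside a browser); neither comes from the answer above.

```javascript
// Fetch the raw HTML of a same-origin page with XMLHttpRequest.
// `XhrCtor` is a hypothetical injection point for testing; in a browser
// you would omit it and the built-in XMLHttpRequest is used.
function fetchPageSource(url, onDone, XhrCtor) {
  var Ctor = XhrCtor || XMLHttpRequest;
  var xhr = new Ctor();
  xhr.open("GET", url, true); // true = asynchronous
  xhr.onreadystatechange = function () {
    if (xhr.readyState !== 4) return; // 4 = DONE
    if (xhr.status >= 200 && xhr.status < 300) {
      onDone(null, xhr.responseText); // page source as a string
    } else {
      onDone(new Error("Request failed with status " + xhr.status), null);
    }
  };
  xhr.send(null);
}
```

In a page you would call it as fetchPageSource("/some/page.html", function (err, html) { ... }), where the path is whatever same-origin page you are after.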


Using jQuery:

<script src="" ></script>
$.get("", function(response) { alert(response) });
You can't request a page outside of your domain this way; you have to do it via a proxy, e.g. $.get('')
JavaScript can be used, as long as you grab whatever page you're after via a proxy on your domain:

<script src="/js/jquery-1.3.2.js"></script>
$.get("", function(response) {
  // response holds the page source returned by the proxy
});
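The proxy idea can be sketched as follows: the browser asks a script on your own domain, and that script fetches the remote page server-side. The "/proxy" path and the "url" query parameter are hypothetical; use whatever your own proxy script expects.

```javascript
// Build the same-origin URL that asks a server-side proxy to fetch
// `targetUrl`. Both `proxyPath` and the `url` parameter name are
// assumptions for illustration.
function buildProxyUrl(proxyPath, targetUrl) {
  return proxyPath + "?url=" + encodeURIComponent(targetUrl);
}

var proxied = buildProxyUrl("/proxy", "http://example.com/a?b=1");
// proxied is "/proxy?url=http%3A%2F%2Fexample.com%2Fa%3Fb%3D1"

// In the browser, with jQuery loaded, you would then do something like:
// $.get(proxied, function (response) { /* use the page source */ });
```

Encoding the target matters: without encodeURIComponent, the target's own query string would be mixed into the proxy's.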
A simple way to start is jQuery's load():

$("#links").load("/Main_Page #jq-p-Getting-Started li");

More at jQuery Docs

Another, much more structured way to do screen scraping is to use YQL (Yahoo Query Language). It returns the scraped data structured as JSON or XML.
Let's scrape

select * from html where url=

will give you a JSON array (the format I chose) like this

The beauty of this is that you can use projections and where clauses, which ultimately gets you the scraped data in structured form and only the data you need (so much less bandwidth over the wire).

select * from html where url="" and

will get you

Once you write your query, it generates a URL for you:

'2F%2Fdiv%2Fh3%2Fa'%0A%20%20%20%20&format=json&callback=cbfunc

in our case.
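If you would rather assemble such a URL yourself, here is a rough sketch. The endpoint is passed in as an argument because it is not shown above, and the placeholder used below is not a real service. The "q" parameter name is my assumption; "format" and "callback" are taken from the generated URL. The query text is only an illustration.

```javascript
// Build a YQL-style request URL from a query string.
// `endpoint` is supplied by the caller; the `q` parameter name is an
// assumption, while `format` and `callback` appear in the generated URL.
function buildYqlUrl(endpoint, query, callbackName) {
  var params = [
    "q=" + encodeURIComponent(query),
    "format=json"
  ];
  if (callbackName) {
    params.push("callback=" + encodeURIComponent(callbackName));
  }
  return endpoint + "?" + params.join("&");
}

// Illustrative query and placeholder endpoint (both hypothetical).
var query = "select * from html where url=\"http://example.com\" and xpath='//div/h3/a'";
var url = buildYqlUrl("https://yql.example.invalid/endpoint", query, "cbfunc");
```

encodeURIComponent leaves the single quotes alone and turns each slash into %2F, which is the same pattern seen in the generated URL.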

So ultimately you end up doing something like this

$.getJSON(theAboveUrl, function(titleList) {
  // $.getJSON is asynchronous, so the parsed data arrives here
});

and play with it.

Beautiful, isn’t it?
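To "play with it", you need to dig the titles out of the response. The sketch below assumes the response looks like { query: { results: { a: [...] } } }, with each entry carrying the link text in a "content" field. That shape is my assumption, since the sample output is not shown above, so check it against what you actually receive. extractTitles is a made-up helper name; cbfunc is the callback named in the generated URL.

```javascript
// Pull link titles out of a YQL-style JSON response.
// Assumed shape (not shown in the post): { query: { results: { a: [...] } } },
// where each entry has a `content` field holding the link text.
function extractTitles(data) {
  var results = data && data.query && data.query.results;
  if (!results || !results.a) return [];
  // A single match may arrive as one object rather than an array.
  var links = Array.isArray(results.a) ? results.a : [results.a];
  return links.map(function (link) {
    return typeof link === "string" ? link : link.content;
  });
}

// JSONP callback named in the generated URL (callback=cbfunc).
function cbfunc(data) {
  var titleList = extractTitles(data);
  console.log(titleList);
}
```

The empty-array fallback keeps the callback from throwing when the query matches nothing.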


When using many of the methods listed above, you will find that sites like YouTube, for example, will block your code, either by returning error messages instead of the page's HTML or by preventing you from accessing the HTML entirely if you try to access a series of pages or the same page multiple times. Recently I tried to scrape the HTML from my "view friends" page, ";view=friends", and got error messages instead of the HTML containing my list of friends. Yet the code worked perfectly for other web pages and sites.
