views:

403

answers:

5

I am creating a bookmarklet button that, when the user clicks it in their browser, will scrape the current page and extract some values from it, such as the price, item name, and item image.

These fields will vary, meaning the logic for getting these values will be different for each domain ("amazon" and "ebay", for example).

My questions are:

  • Should I use JavaScript to scrape this data and then send it to the server?
  • Or should I just send the URL to my server and use .NET code to scrape the values there?
  • Which is the better approach, and why? What are the advantages and disadvantages of each?

Watch this video and you will understand exactly what I want to do: http://www.vimeo.com/1626505

+2  A: 

If you want to pull information from another site for use in your own site (written in ASP.NET, for example), then you'll typically do this on the server side so that you have a rich language for processing the results (e.g. C#). In .NET you'd do this via a WebRequest object.
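A minimal sketch of that approach (the URL and the regex pattern here are placeholders, not selectors for any real site):

```csharp
// Fetch a page server-side with .NET's WebRequest and pull one value
// out with a regex. Each domain (Amazon, eBay, ...) would need its own pattern.
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class Scraper
{
    static void Main()
    {
        WebRequest request = WebRequest.Create("http://www.example.com/item/123");
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();
            // Hypothetical markup; real sites need sturdier parsing.
            Match m = Regex.Match(html, @"<span id=""price"">([^<]+)</span>");
            if (m.Success)
                Console.WriteLine("Price: " + m.Groups[1].Value);
        }
    }
}
```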

The primary use of client-side processing is to use JavaScript to pull information to display on your site. An example would be the scripts provided by the Weather Channel to show a little weather box on your site, or very simple actions such as adding a page to favorites.

UPDATE: Amr writes that he is attempting to recreate the functionality of some popular screen-scraping software, which would require quite sophisticated processing. Amr, I'd consider creating an application that uses the IE browser object to display web pages - it is quite simple. You could then just pull the InnerHTML (I think - it has been a few years since I implemented an IE-object-based program) to retrieve the contents of the page and do your magic. You could, of course, use a WebRequest object (just handing it the URL used in the browser object), but that wouldn't be very efficient, as it would download the page a second time.
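A rough sketch of that browser-object idea, using the WinForms WebBrowser control (which wraps IE); the URL is a placeholder:

```csharp
// Load a page in the IE engine, then read the rendered HTML from the
// document body - no second download needed.
using System;
using System.Windows.Forms;

class BrowserScraper
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentCompleted += (s, e) =>
        {
            // InnerHtml of the body is the page content to parse.
            string html = browser.Document.Body.InnerHtml;
            Console.WriteLine("Got " + html.Length + " characters of HTML");
            Application.ExitThread();
        };
        browser.Navigate("http://www.example.com/item/123");
        Application.Run(); // message loop so the control can finish loading
    }
}
```

(Note that DocumentCompleted fires once per frame on framed pages, so a real program would check e.Url against the target URL first.)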

Is this what you are after?

Mark Brittingham
I think it's not phishing: http://en.wikipedia.org/wiki/Web-scraping_software_comparison
Amr ElGarhy
It's a bookmarklet; this can be easily done, though it can be dangerous in the wrong hands. But check out Magnolia for a great bookmarklet app.
Robert Gould
Thanks Robert. I'm not familiar with bookmarklets or Magnolia. I'll check it out.
Mark Brittingham
What's Magnolia? What's its URL?
Amr ElGarhy
The exact spelling is Ma.gnolia; anyway, google it and it's right there.
Robert Gould
A: 

I would scrape it on the server side, because (I'm a Java guy) I like static languages more than dynamic scripting languages, so maintaining the logic on the backend would be more comfortable for me. On the other hand, it depends on how many items you want to scrape and how complex the logic would be. Perhaps the values are parseable with a single id selector in JavaScript (e.g. document.getElementById('price')), in which case server-side processing could be overkill.
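If the logic does grow beyond a single selector, a backend sketch along these lines keeps it in static-language land (this uses HtmlAgilityPack as one possible .NET parsing library; the id and URL are invented):

```csharp
// Hypothetical server-side parse with HtmlAgilityPack.
// "item-price" is a made-up id; each domain needs its own selector logic.
using HtmlAgilityPack;

class PriceScraper
{
    static void Main()
    {
        HtmlDocument doc = new HtmlWeb().Load("http://www.example.com/item/123");
        HtmlNode price = doc.GetElementbyId("item-price");
        if (price != null)
            System.Console.WriteLine(price.InnerText.Trim());
    }
}
```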

Mork0075
A: 

Bookmarklets are client-side by definition, but you could have the client depend on a server. Your example doesn't provide enough information, though: what do you want to do with the scraped info?

Robert Gould
+1  A: 

If you want to use only JavaScript to do this, you are liable to have a fairly large bookmarklet unless you know the exact layout of every site it will be used on (and even then it will be big).

A common way I have seen this done is to have a web service on your own server that your bookmarklet (which uses JavaScript) redirects to, passing along some parameters such as the URL of the page you are viewing. Your server then scrapes that page and does the work of parsing the HTML for the things you are interested in.
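As a sketch of that setup (example.com and the handler name are invented): the bookmarklet itself can be a one-liner like `javascript:location.href='http://example.com/Scrape.ashx?url='+encodeURIComponent(location.href)`, and the server side can be a simple handler:

```csharp
// Hypothetical ASP.NET handler that receives the page URL from the
// bookmarklet, downloads the page, and parses it server-side.
using System.Net;
using System.Web;

public class ScrapeHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        string url = context.Request.QueryString["url"];
        using (var client = new WebClient())
        {
            string html = client.DownloadString(url);
            // ...parse html for price, item name, image, etc. ...
            context.Response.Write("Scraped " + html.Length + " characters from " + url);
        }
    }

    public bool IsReusable { get { return true; } }
}
```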

A good example is the "Import to Mendeley" bookmarklet, which passes the URL of the page you are visiting to its server where it then extracts information about scientific papers listed on the page and imports them into your collection.

ealdent
A: 

If you include the scraping code in the bookmarklet, your users will have to update their bookmark whenever you add new functionality or bug fixes. Do it server-side and all your users get the new stuff instantly :)

Adam Pope