views:

98

answers:

2
+1  Q: 

crawler vs scraper

Can somebody distinguish between a crawler and a scraper in terms of scope and functionality? Thanks, Nayn

+1  A: 

Crawlers surf the web, following links. An example would be the Google robot that gets pages to index. Scrapers extract values from forms, but don't necessarily have anything to do with the web.

Steven Sudit
Scrapers extract values from HTML, not necessarily forms.
BC
Scrapers extract values from screens, not necessarily HTML. For example, I once used a scraper to extract values from old mainframe forms.
Steven Sudit
+4  A: 

A crawler gets web pages -- i.e., given a starting address (or set of starting addresses) and some conditions (e.g., how many links deep to go, types of files to ignore) it downloads whatever is linked to from the starting point(s).
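To make that concrete, here is a minimal sketch of such a crawler using only the Python standard library. The names (`crawl`, `LinkParser`, `fetch_over_http`), the default depth, the list of ignored suffixes, and the `example.test` pages are all made up for illustration; a real crawler would also need politeness delays, error reporting, and so on.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_over_http(url):
    """Default fetcher: download a page and decode it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_urls, max_depth=2, skip_suffixes=(".jpg", ".png", ".pdf"),
          fetch=fetch_over_http):
    """Breadth-first crawl; returns {url: html} for every page fetched."""
    pages = {}
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # unreachable page: skip it and keep going
        pages[url] = html
        if depth >= max_depth:
            continue  # condition: do not follow links any deeper
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if link in seen or link.lower().endswith(skip_suffixes):
                continue  # already queued, or a file type we ignore
            if not link.startswith(("http://", "https://")):
                continue  # skip mailto:, javascript:, etc.
            seen.add(link)
            queue.append((link, depth + 1))
    return pages


# Offline demonstration with a hypothetical in-memory "site" instead of the network.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/pic.jpg">pic</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/c">C</a>',
    "http://example.test/c": "end of the line",
}
demo_pages = crawl(["http://example.test/"], max_depth=2, fetch=FAKE_SITE.__getitem__)
```

With `max_depth=2` the demo fetches the start page, `/a` and `/b`, but stops before `/c`, and it never requests the `.jpg`. Passing the fetcher in as a parameter is just a convenience so the sketch can be tried without touching a real site.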

A scraper takes pages that have been downloaded [Edit: or, in a more general sense, data that's formatted for display], and (attempts to) extract data from those pages, so that it can (for example) be stored in a database and manipulated as desired.
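As a rough illustration of the scraping half, the sketch below pulls item/price pairs out of an HTML table and stores them in an SQLite database. The page layout, the `prices` table, and every name here (`TableScraper`, `scrape_prices`, `store`) are assumptions invented for the example; a real scraper has to be written against whatever the target pages actually look like.

```python
import sqlite3
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows (only <th>) end up empty and are dropped
                self.rows.append(self._row)
            self._row = None


def scrape_prices(html):
    """Attempt to extract (item, price) pairs; rows that do not fit are skipped."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    records = []
    for row in parser.rows:
        if len(row) != 2:
            continue
        try:
            records.append((row[0], float(row[1].lstrip("$"))))
        except ValueError:
            continue  # price cell was not a number
    return records


def store(records, connection):
    """Save the extracted records so they can be queried later."""
    connection.execute("CREATE TABLE IF NOT EXISTS prices (item TEXT, price REAL)")
    connection.executemany("INSERT INTO prices VALUES (?, ?)", records)
    connection.commit()


# Hypothetical downloaded page, e.g. one of the pages a crawler saved earlier.
SAMPLE_PAGE = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Widget</td><td>$4.50</td></tr>
  <tr><td>Gadget</td><td>$12.00</td></tr>
  <tr><td>Broken row</td></tr>
</table>
"""

records = scrape_prices(SAMPLE_PAGE)
db = sqlite3.connect(":memory:")
store(records, db)
```

Note the "attempts to" part: the malformed third row is silently skipped rather than guessed at, which is usually what you want when the page was formatted for display and not for machines.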

Depending on how you use the result, scraping may well violate the rights of the owner of the information and/or user agreements about use of web sites (crawling violates the latter in some cases as well).

Edit: as mentioned by Steven Sudit, many sites include a file named robots.txt in their root (i.e. having the URL http://server/robots.txt) to specify how (and whether) crawlers should treat that site -- in particular, it can list (partial) URLs that a crawler should not attempt to visit. These can be specified separately per crawler (user-agent) if desired.
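For what it's worth, Python ships a parser for this file in `urllib.robotparser`, so a crawler can check a URL before fetching it. The robots.txt contents and the crawler names `ExampleBot` and `MyCrawler` below are invented for the example; in practice you would load the site's real file with `set_url()` and `read()` instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: one crawler is banned outright, everyone else
# is only asked to stay out of two (partial) URLs.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
Disallow: /tmp
"""


def load_rules(robots_text):
    """Build a rule checker from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Rules under "User-agent: *" apply to any crawler not listed by name.
allowed_home = rules.can_fetch("MyCrawler", "http://server/index.html")
allowed_private = rules.can_fetch("MyCrawler", "http://server/private/data.html")

# The per-crawler section overrides the default for that user-agent.
allowed_for_banned_bot = rules.can_fetch("ExampleBot", "http://server/index.html")

print(allowed_home, allowed_private, allowed_for_banned_bot)
```

Keep in mind that robots.txt is purely advisory: nothing stops a crawler from ignoring it, so honoring it is up to whoever writes the crawler.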

Jerry Coffin
We should probably mention the robots.txt file that tells crawlers where not to crawl.
Steven Sudit
+1 for adding the robots.txt information.
Steven Sudit
@Steven: Oops -- my apologies for misspelling your name.
Jerry Coffin
@Jerry: Don't worry about it.
Steven Sudit