Are there any good books or academic papers on web scraping or web spiders?

+1  A: 

If you're looking for a general survey, I would recommend this article.

For something spicier, although still at survey level, a nice recent article from "Data & Knowledge Engineering" is here; it specifically addresses "focused web crawlers" as opposed to classic/generic ones.
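To make the focused versus generic distinction concrete, here is a minimal sketch, not taken from either article. It assumes a toy in-memory link graph (`PAGES`) in place of real HTTP fetches and a made-up keyword-overlap score (`relevance`); all names are hypothetical. A generic crawler would visit links in plain breadth-first order, while this one pulls the next URL from a priority queue ordered by how on-topic the linking page was.

```python
import heapq

def relevance(text, topic_terms):
    """Toy score: fraction of topic terms that appear as words in the text."""
    words = set(text.lower().split())
    return sum(1 for term in topic_terms if term in words) / len(topic_terms)

def focused_crawl(pages, seed, topic_terms, limit=10):
    """Visit pages best-first, ranked by the relevance of the page linking to them.

    `pages` maps a URL to (text, outgoing_links) and stands in for the web.
    """
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    order = []
    while frontier and len(order) < limit:
        _, url = heapq.heappop(frontier)
        text, links = pages[url]
        order.append(url)
        score = relevance(text, topic_terms)
        for link in links:
            if link in pages and link not in seen:
                seen.add(link)
                # Links found on on-topic pages are explored first.
                heapq.heappush(frontier, (-score, link))
    return order

# Hypothetical link graph: URL -> (page text, outgoing links)
PAGES = {
    "seed": ("web crawler overview", ["cooking", "crawlers"]),
    "cooking": ("recipes and food", ["recipes"]),
    "crawlers": ("focused web crawler design", ["spiders"]),
    "recipes": ("pasta sauce", []),
    "spiders": ("web crawler politeness", []),
}

visit_order = focused_crawl(PAGES, "seed", ["web", "crawler"])
print(visit_order)
```

In this toy graph, a breadth-first crawler would reach "recipes" before "spiders"; the focused version visits "spiders" first because it was linked from an on-topic page. Real focused crawlers typically use trained classifiers rather than keyword overlap.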

Is this the kind of reference you're looking for, or are you looking for foundational papers (i.e. ones that are probably a bit out of date by now but are widely quoted because, in their time, they provided significant breakthroughs or innovation)?

Alex Martelli
Yes, I am looking more for research papers, or other sources that talk about different techniques. I have built spiders before, but I want to expand my knowledge in this area.
gpow
It looks like the two URLs point to the same article?
sunqiang
@sunqiang, copy-and-paste accident, thanks, fixed!
Alex Martelli
A: 

As to extracting web page contents, you may find http://portal.acm.org/citation.cfm?id=775152.775182 and http://portal.acm.org/citation.cfm?id=775047.775134 helpful.

DOM-based content extraction of HTML documents: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.57.9196&rep=rep1&type=pdf

Discovering informative content blocks from Web documents: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.103.7769&rep=rep1&type=pdf

ZelluX
Great, but you need to pay for those.
gpow
@gpow You can search for the paper title on scholar.google.com, and you will find free download links.
ZelluX
A: 

http://wwwconference.org/www2008/papers/fp865.html - the best paper at WWW 2008.

ton4eg
This is by far the most interesting paper I have come across.
gpow
A: 

A couple of books on spidering:

Bill the Lizard