web-crawler

What's a good book about web crawling?

I'm searching for a good book that discusses the problems of developing a web crawler in Java, not the typical "search engine optimization" books. What books can you recommend? ...

How Many Java HttpURLConnections Should I Be Able to Open Concurrently?

I'm writing a multi-threaded Java web crawler. From what I understand of the web, when a user loads a web page the browser requests the first document (e.g., index.html), and as it receives the HTML it finds other resources that need to be included (images, CSS, JS) and asks for those resources concurrently. My crawler is only requestin...
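
A bounded worker pool is the usual way to cap concurrent connections, whatever the language. The question is about Java's HttpURLConnection, but the pattern is language-independent; here is a minimal sketch using Python's standard library, with a placeholder URL list and pool size:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    # Each worker holds at most one open connection at a time,
    # so max_workers caps the number of concurrent connections.
    with urlopen(url, timeout=10) as resp:
        return url, resp.status, len(resp.read())

urls = ["http://www.example.com/"] * 8            # placeholder URL list
with ThreadPoolExecutor(max_workers=4) as pool:   # 4 concurrent connections
    for url, status, size in pool.map(fetch, urls):
        print(url, status, size)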

How do I allow Google to index login-required parts of my site?

It seems like Google can index certain sites or forums (I can't name any offhand, as it's been months since I last saw one) where, when you access them, you are prompted to register or log in. How would I make my site open for Google to index while keeping a regular login for others? ...
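
The approach usually suggested is to serve crawlers the full content while prompting ordinary visitors to log in, keyed off the request's user-agent. A hedged sketch of that test in Python (the handler and helper names are hypothetical; user-agent strings are trivially spoofed, and serving crawlers different content than users can be treated as cloaking, so verify crawler IPs as well):

CRAWLER_TOKENS = ("Googlebot", "bingbot", "Slurp")   # assumed token list

def is_crawler(user_agent):
    return any(token in (user_agent or "") for token in CRAWLER_TOKENS)

def handle_request(environ):
    # environ is a WSGI-style request dict; combine this check with
    # reverse-DNS verification of the visiting IP in practice.
    if is_crawler(environ.get("HTTP_USER_AGENT")):
        return render_full_page()     # hypothetical helper
    return redirect_to_login()        # hypothetical helper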

Getting the referring page from wget when recursively searching

I'm trying to find any dead links on a website using wget. I'm running: wget -r -l20 -erobots=off --spider -S http://www.example.com which recursively checks to make sure each link on the page exists and retrieves the headers. I am then parsing the output with a simple script. I would like to know which page wget retrieved a given ...
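
One workaround, assuming wget's --debug output includes the Referer header it sends while recursing (add -d and capture the log with -o), is to pair each outgoing request with that header. A rough Python sketch of parsing such a log; the ---request begin---/---request end--- markers and the http-only URL reconstruction are assumptions to check against your wget version:

def referers(log_path):
    # Yields (requested_url, referring_page) pairs from a wget -d log.
    pairs, path, host, ref = [], None, None, None
    for line in open(log_path):
        line = line.strip()
        if line.startswith(("GET ", "HEAD ")):
            path = line.split()[1]
        elif line.startswith("Host: "):
            host = line[len("Host: "):]
        elif line.startswith("Referer: "):
            ref = line[len("Referer: "):]
        elif line.startswith("---request end---"):
            if path and host:
                pairs.append(("http://" + host + path, ref))  # assumes http
            path, host, ref = None, None, None
    return pairs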

Automatically Log In the Google Web Crawler

I would like to automatically detect Google and other crawlers and log them into my ASP.NET website. Has anyone found a reliable way to do this? The login part is easy; reliably detecting them is the real issue. Regards. ...
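
Google's documented way to verify Googlebot is a reverse DNS lookup on the visiting IP, followed by a forward lookup to confirm the name resolves back to that IP. A minimal Python sketch of the check; porting it to ASP.NET is a matter of the equivalent DNS calls:

import socket

def is_real_googlebot(ip):
    # Reverse lookup: the host must be under googlebot.com or google.com.
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    # Forward-confirm: the name must resolve back to the same IP.
    try:
        return socket.gethostbyname(host) == ip
    except socket.gaierror:
        return False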

How to find all links / pages on a website

Is it possible to find all the pages and links on ANY given website? I'd like to enter a URL and produce a directory tree of all links from that site. I've looked at HTTrack, but that downloads the whole site and I simply need the directory tree. Thanks Jonathan ...
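
Absent a ready-made tool, a breadth-first crawl that records which links each page contains is enough to build the tree. A rough Python sketch (single-threaded, same-host only, with naive href extraction; a real crawler should use an HTML parser):

import re
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

def link_tree(start):
    host = urlparse(start).netloc
    tree, seen, queue = {}, {start}, deque([start])
    while queue:
        page = queue.popleft()
        try:
            html = urlopen(page, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue  # skip unreachable pages
        links = {urljoin(page, h) for h in re.findall(r'href="([^"#]+)"', html)}
        tree[page] = sorted(links)           # page -> links found on it
        for link in links:
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)               # only follow same-host links
                queue.append(link)
    return tree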

.Net-based web crawler sample

Hello everyone, I am using VSTS 2008 + C# + .Net 3.5. I want to find an open-source tool that crawls all the web pages of a web site; for pages on any other domain linked from this web site, I want to skip crawling them (I only need pages for this specific domain). For each crawled web page, I want to store it into a local file di...
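
The domain restriction and local storage the question asks for are only a few lines on top of a basic crawl loop. A conceptual sketch in Python (the project is C#, where the same logic applies; the domain and output directory are placeholders):

import os
from urllib.parse import urlparse
from urllib.request import urlopen

TARGET = "www.example.com"   # placeholder: the one domain to crawl
OUTDIR = "crawl_output"      # placeholder local directory

def save_page(url):
    if urlparse(url).netloc != TARGET:
        return  # skip pages linked on other domains
    body = urlopen(url, timeout=10).read()
    name = urlparse(url).path.strip("/").replace("/", "_") or "index"
    os.makedirs(OUTDIR, exist_ok=True)
    with open(os.path.join(OUTDIR, name + ".html"), "wb") as f:
        f.write(body)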

Resources on how web crawlers work?

Possible Duplicate: How to write a crawler? I am interested in learning about how web crawlers and other similar "robots" work. Does anyone have any recommendations on reading material (preferably free and online)? Note: I am not interested in building a web crawler; I'm just interested in how they work. ...

Basic web-crawling question: How to create a list of all pages on a website using PHP?

I would like to create a crawler using PHP that would give me a list of all the pages on a specific domain (starting from the homepage: www.example.com). How can I do this in PHP? I don't know how to recursively find all the pages on a website starting from a specific page and excluding external links. ...
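
Whatever the language, the key step is normalizing each discovered href against the current page's URL and comparing hosts; a sketch in Python terms (PHP's parse_url plays the same role as urlparse here, and the domain is a placeholder):

from urllib.parse import urljoin, urlparse

def internal_links(page_url, hrefs, domain="www.example.com"):
    # Resolve relative hrefs, then keep only links on the starting domain.
    absolute = (urljoin(page_url, h) for h in hrefs)
    return [u for u in absolute if urlparse(u).netloc == domain]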

C# library similar to HtmlUnit

Hello. I need to write a standalone application which will "browse" an external resource. Is there a library in C# which automatically handles cookies and supports JavaScript (though JS is not required, I believe)? The main goal is to keep the session alive and submit forms so I can pass a multistep registration process or "browse" the web site after ...
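
What HtmlUnit's session handling boils down to is a cookie jar shared across requests plus urlencoded form posts. A conceptual sketch with Python's standard library (the C# library itself is what's being asked for; URLs and field names are placeholders):

from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener

jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(jar))   # cookies persist in jar

opener.open("http://www.example.com/register/step1")   # picks up session cookie
form = urlencode({"name": "test", "email": "a@b.example"}).encode()
opener.open("http://www.example.com/register/step2", form)  # POST to step 2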

How do travel search engines & aggregators get their source data?

I'm tossing around a few ideas for travel search engines, and I'm wondering how these sites get their source data. Do they scrape all the content from airline homepages? This seems like an enormous job given the number of airlines, etc., out there. Is there some API or web service standard that every airline conforms to? Am I going to h...

PHP crawl - JavaScript enabled

Hello, does anyone know of a way of creating a spider that acts as if it has JavaScript enabled? For example,

file_get_contents("http://www.google.co.uk/search?hl=en&q=".$keyword."&start=".($x*10)."&sa=N")

would retrieve the output of that page. If you used

file_get_contents("http://www.facebook.com/somethi...
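
file_get_contents only ever sees the raw HTML, never the DOM after scripts have run, so the usual workaround is to drive a real browser engine and read the rendered source. A sketch with Python's selenium bindings (assumes Firefox and its driver are installed; from PHP you would shell out to, or talk to, something equivalent):

from selenium import webdriver

driver = webdriver.Firefox()      # assumption: Firefox + driver installed
try:
    driver.get("http://www.example.com/")
    html = driver.page_source     # DOM serialized after JavaScript has run
finally:
    driver.quit()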

How can I crawl PDF files served on the internet using Nutch-1.0 over the http protocol?

Hi everyone, I want to know how I can crawl PDF files served on the internet using Nutch-1.0 over the http protocol. I am able to do it on local file systems using the file:// protocol, but not over http. ...
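
In Nutch 1.0 the usual culprits are the plugin list and the URL filter: parse-pdf has to be included in plugin.includes, and conf/regex-urlfilter.txt must not be excluding .pdf URLs. A sketch of the nutch-site.xml override; the value below is abridged to the relevant plugins, so merge it with the full default from nutch-default.xml:

<property>
  <name>plugin.includes</name>
  <!-- abridged: parse-(text|html|pdf) now includes the PDF parser -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>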

Plagiarism Analyzer (compared against Web Content)

Hi everyone all over the world. Background: I am a final-year student of Computer Science, and I've proposed my final double-module project, a Plagiarism Analyzer, using Java and MySQL. The Plagiarism Analyzer will: Scan all the paragraphs of the uploaded document. Analyze what percentage of each paragraph is copied from which website. High...
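
For the comparison step, once candidate pages have been fetched, a standard sequence-similarity ratio gives a per-paragraph percentage. A minimal illustration in Python (the project is Java, where the same idea applies; a real analyzer would slide a window over the page text rather than compare against the whole page at once):

from difflib import SequenceMatcher

def copied_percentage(paragraph, page_text):
    # Ratio of matching subsequences between the paragraph and page text.
    ratio = SequenceMatcher(None, paragraph.lower(), page_text.lower()).ratio()
    return round(ratio * 100, 1)

print(copied_percentage("The quick brown fox", "the quick brown fox jumps"))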

How to use Python to log into Facebook/Myspace and crawl the content?

Right now, I can crawl regular pages using urllib2:

request = urllib2.Request('http://stackoverflow.com')
request.add_header('User-Agent', random.choice(agents))
response = urllib2.urlopen(request)
htmlSource = response.read()
print htmlSource

However, I would like to simulate a POST (or fake a session) so that I can get into Facebook ...
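
The missing pieces on top of that snippet are a cookie jar, so the session survives across requests, and an urlencoded data argument, which turns the request into a POST. A sketch in the same urllib2 idiom; the login URL and field names are placeholders, and real sites like Facebook layer hidden form tokens on top of this:

import cookielib, urllib, urllib2

jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# Passing a data argument makes urllib2 issue a POST instead of a GET.
form = urllib.urlencode({'email': 'user@example.com', 'pass': 'secret'})
response = opener.open('http://www.example.com/login', form)
htmlSource = response.read()   # later opener.open() calls reuse the cookies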

How to read .ARC files from the Heritrix crawler using Python?

I looked at the Heritrix documentation website, and it lists a Python .ARC file reader; however, the link was a 404 when I clicked on it: http://crawler.archive.org/articles/developer%5Fmanual/arcs.html Does anyone know of a Heritrix ARC reader that uses Python? (I asked this question before, but closed it due to inaccuracy.) ...
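
Absent a maintained library, the v1 ARC record header is simple enough to parse by hand: a space-separated line of URL, IP, archive date, content type, and length, followed by that many bytes of content. A rough reader under that assumption (uncompressed files only; gzipped ARCs would need gzip.open):

def read_arc(path):
    # Yields (url, content_type, body) for each record in a v1 .ARC file.
    with open(path, 'rb') as f:
        while True:
            header = f.readline()
            if not header:
                break
            fields = header.decode('latin-1').split()
            if len(fields) != 5:
                continue  # blank separator lines between records
            url, ip, date, ctype, length = fields
            yield url, ctype, f.read(int(length))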

Automated link-checker for system testing

I often have to work with fragile legacy websites that break in unexpected ways when logic or configuration is updated. I don't have the time or the knowledge of the system needed to create a Selenium script. Besides, I don't want to check a specific use case; I want to verify every link and page on the site. I would like to create an a...
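
A small script that walks the site and records non-2xx responses covers this without Selenium. A sketch of the checking step in Python (feed it the URLs discovered by the usual queue-of-links crawl loop):

from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def broken_links(urls):
    broken = []
    for url in urls:
        try:
            # HEAD avoids downloading bodies; some servers require GET.
            urlopen(Request(url, method="HEAD"), timeout=10)
        except HTTPError as e:
            broken.append((url, e.code))          # e.g. 404, 500
        except URLError as e:
            broken.append((url, str(e.reason)))   # DNS failure, refused, ...
    return broken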

E-mail in the source: a no-go?

Hello guys, I have a contact form where the email is actually accessible in the source, because I'm using a cgi file to process it. My concern is mail crawlers, and I was wondering whether this is a no-go and I should switch to another, more secure form, or whether there are some tricks to 'confuse' the crawlers? Thanks for your ideas. ...
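
One common low-tech trick is to emit the address as HTML character entities: browsers render it normally, but naive harvesters grepping the source for plain addresses miss it (no help against crawlers that decode entities). A small Python generator for the markup:

def entity_encode(addr):
    # "user@example.com" -> "&#117;&#115;&#101;..." character entities
    return "".join("&#%d;" % ord(c) for c in addr)

addr = entity_encode("user@example.com")
print('<a href="mailto:%s">%s</a>' % (addr, addr))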

Best Site Spider?

Hi, I am moving a bunch of sites to a new server, and to ensure I don't miss anything, I want to be able to give a program a list of sites and have it download every page/image on them. Is there any software that can do this? I may also use it to download a copy of some WordPress sites, so I can just upload static files (some of my WP s...
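
wget itself covers this case; something along these lines mirrors a site with its page requisites and rewrites links for offline viewing (flags worth double-checking against your wget version; --adjust-extension was --html-extension before wget 1.12):

wget --mirror --page-requisites --convert-links --adjust-extension \
     -P ./mirror http://www.example.com/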

Is there a .Net wrapper for Firefox or Chrome to crawl webpages?

Is there a simple .Net wrapper for Firefox or Chrome so that I could implement a web crawler and other web functionality? I might also need form-posting capability. ...