What's a good book about web crawling?
I'm searching for a good book that discusses the problems of developing a web crawler in Java, not the typical "search engine optimization" books. What books can you recommend? ...
I'm writing a multi-threaded Java web crawler. From what I understand of the web, when a user loads a web page the browser requests the first document (e.g., index.html), and as it receives the HTML it finds other resources that need to be included (images, CSS, JS) and asks for those resources concurrently. My crawler is only requestin...
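To mimic the browser, one approach is to fetch the first document, extract the embedded resource URLs, and hand each one to a thread pool. A minimal sketch, assuming a hypothetical start URL; the regex-based extraction is a simplification, a real crawler should use an HTML parser. Java code:

import java.io.*;
import java.net.*;
import java.util.concurrent.*;
import java.util.regex.*;

public class ResourceFetcher {
    // Crude pattern for src/href attributes; good enough to illustrate the flow.
    private static final Pattern RES = Pattern.compile("(?:src|href)=[\"']([^\"']+)[\"']");

    public static void main(String[] args) throws Exception {
        URL page = new URL("http://www.example.com/index.html"); // hypothetical start page
        String html = read(page.openStream());

        ExecutorService pool = Executors.newFixedThreadPool(4); // like a browser's parallel connections
        Matcher m = RES.matcher(html);
        while (m.find()) {
            final URL res = new URL(page, m.group(1)); // resolve relative URLs against the page
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        read(res.openStream()); // fetch concurrently; store or parse as needed
                        System.out.println("fetched " + res);
                    } catch (IOException e) {
                        System.out.println("failed  " + res + ": " + e.getMessage());
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    private static String read(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) != -1; ) out.write(buf, 0, n);
        in.close();
        return out.toString("UTF-8");
    }
}

The pool size here plays the role of the browser's per-host connection limit; the HTML of the first page is still fetched serially, as a browser does.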
It seems like Google can index certain sites or forums (I can't name any offhand, as it's been months since I last saw one) where, on accessing them, you are prompted to register or log in. How would I make my site open for Google to index while keeping a regular login for everyone else? ...
I'm trying to find any dead links on a website using wget. I'm running: wget -r -l20 -e robots=off --spider -S http://www.example.com which recursively checks that each link on the page exists and retrieves the headers. I am then parsing the output with a simple script. I would like to know which page wget retrieved a given ...
I would like to automatically detect Google and other crawlers and log them into my ASP.NET website. Has anyone found a reliable way to do this? The login part is easy; reliably detecting them is the real issue. Regards. ...
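Google's documented advice for verifying Googlebot is a reverse DNS lookup on the requesting IP followed by a forward lookup on the resulting host name. A sketch of those two lookups in Java (the question is ASP.NET, but the same pair of lookups translates directly); the IP below is a made-up example taken from a request's remote address. Java code:

import java.net.InetAddress;

public class CrawlerCheck {
    // True if the IP reverse-resolves to googlebot.com/google.com and that
    // host name forward-resolves back to the same IP (forward-confirmed DNS).
    static boolean isGooglebot(String ip) {
        try {
            String host = InetAddress.getByName(ip).getCanonicalHostName();
            if (!host.endsWith(".googlebot.com") && !host.endsWith(".google.com")) return false;
            return InetAddress.getByName(host).getHostAddress().equals(ip);
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isGooglebot("66.249.66.1")); // hypothetical crawler IP
    }
}

Checking the User-Agent string alone is not reliable, since anyone can send "Googlebot"; the DNS round trip is what makes the check trustworthy, so cache its result per IP.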
Is it possible to find all the pages and links on ANY given website? I'd like to enter a URL and produce a directory tree of all links from that site. I've looked at HTTrack, but that downloads the whole site and I simply need the directory tree. Thanks, Jonathan ...
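If no existing tool fits, a small crawler that only records URLs (instead of saving files) gets you the tree. A minimal sketch, assuming the jsoup library on the classpath and a hypothetical start URL. Java code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.HashSet;
import java.util.Set;

public class LinkTree {
    static Set<String> seen = new HashSet<String>();

    public static void main(String[] args) throws Exception {
        crawl("http://www.example.com/", 0, 3); // hypothetical start URL, max depth 3
    }

    // Prints each newly discovered link indented by depth; nothing is written to disk.
    static void crawl(String url, int depth, int maxDepth) {
        if (depth > maxDepth || !seen.add(url)) return;
        for (int i = 0; i < depth; i++) System.out.print("  ");
        System.out.println(url);
        try {
            Document doc = Jsoup.connect(url).get();
            for (Element a : doc.select("a[href]")) {
                String next = a.absUrl("href");
                if (next.startsWith("http://www.example.com")) // stay on the target site
                    crawl(next, depth + 1, maxDepth);
            }
        } catch (Exception e) {
            // unreachable or non-HTML page; skip it
        }
    }
}

Note this still has to download each HTML page once to read its links; the saving over HTTrack is that images and other assets are never fetched and nothing is stored.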
Hello everyone, I am using VSTS 2008 + C# + .NET 3.5. I want to find an open source tool which crawls all the web pages of a web site; any pages on other domains that are linked from this site should be skipped (I only need pages for this specific domain). I want to store each crawled web page in a local file di...
Possible Duplicate: How to write a crawler? I am interested in learning about how web crawlers and other similar "robots" work. Does anyone have any recommendations for reading material (preferably free and online)? Note: I am not interested in building a web crawler, I'm just interested in how they work. ...
I would like to create a crawler in PHP that would give me a list of all the pages on a specific domain (starting from the homepage, www.example.com). How can I do this in PHP? I don't know how to recursively find all the pages on a website starting from a specific page while excluding external links. ...
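The usual shape is a worklist loop: pop a URL, record it, push every same-host link you find. A sketch of that loop, shown in Java for consistency with the rest of this thread (the same structure ports to PHP with file_get_contents plus DOMDocument); jsoup and the example host are the assumptions here. Java code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import java.net.URL;
import java.util.*;

public class DomainLister {
    public static void main(String[] args) throws Exception {
        String host = "www.example.com";
        Deque<String> queue = new ArrayDeque<String>();
        Set<String> pages = new LinkedHashSet<String>();
        queue.add("http://" + host + "/");
        while (!queue.isEmpty()) {
            String url = queue.poll();
            if (!pages.add(url)) continue; // already visited
            try {
                for (Element a : Jsoup.connect(url).get().select("a[href]")) {
                    String next = a.absUrl("href");
                    // follow only links on the same host, dropping #fragments
                    if (!next.isEmpty() && new URL(next).getHost().equals(host))
                        queue.add(next.split("#")[0]);
                }
            } catch (Exception e) { /* skip pages that fail to load or parse */ }
        }
        for (String p : pages) System.out.println(p);
    }
}

Using an explicit queue instead of recursion avoids blowing the stack on large sites, and the visited set is what keeps the crawl from looping forever.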
Hello. I need to write a standalone application which will "browse" an external resource. Is there a lib in C# which automatically handles cookies and supports JavaScript (though JS is not required, I believe)? The main goal is to keep the session alive and submit forms, so I could pass a multistep registration process or "browse" a web site after ...
I'm tossing around a few ideas for travel search engines and I'm wondering how these sites get their source data. Do they scrape all the content from airline homepages? This seems like an enormous job given the number of airlines etc. out there. Is there some API or web service standard that every airline conforms to? Am I going to h...
Hello, does anyone know of a way of creating a spider that acts as if it has JavaScript enabled? PHP Code: file_get_contents("http://www.google.co.uk/search?hl=en&q=".$keyword."&start=".($x*10)."&sa=N") would retrieve the output of that page. If you used, PHP Code: file_get_contents("http://www.facebook.com/somethi...
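On the Java side, HtmlUnit is a headless browser that actually executes a page's JavaScript before you read the DOM, which is exactly this use case. A minimal sketch, assuming HtmlUnit 2.x on the classpath (the option-setter names vary a little between versions) and a hypothetical target URL. Java code:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class JsSpider {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        client.getOptions().setJavaScriptEnabled(true);             // run the page's scripts
        client.getOptions().setThrowExceptionOnScriptError(false);  // tolerate broken JS in the wild
        HtmlPage page = client.getPage("http://www.example.com/");  // hypothetical target
        System.out.println(page.asXml()); // the DOM *after* JavaScript has executed
        client.close();
    }
}

file_get_contents only gives you the raw HTML before any script runs; a headless browser like this (or a driven real browser) is what gets you the post-JavaScript page.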
Hi everyone, I want to know how I can crawl PDF files that are served on the internet using Nutch-1.0 over the http protocol. I am able to do it on local file systems using the file:// protocol, but not over http. ...
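In Nutch 1.0 this usually comes down to two config changes: include the PDF parsing plugin, and stop the URL filter from excluding .pdf URLs. Roughly, and please check against your own conf/ files since the plugin list below just follows the 1.0 defaults, in conf/nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

Then, in conf/crawl-urlfilter.txt (or regex-urlfilter.txt), remove pdf from the suffix-exclusion line so URLs ending in .pdf are actually fetched. The file:// crawl likely worked because your local filter and plugin setup already allowed it.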
Hi everyone all over the world. Background: I am a final year Computer Science student. I've proposed my Final Double Module Project, a Plagiarism Analyzer, using Java and MySQL. The Plagiarism Analyzer will: scan all the paragraphs of an uploaded document; analyze what percentage of each paragraph was copied, and from which website; high...
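At its core this is: split the document into sentences, search the web for each sentence verbatim, and score each paragraph by how many of its sentences were found. A rough sketch of that skeleton; searchHits() and loadDocument() are hypothetical stand-ins for whatever search API and upload handling you end up with. Java code:

import java.util.*;

public class PlagiarismSketch {
    // Hypothetical helper: returns URLs where this exact phrase was found,
    // e.g. by calling a web search service with the phrase in quotes.
    static List<String> searchHits(String phrase) {
        return Collections.emptyList(); // wire up a real search API here
    }

    public static void main(String[] args) {
        String[] paragraphs = loadDocument(); // hypothetical: the uploaded document's paragraphs
        for (String para : paragraphs) {
            String[] sentences = para.split("(?<=[.!?])\\s+");
            int copied = 0;
            Map<String, Integer> sources = new HashMap<String, Integer>();
            for (String s : sentences) {
                List<String> hits = searchHits(s);
                if (!hits.isEmpty()) {
                    copied++;
                    for (String url : hits)
                        sources.put(url, sources.containsKey(url) ? sources.get(url) + 1 : 1);
                }
            }
            System.out.printf("%d%% copied, likely sources: %s%n",
                    100 * copied / Math.max(1, sentences.length), sources.keySet());
        }
    }

    static String[] loadDocument() { return new String[] { "Example paragraph. Second sentence." }; }
}

The hard engineering problems are the ones hidden behind searchHits(): search APIs rate-limit and cap quoted queries, so a real analyzer batches queries and caches results in the database.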
Right now, I can crawl regular pages using urllib2:

import urllib2
import random

request = urllib2.Request('http://stackoverflow.com')
request.add_header('User-Agent', random.choice(agents))  # agents: a predefined list of User-Agent strings
response = urllib2.urlopen(request)
htmlSource = response.read()
print htmlSource

However, I would like to simulate a POST (or fake a session) so that I can go into Facebook ...
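The general recipe is: install a cookie jar, POST the login form fields once, then reuse the same client so the session cookie rides along on later requests. A sketch of that flow in Java (in the Python above, the equivalents are urllib2.HTTPCookieProcessor and urlencoded POST data); the login URL and field names are made up. Java code:

import java.io.*;
import java.net.*;

public class SessionPost {
    public static void main(String[] args) throws Exception {
        // Cookie jar shared by all HttpURLConnections in this JVM,
        // so the session cookie from the login survives into later requests.
        CookieHandler.setDefault(new CookieManager());

        URL login = new URL("http://www.example.com/login"); // hypothetical login endpoint
        HttpURLConnection conn = (HttpURLConnection) login.openConnection();
        conn.setDoOutput(true); // switches the request to POST
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        String form = "email=" + URLEncoder.encode("me@example.com", "UTF-8")
                    + "&pass=" + URLEncoder.encode("secret", "UTF-8"); // hypothetical field names
        OutputStream out = conn.getOutputStream();
        out.write(form.getBytes("UTF-8"));
        out.close();
        conn.getResponseCode(); // sends the request; the server sets the session cookie

        // This follow-up GET automatically carries the cookie.
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL("http://www.example.com/home").openStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null; ) System.out.println(line);
    }
}

For a site like Facebook specifically, an official API is the sane route; scripted logins tend to break on hidden form tokens and violate the terms of service.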
I looked at the Heritrix documentation website, and they listed a Python .ARC file reader. However, it is a 404 Not Found when I click on it: http://crawler.archive.org/articles/developer%5Fmanual/arcs.html Does anyone else know of a Heritrix ARC reader that uses Python? (I asked this question before, but closed it due to inaccuracy.) ...
I often have to work with fragile legacy websites that break in unexpected ways when logic or configuration is updated. I don't have the time or the knowledge of the system needed to create a Selenium script. Besides, I don't want to check a specific use case; I want to verify every link and page on the site. I would like to create an a...
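One lightweight approach is to crawl the site and issue a cheap HEAD request for every discovered link, flagging anything that doesn't come back 2xx/3xx. A sketch of just the checking step; the URL list is assumed to come from a crawl like the ones sketched earlier in this thread. Java code:

import java.net.HttpURLConnection;
import java.net.URL;

public class LinkVerifier {
    // Returns the HTTP status for a URL via a HEAD request, or -1 on failure.
    static int check(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            return conn.getResponseCode();
        } catch (Exception e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        String[] urls = { "http://www.example.com/", "http://www.example.com/missing" }; // from a crawl
        for (String u : urls) {
            int status = check(u);
            if (status < 200 || status >= 400) System.out.println("BROKEN " + status + " " + u);
        }
    }
}

HEAD avoids downloading page bodies, though a few servers mishandle it; falling back to GET on a 405 response is a common refinement.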
Hello guys, I have a contact form where the email address is actually visible in the source, because I'm using a CGI file to process it. My concern is mail crawlers, and I was wondering if this is a no-go and I should switch to another, more secure form. Or are there some tricks to "confuse" the crawlers? Thanks for your ideas. ...
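A common low-tech trick is to emit the address as HTML character entities: browsers render it normally, while naive harvesters grepping the source for name@domain patterns miss it. It's mitigation, not security, since a determined crawler can decode entities; keeping the address server-side is the only real fix. A sketch of the encoding step. Java code:

public class EmailObfuscator {
    // Encodes every character as a decimal HTML entity, e.g. 'a' -> "&#97;".
    static String entityEncode(String email) {
        StringBuilder sb = new StringBuilder();
        for (char c : email.toCharArray()) sb.append("&#").append((int) c).append(';');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Paste the output into the page source in place of the plain address.
        System.out.println(entityEncode("me@example.com"));
    }
}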
Hi, I am moving a bunch of sites to a new server and, to ensure I don't miss anything, want to be able to give a program a list of sites and have it download every page/image on them. Is there any software that can do this? I may also use it to download a copy of some WordPress sites, so I can just upload static files (some of my WP s...
Is there a simple .NET wrapper for Firefox or Chrome that I could use to implement a web crawler and other web stuff? I might also need form-posting functionality. ...