questions about crawler

How would I make a simple URL extractor in Python?

How would I start on a single web page, let's say the root of DMOZ.org, and index every single URL linked from it, then store those links in a text file? I don't want the content, just the links themselves. An example would be awesome. ...

python

hyperlink

crawler
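The question above asks for an example, so here is a minimal sketch using only the Python standard library. It is one possible approach, not a definitive implementation: the names `LinkExtractor`, `extract_links`, and `save_links` are hypothetical, and the sketch assumes that only `<a href="...">` links on the one starting page are wanted (no recursion into linked pages).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/about" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def save_links(url, out_path):
    """Fetch one page and write its links to a text file, one per line."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    links = extract_links(html, url)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(links))
    return links
```

Calling `save_links("http://www.dmoz.org/", "links.txt")` would then write the links of that page to `links.txt` (the output file name is an assumption). Duplicate links are kept as found; wrapping the list in `dict.fromkeys(...)` would be one way to drop repeats while preserving order.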

Algorithm: Determining type of homepage?

I've been thinking about this for a while now, so I thought I would ask for suggestions: I have a crawler which enters the root of some site (could be anything from www.StackOverFlow.com, www.SomeDudesPersonalSite.se or even www.Facebook.com). Then I need to determine what "kind of homepage" I'm visiting. Different types could for in...

algorithm

crawler

heuristics

C# web and ftp crawler library

Hi! I need a library (hopefully in C#!) which works as a web crawler to access HTTP files and FTP files. In principle, I'm happy with reading HTML, but I want to extend it to PDF, Word, etc. I'm happy with starter open source software, or at least any pointers to documentation. Best regards, David ...

c#

web-crawler

crawler

What is a Java web crawler library that allows XPath access and the equivalent of "save as webpage, complete"?

Hi. I don't need to crawl the whole internet, I just need to open a few URLs, extract other URLs, and then save some pages in a way that they can be browsed on disk later. I would like to have some control over which links are downloaded and which are not, using XPath. What library would be appropriate to program that? ...

java

crawler

What is a Ruby web crawler library that allows XPath access and the equivalent of "save as webpage, complete"?

Hi. I don't need to crawl the whole internet, I just need to open a few URLs, extract other URLs, and then save some pages in a way that they can be browsed on disk later. What library would be appropriate to program that? ...

ruby

crawler

PHPCrawl sometimes returns empty handed

I'm using the PHPCrawl class to spider websites and build a list of links. It all works well, if slowly, and I then use the links to perform other tasks. I'm encountering a problem where the first time I run the script it completes with no result, then the next time I run it, it works as expected. It's failing about 30% of the time. I t...

Python web crawling and storing to MySQL

Hi, I have been looking for a few days for a simple solution to this, but I think that at this moment I am still at the beginning :) I need a good web crawler written in Python to store complete pages in a MySQL database. The small system that I am experimenting with now uses PHP Sphider to crawl and store into the database. I need something that works almost ex...

PHP crawler for an ASP.NET site

I want to write a crawler to fetch data from an ASP.NET site which uses JavaScript to do the pagination ...