spider

Looking for a Spider ActiveX Control (for vb6)

The control needs to behave like a Spider/Gantt control, but with a small difference: I need vertical lines on the x-axis. ...

Looping through DirectoryEntry or any object hierarchy - C#

I am currently developing an application that uses the System.DirectoryServices namespace to create a DirectoryEntry object and loop through the entire hierarchy to collect information. I do not know the number of child entries for each DirectoryEntry object in the hierarchy, so I cannot create N nested loops to spider through...
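
The usual way around not knowing the depth is recursion: process one entry, then call the same routine on each of its children. A minimal sketch of the pattern, shown here in Python with a hypothetical Node type standing in for DirectoryEntry (in C# the shape is the same, recursing over entry.Children):

    # Minimal sketch of walking a hierarchy of unknown depth by recursion.
    # Node and its children list are hypothetical stand-ins for DirectoryEntry
    # and DirectoryEntry.Children.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        name: str
        children: list = field(default_factory=list)

    def walk(node, visit):
        visit(node)                   # collect whatever info you need from this entry
        for child in node.children:   # child count is unknown; recursion handles any depth
            walk(child, visit)

    # usage: walk(root, lambda n: print(n.name))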

Website Spidering Auto Detection

is it possible to write code to detect if someone is spidering a website's content? ...
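
It is heuristic at best (polite bots identify themselves in the User-Agent; others can mimic a browser), but request rate per client is the usual first signal. A tiny Python sketch of that idea; the thresholds and names are illustrative:

    # Heuristic sketch: treat a client as a likely spider if it makes more than
    # MAX_HITS requests within WINDOW seconds. Thresholds are illustrative.
    import time
    from collections import defaultdict, deque

    WINDOW, MAX_HITS = 10, 30          # seconds, requests
    hits = defaultdict(deque)

    def looks_like_spider(client_ip):
        now = time.time()
        recent = hits[client_ip]
        recent.append(now)
        while recent and now - recent[0] > WINDOW:
            recent.popleft()
        return len(recent) > MAX_HITS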

is there a good web crawler library available for PHP or Ruby?

is there a good web crawler library available for PHP or Ruby? a library that can do it depth-first or breadth-first... and handle links even when href="../relative_path.html" and a base URL is used. ...
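
For reference, the relative-link handling is mostly URL resolution: join every href against the page URL (or its <base href>, if one is set). A rough breadth-first sketch in Python; the queue-plus-visited-set shape is what carries over to PHP or Ruby:

    # Rough breadth-first crawl sketch: urljoin() resolves href="../relative_path.html"
    # against the current page URL; swap in the <base href> value if the page sets one.
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def crawl(start_url, max_pages=50):
        seen, queue, fetched = {start_url}, deque([start_url]), 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            fetched += 1
            try:
                html = urlopen(url).read().decode("utf-8", "replace")
            except OSError:
                continue
            for href in re.findall(r'href="([^"]+)"', html):   # naive extraction
                absolute = urljoin(url, href)                   # handles ../relative paths
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return seen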

Get a list of URLs from a site

I'm deploying a replacement site for a client but they don't want all their old pages to end in 404s. Keeping the old URL structure wasn't possible because it was hideous. So I'm writing a 404 handler that should look for an old page being requested and do a permanent redirect to the new page. Problem is, I need a list of all the old pa...
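
One half of that is a lookup table consulted by the 404 handler: map each old URL to its new home and issue a 301 when a match is found. A platform-agnostic sketch in Python; the paths and the handle_404 name are illustrative only:

    # Sketch of the 404-handler side: look the old path up in a mapping and
    # answer with a permanent (301) redirect. Paths here are made-up examples.
    REDIRECTS = {
        "/old/about-us.asp": "/about",
        "/old/products.asp": "/products",
    }

    def handle_404(requested_path):
        target = REDIRECTS.get(requested_path.lower())
        if target:
            return 301, {"Location": target}   # permanent redirect to the new page
        return 404, {}                         # genuinely gone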

How to migrate resources from proprietary CMS?

I need to migrate our website from a proprietary CMS that uses active server pages. Is there a tool or technique that will help download the resources from the existing site? I guess I'm looking for a tool that will crawl and scrape the entire site. An additional challenge is that the site uses SSL and is protected with forms-based au...
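
One way in, assuming the forms login is a plain POST of a username and password: keep a cookie-aware session, authenticate once, then pull pages over HTTPS with that session. A sketch using Python's requests library; the URLs and form field names are placeholders:

    # Sketch: log in through the forms-based auth once, then reuse the cookie-bearing
    # session to download pages over SSL. URLs and form field names are placeholders.
    import requests

    session = requests.Session()
    session.post(
        "https://cms.example.com/login.asp",
        data={"username": "me", "password": "secret"},   # whatever the login form expects
    )
    page = session.get("https://cms.example.com/some/page.asp")
    with open("page.html", "w", encoding="utf-8") as f:
        f.write(page.text)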

What's the best way of extracting javascript links in an HTML document?

Hi All, I am writing a small webspider for a website which uses a lot of JavaScript for links:

    <htmlTag onclick="someFunction();">Click here</htmlTag>

where the function looks like:

    function someFunction() {
        var _url;
        ...
        // _url constructed, maybe with reference to a value in the HTML doc
        // and/or a value passed as argume...
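
Short of running a real JavaScript engine, a common fallback is to regex out anything URL-shaped from the onclick handlers and script bodies and resolve it against the page URL; it only catches URLs that appear literally, not ones assembled at runtime like the _url above. A rough Python sketch:

    # Rough fallback: regex out literal URL strings from onclick/script text and
    # resolve them against the page URL. Dynamically built _url values still need
    # a real JS engine or site-specific parsing.
    import re
    from urllib.parse import urljoin

    URL_RE = re.compile(r"""["'](https?://[^"']+|/[^"'\s]+)["']""")

    def extract_js_links(page_url, html):
        return {urljoin(page_url, match) for match in URL_RE.findall(html)}

    # usage: extract_js_links("http://example.com/a", '<a onclick="go(\'/b.html\')">')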

Backlink-reporting website crawler?

What tools are out there to crawl a website and report, for each page, a list of pages within the website that link to it? ...
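
If nothing ready-made turns up, the report is essentially an inverted link map built during a crawl: record, for every page fetched, which pages it links to, then flip the mapping. A small Python sketch of just that bookkeeping (fetching and link extraction assumed to exist elsewhere):

    # Bookkeeping sketch for a backlink report: links_from maps each crawled page to
    # the pages it links to; inverting it gives, for every page, who links to it.
    from collections import defaultdict

    def backlinks(links_from):
        linked_by = defaultdict(set)
        for page, targets in links_from.items():
            for target in targets:
                linked_by[target].add(page)
        return linked_by

    # usage: backlinks({"/a": ["/b", "/c"], "/b": ["/c"]})["/c"] -> {"/a", "/b"}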

Google indexed my test folders on my website :( How do I restrict the web crawlers!

Help, help! Google indexed a test folder on my website which no one but me was supposed to know about :(! How do I restrict Google from indexing links and certain folders? ...
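
The standard mechanism is a robots.txt file at the site root disallowing the folders you don't want crawled; well-behaved crawlers honor it, though it won't remove pages Google has already indexed (that takes a noindex meta tag or Google's URL removal tool). Assuming the folder is /test/, which is a placeholder:

    # robots.txt, served from the web root; "/test/" stands in for the real folder name
    User-agent: *
    Disallow: /test/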

Developing a crawler and scraper for a vertical search engine

I need to develop a vertical search engine as part of a website. The data for the search engine comes from websites of a specific category. I guess for this I need a crawler that crawls several (a few hundred) sites (in a specific business category) and extracts the content and URLs of products and services. Other types of pages may be i...

How to download a webpage every five minutes?

I want to download a list of web pages. I know wget can do this. However, downloading every URL every five minutes and saving them to a folder seems beyond the capability of wget. Does anyone know of some tools in Java, Python, or Perl which accomplish the task? Thanks in advance. ...
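
A short loop in any of those languages will do it; here is a rough Python sketch that fetches each URL in a list, saves a timestamped copy, and sleeps five minutes between rounds (the URL list and output folder are placeholders):

    # Rough sketch: fetch each URL in the list every five minutes and keep
    # timestamped copies. The URL list and output folder are placeholders.
    import time
    import urllib.request
    from datetime import datetime
    from pathlib import Path

    URLS = ["http://example.com/", "http://example.org/page.html"]
    OUT = Path("downloads")
    OUT.mkdir(exist_ok=True)

    while True:
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        for i, url in enumerate(URLS):
            try:
                data = urllib.request.urlopen(url, timeout=30).read()
                (OUT / f"{stamp}-{i}.html").write_bytes(data)
            except OSError:
                pass                   # skip failures this round
        time.sleep(300)                # five minutes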

Getting the referring page from wget when recursively searching.

I'm trying to find any dead links on a website using wget. I'm running: wget -r -l20 -erobots=off --spider -S http://www.example.com which recursively checks to make sure each link on the page exists and retrieves the headers. I am then parsing the output with a simple script. I would like to know which page wget retrieved a given ...

How can I prevent the googlebot from crawling Ajaxified Links?

I've got a bunch of ajaxified links that do things like vote up, vote down, flag a post - standard community moderation stuff. Problem is that the googlebot crawls those links, and votes up, votes down, and flags items. Will adding this to robots.txt prevent the googlebot from crawling those links? Or is there something else I need to...

HTML parser...My recent project needs a web spider..

HTML parser... My recent project needs a web spider: it automatically fetches web content and follows the links it finds recursively... But it also needs to parse the content exactly, e.g. by tag. It has to run on Linux and Windows. Do you know of some open source project that fits these needs? Thanks, or any suggestions. ...
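
If Python is acceptable (it runs on both Linux and Windows), the standard library's html.parser gives tag-level access with no external dependencies; a minimal link-extracting parser as a starting point:

    # Minimal tag-aware link extractor using only the Python standard library
    # (cross-platform, open source).
    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":                        # exact, tag-level handling
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    parser = LinkParser()
    parser.feed('<a href="/next.html">next</a>')
    print(parser.links)                           # ['/next.html']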

How to detect if a visitor is human and not a spider

I am logging every visit to my website and determining if the visitor is human is important. I have searched the web and found many interesting ideas on how to detect if the visitor is human:
- if the visitor is logged in and passed captcha
- detecting mouse events
- detecting if the user has a browser [user agent]
- detecting mouse clicks [...
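
A minimal sketch of a first pass at the user-agent idea from the list above: flag requests whose User-Agent contains a known crawler token or is missing. The token list and function name are illustrative, not exhaustive:

    # Crude user-agent check; real crawlers can spoof this, so it is only one signal.
    BOT_TOKENS = ("bot", "crawler", "spider", "slurp", "curl", "wget")

    def looks_human(user_agent):
        ua = (user_agent or "").lower()
        return bool(ua) and not any(token in ua for token in BOT_TOKENS)

    # usage: looks_human(request_headers.get("User-Agent", ""))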

C# library similar to HtmlUnit

Hello. I need to write a standalone application which will "browse" an external resource. Is there a lib in C# which automatically handles cookies and supports JavaScript (though JS is not required, I believe)? The main goal is to keep the session alive and submit forms so I could pass a multistep registration process or "browse" a web site after ...

Best Site Spider?

Hi, I am moving a bunch of sites to a new server, and to ensure I don't miss anything, I want to be able to give a program a list of sites and have it download every page/image on there. Is there any software that can do this? I may also use it to download a copy of some WordPress sites, so I can just upload static files (some of my WP s...

Python Package For Multi-Threaded Spider w/ Proxy Support?

Instead of just using urllib, does anyone know of the most efficient package for fast, multithreaded downloading of URLs that can operate through HTTP proxies? I know of a few such as Twisted, Scrapy, libcurl, etc., but I don't know enough about them to make a decision, or even if they can use proxies. Anyone know of the best one for my pur...
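
For comparison, even without a dedicated framework, a thread pool plus a proxy-aware HTTP client covers the basic case. A sketch with requests and concurrent.futures; the proxy address and URL list are placeholders:

    # Sketch: multithreaded downloads through an HTTP proxy using requests and a
    # thread pool. The proxy address and URL list are placeholders.
    from concurrent.futures import ThreadPoolExecutor
    import requests

    PROXIES = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}
    URLS = ["http://example.com/", "http://example.org/"]

    def fetch(url):
        try:
            return url, requests.get(url, proxies=PROXIES, timeout=30).status_code
        except requests.RequestException as exc:
            return url, str(exc)

    with ThreadPoolExecutor(max_workers=8) as pool:
        for url, result in pool.map(fetch, URLS):
            print(url, result)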

what do you call a spidering technique where a spider visits all links in the first level, then all links in the second level?

I forgot the name for the case where a web spider first visits all the links it sees on the first level, then visits all the links it sees on the second level, and so on... there is a name for this technique... I forgot... Anyway, this is very exhaustive and obviously inefficient. Is there a better way? I remember reading a paper in su...

Spider that tosses results into mysql

Looking to use Sphinx for site search, but not all of my site is in MySQL. Rather than reinvent the wheel, just wondering if there's an open source spider that easily tosses its findings into a MySQL database so that Sphinx can then index it. Thanks for any advice. ...
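
Failing a ready-made spider, the glue is small once pages are crawled: insert url/title/body rows into a table that Sphinx's sql_query can select from. A sketch of just the storage side using PyMySQL; the table and column names are made up:

    # Sketch of the storage side: push crawled pages into MySQL so Sphinx can index
    # them with a plain sql_query. Table and column names here are made up.
    import pymysql

    conn = pymysql.connect(host="localhost", user="sphinx", password="secret",
                           database="search")
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO pages (url, title, body) VALUES (%s, %s, %s)",
            ("http://example.com/about", "About us", "page text goes here"),
        )
    conn.commit()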