spider

How to check if a page is displaying a specific <img> tag.

What is the best way to determine if a page on a website is REALLY displaying a specific img tag like this: <img src=http://domain.com/img.jpg>? A simple string comparison is easy to fool using HTML comments <!-- -->. Even if the HTML tag exists, it could be deleted with JavaScript. It could also be obscured by placing an image over...
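One partial way to handle the comment case, as a minimal sketch: an HTML parser reports commented-out markup separately from real tags, so an img tag inside <!-- --> is not counted. The names ImgFinder and has_img are hypothetical, and this does not cover removal by JavaScript or an image placed on top, which would likely need a rendering browser.

```python
from html.parser import HTMLParser


class ImgFinder(HTMLParser):
    """Records whether an <img> with the given src appears outside comments."""

    def __init__(self, src):
        super().__init__()
        self.src = src
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Commented-out markup is routed to handle_comment, never here.
        if tag == "img" and dict(attrs).get("src") == self.src:
            self.found = True


def has_img(html, src):
    """Hypothetical helper: True if a live <img src=...> tag is in the markup."""
    finder = ImgFinder(src)
    finder.feed(html)
    finder.close()
    return finder.found
```

This only inspects the static markup that the server returned.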

Recommended library for scraping html data.

I need to process quite a bit of [fairly] arbitrary html data. The data thankfully can be broken into about twelve different templates. My current plan is to build a filter for each of the templates that allows me to extract the required data sans irrelevant content. Problem is I'm not sure what the ideal tool for the job is. I was h...

Concurrent downloads - Python

Hi folks, the plan is this: I download a webpage, collect a list of images parsed from the DOM, and then download these. After this I would iterate through the images in order to evaluate which image is best suited to represent the webpage. The problem is that the images are downloaded one by one, and this can take quite some time. It would be gre...
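A minimal sketch of one way to download several images at once, assuming a thread pool from the standard library is acceptable. The names fetch and download_all, the worker count, and the timeout are illustrative assumptions, not part of the question.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch all URLs concurrently.

    Returns a dict mapping each URL to its bytes, or to the exception
    raised while fetching it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the downloads overlap.
        futures = {url: pool.submit(fetch_one, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed image should not stop the rest
                results[url] = exc
    return results
```

The fetch_one parameter exists so a different downloader (or a stub for testing) can be swapped in without touching the pool logic.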

Prepare your site images for Google image search indexing

Hi, I'm trying to understand what I can do to make my site reachable by Google image search spiders. I like last.fm's solution, and I thought of using a technique like the one their staff use to let Google find artists' images on their pages. When I'm looking for an artist and search for it on Google image search, as often as not I find an image f...

How to create a web crawler/spider/robot?

Is there a way to make a web robot like websiteoutlook.com does? I need something that searches the internet for URLs only...I don't need links, descriptions, etc. What is the best way to do this without getting too technical? I guess it could even be a cronjob that runs a PHP script grabbing URLs from Google, or is there a better way...

Is there any way of making JSON data readable by a Google spider?

Is it possible to make JSON data readable by a Google spider? Say for instance that I have a JSON feed that contains the data for an e-commerce site. This JSON data is used to populate a human-readable page in the user's browser. (I.e., the translation from JSON data to human-displayed page is done inside the user's browser; not my choic...

Storing URLs while Spidering

I created a little web spider in Python which I'm using to collect URLs. I'm not interested in the content. Right now I'm keeping all the visited URLs in a set in memory, because I don't want my spider to visit URLs twice. Of course that's a very limited way of accomplishing this. So what's the best way to keep track of my visited URLs?...
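One possible alternative to the in-memory set, sketched under the assumption that a local SQLite file is acceptable: the primary key on the URL column rejects duplicates, and the data survives restarts. The class name VisitedStore and the table layout are hypothetical.

```python
import sqlite3


class VisitedStore:
    """Set of visited URLs backed by SQLite (pass a file path to persist it)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)"
        )

    def add(self, url):
        """Record a URL; returns True if it was new, False if already seen."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO visited (url) VALUES (?)", (url,)
        )
        self.conn.commit()
        return cur.rowcount == 1

    def __contains__(self, url):
        row = self.conn.execute(
            "SELECT 1 FROM visited WHERE url = ?", (url,)
        ).fetchone()
        return row is not None
```

A spider could call add() before fetching and skip the URL whenever it returns False.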

JDOM 1.1: hyphen is not a valid comment character

I'm using tagsoup to clean some HTML I'm scraping from the internet, and I'm getting the following error when parsing through pages with comments: The data "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " is not legal for a JDOM com...

Worried about spiders repeatedly hitting high-demand page

Due to some rather bizarre architectural considerations I've had to set up something that really ought to run as a console application as a web page. It does the job of writing a large variety of text files and XML feeds from our site data for various other services to pick up, so obviously it takes a little while to run and is pretty pro...

Wikipedia text download

Hi, I am looking to download the full Wikipedia text for my college project. Do I have to write my own spider to download this, or is there a public dataset of Wikipedia available online? To give you an overview of my project, I want to find the interesting words in a few articles I am interested in. But to find these interesting w...

Getting Started with Python: Attribute Error

I am new to Python and just downloaded it today. I am using it to work on a web spider, so to test it out and make sure everything was working, I downloaded some sample code. Unfortunately, it does not work and gives me the error: "AttributeError: 'MyShell' object has no attribute 'loaded' " I am not sure if the code itself has an erro...

Spider/Crawler for testing an AJAX web app that requires a session cookie?

We have a web app that is heavy on AJAX and very customizable, so we need something that will click on every link in it to make sure that none of the forms/pages break. I know that there are lots of spiders/crawlers out there, but we haven't been able to find one that's easy to implement and works with AJAX and allows you to have a se...

How to extract the headline and content from a crawled web page / article?

I need some guidelines on how to detect the headline and content of crawled pages. I've been seeing some very weird front-end codework since I started working on this crawler. ...

Spider a Website and Return URLs Only

I'm not quite sure how best to define/articulate this, but I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget using the --spider option, but when piping that output through a grep, I can't seem to find the...
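As an alternative to filtering Wget output, here is a minimal Python sketch that pulls only the link targets out of a page that has already been fetched. It swaps in the standard-library HTML parser rather than Wget, and the names LinkCollector and extract_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the unique link targets in first-seen order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    # dict.fromkeys drops duplicates while keeping order.
    return list(dict.fromkeys(collector.links))
```

Feeding each fetched page through extract_links and queueing the results would give a plain list of URIs without keeping any content.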

Creating a spider using Scrapy, Spider generation error.

I just downloaded Scrapy (web crawler) on Windows 32 and have just created a new project folder using the "scrapy-ctl.py startproject dmoz" command in DOS. I then proceeded to create the first spider using the command: scrapy-ctl.py genspider myspider myspdier-domain.com but it did not work and returns the error: Error running: scrap...

Maximum page fetch with maximum bandwidth

Hi, I want to create an application like a spider. I've implemented page fetching with the following code in a multi-threaded application, but there are two problems: 1) I want to use my maximum bandwidth to send/receive requests. How should I configure my requests to do so (like the Download Accelerator application and the like)? Because I heard the normal appl...

Is Flash's getUrl(...) spiderable by Google?

If I made a homepage with an embedded .swf which had buttons that linked to other HTML pages on my website using the getUrl() function, would those links be spiderable by Google? Or should I also put in text links outside of the .swf (which would ruin the design a bit)? I know a lot of people will argue I shouldn't have Flash as the mai...

Is there a spider for Zend Lucene?

Is there a pre-written PHP spider/crawler that can be used to feed documents to the Zend_Search_Lucene indexer? I've found Sphider, but it is very tightly coupled to MySQL and not able to be integrated easily with Zend Lucene (as far as I can tell). I'd originally written the search index to work on CMS/WordPress page-save, so no spideri...

More problems with my Perl Tutorial

Thanks to everyone who has helped me get this far. Now my new problem. I'm working with a book that was written in 2003 and the tutorial is trying to spider a page that has changed. The original address is: "http://www.oreilly.com/catalog/prdindex.html" this page no longer exists but it does redirect to the new page: "http://oreilly.co...

How to pass a string to a spider from another script

Hi all! I am new to Python and Scrapy. I am running scrapy-ctl.py from another Python script using the subprocess module. But I want to pass the 'start url' to the spider from this script itself. Is it possible to pass start_urls (which are determined in the script from which scrapy-ctl is run) to the spider? I will be grateful f...