What is the best way to determine if a page on a website is REALLY displaying a specific img tag like this: <img src=http://domain.com/img.jpg>? A simple string comparison is easy to fool with HTML comments <!-- -->. Even if the tag exists in the HTML it could be deleted with JavaScript. It could also be obscured by placing an image over...
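For illustration, a minimal sketch of the first check, parsing the DOM instead of searching the raw source so commented-out markup doesn't count, assuming Python and BeautifulSoup (both are my assumptions, not part of the question):

    from bs4 import BeautifulSoup

    def has_img_tag(html, src):
        """Return True if an <img> with the given src is present in the parsed DOM.

        Because the markup is parsed, a tag hidden inside <!-- ... --> is a
        comment node, not an element, and will not be matched.
        """
        soup = BeautifulSoup(html, "html.parser")
        return soup.find("img", src=src) is not None

This only proves the tag is in the delivered markup; catching JavaScript removal or an overlapping element would need a real rendering engine (a headless browser) rather than a parser.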
I need to process quite a bit of [fairly] arbitrary HTML data. The data, thankfully, can be broken into about twelve different templates. My current plan is to build a filter for each of the templates that lets me extract the required data sans the irrelevant content. The problem is that I'm not sure what the ideal tool for the job is.
I was h...
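A minimal sketch of the per-template filter idea, assuming BeautifulSoup and purely hypothetical template names and selectors:

    from bs4 import BeautifulSoup

    def extract_article(soup):
        # Hypothetical selectors; each template gets its own extractor and
        # assumes the page really follows that template.
        return {"title": soup.select_one("h1.title").get_text(strip=True),
                "body": soup.select_one("div.content").get_text(" ", strip=True)}

    def extract_product(soup):
        return {"name": soup.select_one("span.product-name").get_text(strip=True),
                "price": soup.select_one("span.price").get_text(strip=True)}

    # One filter per known template, dispatched by whatever identifies the page.
    FILTERS = {"article": extract_article, "product": extract_product}

    def extract(template_name, html):
        return FILTERS[template_name](BeautifulSoup(html, "html.parser"))

Any HTML parser with CSS or XPath selection (lxml, BeautifulSoup, etc.) can play this role; the structure above is just one way to keep the twelve filters separate.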
Hi folks,
the plan is this:
I download a webpage, collect a list of images parsed from the DOM, and then download them. After that I iterate through the images to evaluate which image is best suited to represent the webpage.
The problem is that the images are downloaded one by one, and this can take quite some time.
It would be gre...
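A minimal sketch of downloading the collected images concurrently rather than one by one, assuming Python; the URL list and the pool size are placeholders:

    import concurrent.futures
    import urllib.request

    def fetch(url, timeout=10):
        """Download one image and return (url, bytes)."""
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return url, resp.read()

    def fetch_all(image_urls, workers=8):
        """Download images in parallel with a small thread pool."""
        results = {}
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(fetch, u): u for u in image_urls}
            for fut in concurrent.futures.as_completed(futures):
                try:
                    url, data = fut.result()
                    results[url] = data
                except Exception:
                    pass  # skip images that fail to download
        return results

Because the work is network-bound, threads (or an async HTTP client) overlap the downloads, so the total time approaches that of the slowest few images rather than the sum of all of them.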
Hi, I'm trying to understand what I can do to make my site reachable by Google Image Search spiders.
I like last.fm's solution, and I thought I'd use a technique like the one their staff use to let Google find artist images on their pages.
When I'm looking for an artist and search for them on Google Image Search, as often as not I find an image f...
Is there a way to make a web robot like websiteoutlook.com does? I need something that searches the internet for URLs only... I don't need links, descriptions, etc.
What is the best way to do this without getting too technical? I guess it could even be a cron job that runs a PHP script grabbing URLs from Google, or is there a better way...
Is it possible to make JSON data readable by a Google spider?
Say for instance that I have a JSON feed that contains the data for an e-commerce site. This JSON data is used to populate a human-readable page in the user's browser. (I.e. the translation from JSON data to the human-displayed page is done inside the user's browser; not my choic...
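One common workaround is to render the same feed into plain HTML on the server, so a crawler that doesn't execute JavaScript still sees the content. A minimal sketch, assuming Python and hypothetical url/name/price fields in the feed:

    import json
    import urllib.request

    def render_products_html(feed_url):
        """Fetch the JSON feed and emit a plain-HTML version of the listings."""
        with urllib.request.urlopen(feed_url) as resp:
            products = json.loads(resp.read().decode("utf-8"))
        items = "".join(
            "<li><a href='{url}'>{name}</a> - {price}</li>".format(**p)
            for p in products  # assumes each item carries url/name/price keys
        )
        return "<ul>%s</ul>" % items  # real code should HTML-escape the values

The HTML version can live at crawlable URLs (and in the sitemap) while the JavaScript-driven page stays as it is for users.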
I created a little web spider in python which I'm using to collect URLs. I'm not interested in the content. Right now I'm keeping all the visited URLs in a set in memory, because I don't want my spider to visit URLs twice. Of course that's a very limited way of accomplishing this.
So what's the best way to keep track of my visited URLs?...
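A minimal sketch of one option, assuming SQLite is acceptable: a small on-disk "seen" table survives restarts and keeps the URLs out of memory:

    import sqlite3

    class SeenUrls:
        """Persistent set of visited URLs backed by SQLite."""

        def __init__(self, path="seen_urls.db"):
            self.conn = sqlite3.connect(path)
            self.conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

        def add(self, url):
            """Record a URL; return True if it was new, False if already visited."""
            cur = self.conn.execute(
                "INSERT OR IGNORE INTO seen (url) VALUES (?)", (url,))
            self.conn.commit()
            return cur.rowcount == 1

        def __contains__(self, url):
            cur = self.conn.execute("SELECT 1 FROM seen WHERE url = ?", (url,))
            return cur.fetchone() is not None

For very large crawls, a Bloom filter (or hashing the URLs before storing them) trades a little accuracy or readability for much less space.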
I'm using TagSoup to clean some HTML I'm scraping from the internet, and I'm getting the following error when parsing pages with comments:
The data "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " is not legal for a JDOM com...
Due to some rather bizarre architectural considerations I've had to set up something that really ought to run as a console application as a web page. It does the job of writing a large variety of text files and XML feeds from our site data for various other services to pick up, so obviously it takes a little while to run and is pretty pro...
Hi,
I am looking to download the full Wikipedia text for my college project. Do I have to write my own spider to download this, or is there a public dataset of Wikipedia available online?
To give you some overview of my project, I want to find the interesting words in a few articles I am interested in. But to find these interesting w...
I am new to Python and just downloaded it today. I am using it to work on a web spider, so to test it out and make sure everything was working, I downloaded some sample code. Unfortunately, it does not work and gives me the error:
"AttributeError: 'MyShell' object has no attribute 'loaded' "
I am not sure if the code itself has an erro...
We have a web app that is heavy on AJAX and it is very customizable, so we need something that will click on every link in it to make sure that none of the forms/pages break. I know that there are lots of spiders/crawlers out there, but we haven't been able to find one that's easy to implement, works with AJAX, and allows you to have a se...
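Since the app relies on JavaScript, a plain crawler won't exercise it; driving a real browser will. A minimal sketch, assuming Selenium WebDriver and a hypothetical start page (the error check is only a placeholder):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    def click_every_link(start_url):
        """Open each link found on the start page and flag pages that look broken."""
        driver = webdriver.Firefox()
        try:
            driver.get(start_url)
            hrefs = [a.get_attribute("href")
                     for a in driver.find_elements(By.TAG_NAME, "a")]
            for href in filter(None, hrefs):
                driver.get(href)
                if "error" in driver.title.lower():  # crude check; adapt per app
                    print("Possible breakage:", href)
        finally:
            driver.quit()

A real version would also log in, re-collect links on each page, and assert on app-specific markers rather than the page title.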
I need some guidelines on how to detect the headline and content of crawled pages. I've been seeing some very weird front-end code since I started working on this crawler.
...
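A common starting heuristic, sketched below with BeautifulSoup (the tag choices are assumptions, not a universal rule), is to take the first <h1> (or the <title>) as the headline and the container with the most text as the content:

    from bs4 import BeautifulSoup

    def guess_headline_and_content(html):
        soup = BeautifulSoup(html, "html.parser")

        # Headline: prefer the first <h1>, fall back to <title>.
        node = soup.find("h1") or soup.find("title")
        headline = node.get_text(strip=True) if node else ""

        # Content: crudely, the block-level container holding the most text.
        # Real extractors score text density per node to avoid picking the
        # outermost wrapper that also contains navigation and footers.
        blocks = soup.find_all(["article", "div", "td"])
        best = max(blocks, key=lambda t: len(t.get_text()), default=None)
        content = best.get_text(" ", strip=True) if best else ""
        return headline, content

Readability-style algorithms refine this with link density, class/id hints ("content", "article", "sidebar"), and paragraph counts.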
I'm not quite sure how best to define/articulate this, but I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget using the --spider option, but when piping that output through grep, I can't seem to find the...
I just downloaded Scrapy (web crawler) on 32-bit Windows and have just created a new project folder using the "scrapy-ctl.py startproject dmoz" command at the DOS prompt. I then proceeded to create the first spider using the command:
scrapy-ctl.py genspider myspider myspdier-domain.com
but it did not work and returned the error:
Error running: scrap...
Hi
I want to create a spider-like application. I've implemented page fetching, as in the following code, in a multi-threaded application, but there are two problems:
1) I want to use my maximum bandwidth to send/receive requests. How should I configure my requests to do so (like Download Accelerator and similar applications)? I ask because I heard the normal appl...
If I made a homepage with an embedded .swf which had buttons that linked to other HTML pages on my website using the getURL() function, would those links be spiderable by Google? Or should I also put in text links outside of the .swf (which would ruin the design a bit)?
I know a lot of people will argue I shouldn't have Flash as the mai...
Is there a pre-written PHP spider/crawler that can be used to feed documents to the Zend_Search_Lucene indexer? I've found Sphider, but it is very tightly coupled to MySQL and can't be integrated easily with Zend Lucene (as far as I can tell).
I'd originally written the search index to work on CMS/Wordpress page-save, so no spideri...
Thanks to everyone who has helped me get this far.
Now for my new problem: I'm working with a book that was written in 2003, and the tutorial is trying to spider a page that has changed.
The original address is "http://www.oreilly.com/catalog/prdindex.html"; this page no longer exists, but it does redirect to the new page: "http://oreilly.co...
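If the tutorial's fetch code fails on the redirect, one option is to follow it and work from the final URL. A minimal sketch in Python (the book's own examples may be in another language; this just shows the idea):

    import urllib.request

    resp = urllib.request.urlopen("http://www.oreilly.com/catalog/prdindex.html")
    print(resp.geturl())   # the final URL after any 301/302 redirects
    html = resp.read().decode("utf-8", errors="replace")

urlopen follows ordinary HTTP redirects by default, so geturl() reports where the request actually landed, and the rest of the tutorial can be pointed at that page.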
Hi all!!!
I am new to Python and Scrapy.
I am running scrapy-ctl.py from another Python script using the subprocess module, but I want to pass the start URL to the spider from this script itself. Is it possible to pass start_urls (which are determined in the script from which scrapy-ctl is run) to the spider?
I will be grateful f...
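A minimal sketch, assuming a Scrapy release new enough to support spider arguments (the -a option on the crawl command; very old scrapy-ctl versions may not have it). The spider accepts the URL in __init__, and the controlling script just adds the argument to the subprocess command line:

    # myspider.py -- the spider accepts the start URL as an argument
    from scrapy.spider import BaseSpider  # newer Scrapy versions use scrapy.Spider

    class MySpider(BaseSpider):
        name = "myspider"

        def __init__(self, start_url=None, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            self.start_urls = [start_url] if start_url else []

        def parse(self, response):
            self.log("Visited %s" % response.url)

The controlling script would then call something like subprocess.call(["scrapy", "crawl", "myspider", "-a", "start_url=http://example.com/"]), substituting whatever URL it computed.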