spider

What is the best way to control where Scrapy crawls when collecting large amounts of specific data from many different sites?

I have been working on a spider that gathers data for research using Scrapy. It crawls around 100 sites that each have a large number of links within them. I need to specify where the spider crawls so that I can tell it to collect data from certain parts of a site while not crawling others, to save time. I have been having muc...
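
A rough way to frame this: Scrapy's CrawlSpider lets each rule whitelist and blacklist URL patterns, and allowed_domains keeps the crawl on the target sites. A minimal sketch against the current Scrapy API (the domain, URL patterns, and extracted fields below are made-up placeholders):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class ResearchSpider(CrawlSpider):
        name = "research"
        allowed_domains = ["example.com"]        # keeps the crawl on the target site
        start_urls = ["https://example.com/"]

        rules = (
            # Follow only the sections we care about; skip the rest of the site.
            Rule(LinkExtractor(allow=(r"/articles/", r"/reports/"),
                               deny=(r"/forum/", r"/login")),
                 callback="parse_item",
                 follow=True),
        )

        def parse_item(self, response):
            yield {"url": response.url,
                   "title": response.css("title::text").get()}

One spider class like this per site (or per group of similar sites) keeps the allow/deny patterns manageable across the ~100 targets.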

Make a Web Crawler/Spider

Hi, I'm looking into making a web crawler/spider, but I need someone to point me in the right direction to get started. Basically, my spider is going to search for audio files and index them. I'm just wondering if anyone has any ideas for how I should do it. I've heard that having it done in PHP would be extremely slow. I know VB.NET so c...
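
Whichever language ends up being used, the core of such a crawler is a fetch-parse-enqueue loop that records links with audio extensions and keeps following ordinary pages on the same site. A minimal single-threaded Python sketch, purely to illustrate the shape of it (the start URL, extension list, and page cap are arbitrary):

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    AUDIO_EXTS = (".mp3", ".ogg", ".wav", ".flac")

    class LinkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

    def crawl(start_url, max_pages=100):
        host = urlparse(start_url).netloc
        seen, queue, audio = set(), deque([start_url]), []
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url).read().decode("utf-8", "ignore")
            except Exception:
                continue
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                link = urljoin(url, href)
                if link.lower().endswith(AUDIO_EXTS):
                    audio.append(link)        # found an audio file to index
                elif urlparse(link).netloc == host:
                    queue.append(link)        # keep crawling the same site
        return audio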

Can Google see the contents of an iframe when spidering?

I've looked this up and have not found consistent answers. I want to embed a Google Doc in my page (when you publish your Google Doc, it gives you an iframe). Will search engines like Google be able to read the contents of the document (just text, but it may have important keywords)? Or will they act as if the page were empty? If it cannot ind...

Code style preferences using Sicstus and Eclipse (Spider)

Hi all, I am currently using SICStus Prolog VC9 4.1.1 within Eclipse Galileo (Spider). I have a very newbie question: how can I automatically control indentation and, in general, code style preferences? Thanks, I. ...

How to remove u'' from a Python script's result?

Hello. I'm trying to write a parsing script using Python/Scrapy. How can I remove the [] and u'' from the strings in the result file? Right now I have text like this: from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from scrapy.utils.markup import remove_tags from googleparser.items import GoogleparserItem import sys clas...
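
For what it's worth, the u'' prefix is just Python 2's marker for unicode strings, and the [ ] means the selector returned a list; neither has to appear in the output if you join the parts and encode before writing. A small sketch in the same old HtmlXPathSelector style as the excerpt, assuming hxs = HtmlXPathSelector(response) inside the parse method (the XPath and item field are placeholders):

    parts = hxs.select('//h3//text()').extract()   # e.g. [u'Some ', u'title']
    title = u''.join(parts).strip()                # u'Some title'
    item['title'] = title.encode('utf-8')          # plain bytes: no u'' or [ ] in the file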

Do the spiders indexing your website (Googlebot...) have a "culture"?

Hello. This is an SEO question: I have the choice of displaying a page's title according to the culture of the visitor. If the visitor is English: <title> <?php if ($sf_user->getCulture() == 'en') : ?> Hello, this is an english website <?php else : ?> Bonjour, ceci est un site français <?php endif; ?> </title> Do the bots/spid...

Is it possible to create a web bot in Delphi?

I was just curious about the webbot project, and I wish I could create something similar to it. ...

How can I gather all links on a site without content?

I would like to get all URLs a site links to (on the same domain) without downloading all of the content with something like wget. Is there a way to tell wget to just list the links it WOULD download? For a little background on what I'm using this for, in case someone can come up with a better solution: I'm trying to build a robots.txt file t...
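
One caveat worth keeping in mind: any tool, wget included, still has to fetch the HTML pages in order to discover their links; what can be skipped is saving them and downloading non-HTML assets. A rough Python sketch using the requests and beautifulsoup4 packages that sends a HEAD request first, only GETs pages whose Content-Type is text/html, and prints every same-domain URL it saw (the start URL and page cap are placeholders):

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin, urlparse

    def list_links(start_url, max_pages=200):
        host = urlparse(start_url).netloc
        seen, todo = set(), [start_url]
        while todo and len(seen) < max_pages:
            url = todo.pop()
            if url in seen:
                continue
            seen.add(url)
            try:
                head = requests.head(url, allow_redirects=True, timeout=10)
                if "text/html" not in head.headers.get("Content-Type", ""):
                    continue                       # recorded, but body never fetched
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            soup = BeautifulSoup(html, "html.parser")
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"]).split("#")[0]
                if urlparse(link).netloc == host and link not in seen:
                    todo.append(link)
        return sorted(seen)

    for link in list_links("https://example.com/"):
        print(link)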

Suggestions on spidering/cloning an annoying website; logins/JavaScript to traverse

A friend of mine wants to collect data from a website. I recommended that spidering would be a fast way of automating the process, but when I saw the website, I found that it wasn't so simple at all. First, a login with a CAPTCHA thwarts most spidering software; there is no way that I can manually log in and use the cookie to get all ot...

Make a link completely invisible?

I'm pretty sure that many people have thought of this, but for some reason I can't find it using Google or Stack Overflow search. I would like to make an invisible link (blacklisted by robots.txt) to a CGI or PHP page that will "trap" malicious bots and spiders. So far, I've tried: Empty links in the body: <a href='/trap'><!-- nothin...

Using form buttons for spam-proof email addresses

I have been looking at some methods for spam-proofing email addresses here. I'd like to propose a simpler approach: since I need a couple of different email addresses, I considered just using a select box with JS or a server-side redirect, as per examples on here. Because Google doesn't spider forms (dixit Matt Cutts), and spam-harvester scrip...

Extracting data from an ASPX page

I've been entrusted with an idiotic and tedious task by my boss. The task is: given a web application that returns a table with pagination, write software that "reads and parses it", since there is nothing like a web service that provides the raw data. It's like a "spider" or "crawler" application to steal data that is not meant to be a...
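
In case it helps to sketch the approach: without a web service, the usual fallback is to replay what the browser sends. WebForms pagination is normally a POST back to the same page carrying the hidden __VIEWSTATE/__EVENTVALIDATION fields plus an __EVENTTARGET naming the grid control. A rough Python sketch with requests and BeautifulSoup; the page URL, the grid control ID, the row selector, and the "Page$N" argument are guesses that would have to be confirmed in the browser's network tab:

    import requests
    from bs4 import BeautifulSoup

    URL = "https://example.com/Report.aspx"       # placeholder page
    session = requests.Session()

    def hidden_fields(soup):
        # Collect the ASP.NET hidden inputs (__VIEWSTATE, __EVENTVALIDATION, ...).
        return {i["name"]: i.get("value", "")
                for i in soup.find_all("input", type="hidden") if i.get("name")}

    pages_wanted = 5
    soup = BeautifulSoup(session.get(URL).text, "html.parser")   # page 1 is a plain GET
    for page in range(1, pages_wanted + 1):
        print([td.get_text(strip=True)
               for td in soup.select("table.results td")])       # placeholder selector
        if page == pages_wanted:
            break
        data = hidden_fields(soup)
        data["__EVENTTARGET"] = "ctl00$Main$GridView1"           # grid control ID: check the real page
        data["__EVENTARGUMENT"] = "Page$%d" % (page + 1)
        soup = BeautifulSoup(session.post(URL, data=data).text, "html.parser")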

Can someone suggest a web spider?

Is there a web spider which can grab the contents of forums? My company does not provide an internet connection, so I want to grab the threads of a forum beforehand so that I can read the contents at the company. I have tried WebLech, but it can only grab static pages. ...

Anemone: crawling only to a certain page depth

I am not understanding how to use the Tentacle part of Anemone. If I am interpreting it right, I feel I could use it to crawl only a certain page depth away from the root. Anemone.crawl(start_url) do |anemone| tentacle.new(I think, but not working) anemone.on_every_page do |page| puts page.depth puts page.url e...

Can I use wget to generate a sitemap of a website given its URL?

I need a script that can spider a website and return the list of all crawled pages in plain text or a similar format, which I will then submit to search engines as a sitemap. Can I use wget to generate a sitemap of a website? Or is there a PHP script that can do the same? ...
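
Half of the job, at least, is trivial once some crawler (wget-based or otherwise) has produced the list of URLs: both the plain-text sitemap format (one URL per line) and the standard sitemaps.org XML are simple to emit. A small Python sketch that writes both from a list (the URLs are placeholders):

    from xml.sax.saxutils import escape

    urls = ["https://example.com/", "https://example.com/about"]   # output of your crawler

    # Plain-text sitemap: just one URL per line.
    with open("urls.txt", "w") as f:
        f.write("\n".join(urls) + "\n")

    # XML sitemap in the sitemaps.org format.
    with open("sitemap.xml", "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in urls:
            f.write("  <url><loc>%s</loc></url>\n" % escape(url))
        f.write("</urlset>\n")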

How to disallow access to a URL called without parameters using robots.txt

I would like to deny web robots access to a URL like this: http://www.example.com/export while allowing this kind of URL instead: http://www.example.com/export?foo=value1 A spider bot is calling /export without a query string, causing a lot of errors in my log. Is there a way to manage this filter in robots.txt? ...
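
For what it's worth, the big crawlers (Googlebot, Bingbot) support the $ end-of-URL anchor as an extension to robots.txt, which makes exactly this distinction expressible; it is not part of the original robots.txt standard, so smaller bots may ignore it. A sketch of what that would look like:

    User-agent: *
    Disallow: /export$

Under that matching rule, the bare /export is blocked, while /export?foo=value1 does not end at /export and so stays allowed.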

PHPCrawl sometimes returns empty-handed

I'm using the PHPCrawl class to spider websites and build a list of links. It all works well, if slowly, and I then use the links to perform other tasks. I'm encountering a problem where the first time I run the script it completes with no result, and then the next time I run it, it works as expected. It's failing about 30% of the time. I t...

Python web crawling and storing to MySQL

Hi. I've been looking for a few days for a simple solution to this, but I think that at this point I am still at the beginning :) I need a good web crawler written in Python to store complete pages into a MySQL database. The small system that I am experimenting with now uses PHP Sphider to crawl and store into a database. I need something that works almost ex...
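
The storage half is small on its own: with the pymysql driver you can insert each fetched page's URL and HTML into a table, and put whatever crawl loop you prefer (or Scrapy) on top. A sketch; the connection details, schema, and URL are made up:

    import pymysql
    import requests

    conn = pymysql.connect(host="localhost", user="crawler", password="secret",
                           database="crawl", charset="utf8mb4")

    with conn.cursor() as cur:
        cur.execute("""CREATE TABLE IF NOT EXISTS pages (
                           id   INT AUTO_INCREMENT PRIMARY KEY,
                           url  VARCHAR(512) NOT NULL,
                           html MEDIUMTEXT,
                           fetched_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                       )""")

    def store(url):
        # Fetch the page and keep the complete HTML alongside its URL.
        html = requests.get(url, timeout=10).text
        with conn.cursor() as cur:
            cur.execute("INSERT INTO pages (url, html) VALUES (%s, %s)", (url, html))
        conn.commit()

    store("https://example.com/")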

What's up with Facebook policies vs. graph.facebook.com/robots.txt?

Facebook's developer principles and policies and the general terms of use seem to forbid automated data collection, but graph.facebook.com/robots.txt seems to allow it: User-agent: * Disallow: Does anybody know how to make sense of this? ...