In this webpage:
http://www.alvolante.it/news/pompe_benzina_%E2%80%9Ctruccate%E2%80%9D_autostrada-308391044
there is this image:
http://immagini.alvolante.it/sites/default/files/imagecache/anteprima_100/images/rifornimento_benzina.jpg
Why is this image indexed if the robots.txt contains "Disallow: /sites/"?
You can see that is ...
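One thing worth checking here, assuming the quoted "Disallow: /sites/" rule comes from the robots.txt served by www.alvolante.it: robots.txt rules apply per host, and the image lives on immagini.alvolante.it. The sketch below only derives which robots.txt file governs each URL; the helper name `robots_url` is made up for illustration, and the sketch does not fetch or assume the contents of either file.

```python
from urllib.parse import urlsplit


def robots_url(url: str) -> str:
    """Return the robots.txt location that governs the given URL (same scheme and host)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# The two URLs quoted in the question.
page = "http://www.alvolante.it/news/pompe_benzina_%E2%80%9Ctruccate%E2%80%9D_autostrada-308391044"
image = ("http://immagini.alvolante.it/sites/default/files/imagecache/"
         "anteprima_100/images/rifornimento_benzina.jpg")

print(robots_url(page))
print(robots_url(image))
```

If the two printed locations differ, a rule in one file says nothing about URLs on the other host.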
I'm currently working on a Perl script to gather data from the QuakeLive website.
Everything was going fine until I couldn't get a set of data.
I was using regexes for that and they work for everything apart from the favourite arena, weapon and game type. I just need to get the names of those three elements in a $1 for further processi...
Hi folks,
My goal is to allow less experienced people to set up the parameters needed to scrape some information from a website.
The idea is that a user enters a URL, after which this URL is loaded in a frame. The user should then be able to select text within this frame, which should give me enough information to scrape this ...
Possible Duplicate:
Why does Google index this?
In this webpage:
http://www.alvolante.it/news/pompe_benzina_%E2%80%9Ctruccate%E2%80%9D_autostrada-308391044
there is this image:
http://immagini.alvolante.it/sites/default/files/imagecache/anteprima_100/images/rifornimento_benzina.jpg
Why this image is indexed if in the robots.tx...
Hi,
I know Scrapy (scrapy.org), a fast high-level screen scraping and web crawling framework used to crawl websites and extract structured data from their pages. I have used it in some projects and it is very simple to use, but it is written in Python.
My question is, are there similar frameworks for PHP?
...
I found a project, Jaxer, which embeds Firefox's JavaScript engine on the server side, so it can parse HTML server-side very well. However, this project seems dead. A tool like this is really helpful for crawling web pages, parsing HTML, and extracting data.
Is there some new technology useful for extracting information?
...
Hey guys,
I have a problem building a web spider in PHP that can crawl hundreds of websites.
I tried several approaches: one with the Snoopy browser class, one with Simple HTML DOM Parser, and one with the sfWebBrowserPlugin for Symfony. I ran into the same problem with all approaches. My crawler crawls a site in 3 stages...cate...
I have an application that spiders websites for information. It seems like after 20-45 minutes of creating HttpWebRequests a bunch of them return timeouts. One thing we do is attach a BindIPDelegate anonymous function to give the request a specific IP since we round-robin through about 150 IPs.
I'm setting up the HttpWebRequest object...
I'm using WWW::Mechanize::Firefox to crawl pages that load some JavaScript after they have been loaded.
My code regarding this problem:
my ($firemech) = WWW::Mechanize::Firefox->new(tab => 'current', );
$firemech->get($url);
die "Cannot connect to $url\n" if !$firemech->success();
print "I'm connected!\n";
my ($retries) = 10;
while ($...
Hi,
I have a site with some restricted content. I want my site to appear in search results, but I do not want the content to become public.
Is there a way to let crawlers crawl my site but prevent them from making it public?
The closest solution I have found is Google First Click Free, but even that requires me to show the c...
Hi!
I have a small crawler/screen-scraping script that worked half a year ago but no longer does. I checked the HTML and CSS values used by the regular expression against the page source, and they are still the same, so from this point of view it should work. Any guesses?
require "open-uri"
# output file
f = open 'results.csv'...
Hi!
I need a library (hopefully in C#!) that works as a web crawler to access HTTP files and FTP files. In principle, I'm happy with reading HTML, but I want to extend it to PDF, Word, etc.
I'm happy with starter open-source software, or at least any pointers to documentation.
Best regards,
David
...
I have been looking around for a website that can automatically provide me with content relevant to either a city/state combo or a zip code. Essentially I want to have a bit of content pertinent to where my user actually is. Does anybody know of any online services that provide something like this? I also wouldn't be opposed to spidering ...
I wrote a web crawler in Python 2.6 using the Bing API that searches for certain documents and then downloads them for classification later. I've been using string methods and urllib.urlretrieve() to download results whose URL ends in .pdf, .ps etc., but I run into trouble when the document is 'hidden' behind a URL like:
http://www.oecd...
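A minimal sketch of one alternative to matching on the URL suffix, assuming the server reports a usable Content-Type header: ask for the headers first and decide from the MIME type. It is written for Python 3 rather than the Python 2.6 mentioned above, and the names `DOCUMENT_TYPES`, `is_document` and `fetch_content_type` are illustrative only.

```python
import urllib.request

# MIME types treated as downloadable documents (an illustrative, incomplete set).
DOCUMENT_TYPES = {"application/pdf", "application/postscript"}


def is_document(content_type_header: str) -> bool:
    """Decide from a Content-Type header value, ignoring parameters such as charset."""
    mime = content_type_header.split(";", 1)[0].strip().lower()
    return mime in DOCUMENT_TYPES


def fetch_content_type(url: str) -> str:
    """Issue a HEAD request and return the Content-Type header (empty string if absent)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get("Content-Type", "")
```

A crawler could call `is_document(fetch_content_type(url))` before downloading. Some servers reject HEAD requests or send a generic type, so this is a heuristic rather than a guarantee.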
I have been Googling for some time, but I guess I am using the wrong set of keywords. Does anyone know the URI that lets me request permission from Facebook to crawl their network? The last time I was using Python to do this, someone suggested that I look at it, but I can't find that post either.
...
Hi,
I have a page with a form that uses the ajaxForm jQuery plugin. The form submits, and when it's complete, there is a call using $.get() to load some new content into the page.
My problem is that the Googlebot "appears" to be indexing the URL in the $.get() method.
My first question is, is that even possible? I was under the impression...
I'm using the PHPCrawl class and added some DOMDocument and DOMXPath code to take specific data off web pages; however, the script stalls before it gets even close to crawling the whole website.
I have set_time_limit set to 100000000 so that shouldn't be an issue.
Any ideas?
Thank you,
Nick
<?php
// It may take a while to crawl a...
I want to write a crawler to fetch data from an ASP.NET site that uses JavaScript for pagination.
...
I want to create a simple search engine for learning purposes.
I want to know how to index a simple blog site.
A blog site has many pages, and every page contains a blog post.
But every page also has other elements in common (header, footer, category block, and so on).
In your opinion, how can I index this blog?
The ...
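One possible starting point for the blog question above, offered as a rough sketch rather than a recommended design: treat lines that appear on every page as shared chrome (header, footer, category block), drop them, and build an inverted index from what remains. It assumes each page has already been reduced to plain text, one block per line; the function names and sample pages are made up for illustration.

```python
from collections import defaultdict


def strip_shared_lines(pages):
    """Remove lines that occur on every page, which are likely header/footer boilerplate."""
    line_sets = [set(text.splitlines()) for text in pages.values()]
    shared = set.intersection(*line_sets) if len(line_sets) > 1 else set()
    return {
        url: "\n".join(line for line in text.splitlines() if line not in shared)
        for url, text in pages.items()
    }


def build_index(pages):
    """Map each lowercased word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


# Hypothetical sample data: two pages sharing a header and a footer line.
pages = {
    "/a": "My Blog\nfirst post about perl\nFooter",
    "/b": "My Blog\nsecond post about python\nFooter",
}
index = build_index(strip_shared_lines(pages))
print(sorted(index["post"]))
```

Exact line matching is fragile on real HTML, where the shared blocks often differ slightly from page to page, so a real indexer would more likely compare DOM subtrees or extract the post container directly.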