web-crawler

Why does Google index this?

On this web page: http://www.alvolante.it/news/pompe_benzina_%E2%80%9Ctruccate%E2%80%9D_autostrada-308391044 there is this image: http://immagini.alvolante.it/sites/default/files/imagecache/anteprima_100/images/rifornimento_benzina.jpg Why is this image indexed if robots.txt contains "Disallow: /sites/"? You can see that it is ...
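
A quick way to check how a well-behaved crawler interprets those rules is Python's stock robots.txt parser; a minimal sketch, assuming the Disallow rule lives in the robots.txt on the image host (the user agent string is only illustrative):

    import robotparser  # Python 2 name; the module is urllib.robotparser in Python 3

    rp = robotparser.RobotFileParser()
    # robots.txt is read per host, so the rules that matter are the ones on the image host
    rp.set_url("http://immagini.alvolante.it/robots.txt")
    rp.read()

    image_url = ("http://immagini.alvolante.it/sites/default/files/imagecache/"
                 "anteprima_100/images/rifornimento_benzina.jpg")
    # False means a well-behaved crawler should not fetch this URL
    print(rp.can_fetch("Googlebot-Image", image_url))

Note that Disallow only forbids fetching the URL; Google can still list a blocked URL that it discovers through links on crawlable pages, which is one common reason a "disallowed" image shows up in results.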

How can I reliably parse a QuakeLive player profile using Perl?

I'm currently working on a Perl script to gather data from the QuakeLive website. Everything was going fine until I couldn't get one set of data. I was using regexes, and they work for everything apart from the favourite arena, weapon and game type. I just need to get the names of those three elements into $1 for further processi...
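
For markup like this, a regex tends to break as soon as the page layout shifts; an HTML parser with XPath or CSS selectors (HTML::TreeBuilder::XPath or Mojo::DOM in Perl) is usually more robust. The same idea sketched in Python 2, where the profile URL and the class name are only guesses, not the real QuakeLive markup:

    import urllib2
    from lxml import html  # third-party package, assumed available

    # Placeholder profile URL; adjust to the actual page you are fetching.
    page = html.fromstring(urllib2.urlopen(
        "http://www.quakelive.com/profile/summary/some_player").read())

    # An XPath query keeps matching when whitespace or attribute order changes,
    # which is where regexes on raw HTML usually fall over.
    favourites = page.xpath('//div[@class="prf_vitals"]//b/text()')  # hypothetical class
    print(favourites)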

WYSIWYG web scraping/crawling setup using JavaScript/HTML5?

Hi folks, My goal is to allow less experienced people to set up the parameters needed to scrape some information from a website. The idea is that a user enters a URL, after which this URL is loaded in a frame. The user should then be able to select text within this frame, which should give me enough information to scrape this ...

Scraping and web crawling framework, PHP

Hi, I know about Scrapy (scrapy.org), a fast, high-level screen-scraping and web-crawling framework used to crawl websites and extract structured data from their pages. I have used it in some projects and it is very simple to use, but it is written in Python. My question is: are there similar frameworks for PHP? ...

Is there a server-side DOM engine suitable for crawling?

I found a project, Jaxer, which embeds Firefox's JavaScript engine on the server side, so it can parse HTML server-side very well. But this project seems dead. Such an engine is really helpful when crawling web pages to parse HTML and extract data. Is there some newer technology useful for extracting information? ...

PHP Web Spider Problems

Hey guys, I have a problem building a web spider in PHP that is able to crawl hundreds of websites. I tried several approaches: one with the Snoopy browser class, one with Simple HTML DOM Parser and one with the sfWebBrowserPlugin for Symfony. I run into the same problem with all approaches. My crawler crawls a site in 3 stages: cate...

After making HttpWebRequests for a while, the results start timing out

I have an application that spiders websites for information. It seems like after 20-45 minutes of creating HttpWebRequests, a bunch of them start returning timeouts. One thing we do is attach a BindIPDelegate anonymous function to give each request a specific IP, since we round-robin through about 150 IPs. I'm setting up the HttpWebRequest object...

Mechanize::Firefox gets stuck

I'm using WWW::Mechanize::Firefox to crawl pages that load some content via JavaScript after the page itself has loaded. My code regarding this problem: my ($firemech) = WWW::Mechanize::Firefox->new(tab => 'current', ); $firemech->get($url); die "Cannot connect to $url\n" if !$firemech->success(); print "I'm connected!\n"; my ($retries) = 10; while ($...
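
When content is injected by JavaScript after the initial load, the usual fix is to poll for the element you need with a bounded timeout instead of reading the page immediately, which is what the retry loop above is attempting. The same pattern illustrated in Python with Selenium, purely as a sketch (the URL and element id are hypothetical):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    url = "http://example.com/page-that-loads-content-via-js"  # placeholder

    driver = webdriver.Firefox()
    driver.get(url)

    # Wait up to 30 seconds for the JavaScript-generated element to appear,
    # then fail loudly instead of looping forever.
    element = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.ID, "late_loaded_content")))
    print(element.text)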

Can I allow indexing (by search engines) of restricted content without making it public?

Hi, I have a site with some restricted content. I want my site to appear in search results, but I do not want the content to become public. Is there a way I can allow crawlers to crawl my site but prevent them from making it public? The closest solution I have found is Google First Click Free, but even that requires me to show the c...

Problems in a Ruby screen-scraping script

Hi! I have a small crawler/screen-scraping script that used to work half a year ago, but now it doesn't work anymore. I checked the HTML and CSS values the regular expressions look for against the page source, and they are still the same, so from this point of view it should work. Any guesses? require "open-uri" # output file f = open 'results.csv'...

C# web and FTP crawler library

Hi! I need a library (hopefully in C#!) which works as a web crawler to access HTTP files and FTP files. In principle I'm happy with reading HTML, but I want to extend it to PDF, Word, etc. I'm happy with starter open-source software, or at least some pointers to documentation. Best regards, David ...

Is there a service or website to get content pertinent to a specified city/state or ZIP code

I have been looking around for a website that can automatically provide me with content relevant to either a city/state combo or a ZIP code. Essentially I want to have a bit of content pertinent to where my user actually is. Does anybody know of any online services that provide something like this? I also wouldn't be opposed to spidering ...

Using Python to download a document that's not explicitly referenced in a URL

I wrote a web crawler in Python 2.6 using the Bing API that searches for certain documents and then downloads them for classification later. I've been using string methods and urllib.urlretrieve() to download results whose URL ends in .pdf, .ps etc., but I run into trouble when the document is 'hidden' behind a URL like: http://www.oecd...
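
One way around the extension check is to fetch the result and look at the response headers instead of the URL: after redirects, the Content-Type tells you what was actually served. A sketch for Python 2.6 with urllib2, where the URL is only a placeholder since the real OECD link is truncated above:

    import urllib2

    url = "http://example.org/some/landing-page"  # placeholder for the truncated link

    response = urllib2.urlopen(url)            # urllib2 follows redirects automatically
    content_type = response.info().gettype()   # e.g. "application/pdf"
    print(response.geturl())                   # address of the document actually served

    if content_type in ("application/pdf", "application/postscript", "application/msword"):
        with open("download.bin", "wb") as out:
            out.write(response.read())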

Facebook Permission request form for Crawling?

I have been Googling for some time, but I guess I am using the wrong set of keywords. Does anyone know the URI that lets me request permission from Facebook to crawl their network? Last time I was using Python to do this, someone suggested I look at it, but I couldn't find that post either. ...

Does Googlebot crawl URLs in jQuery $.get() calls, and can it be prevented?

Hi, I have a page that has a form using the ajaxForm jQuery plugin. The form submits, and when it's complete, there is a call using $.get() to load some new content into the page. My problem is that Googlebot "appears" to be indexing the URL in the $.get() call. My first question is: is that even possible? I was under the impression...
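
However Googlebot discovered the URL, the response itself can be marked as non-indexable with an X-Robots-Tag header (or the path can be disallowed in robots.txt). A minimal sketch of the header approach using Python's bundled WSGI server, with a hypothetical endpoint path since the real one isn't shown:

    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        if environ["PATH_INFO"] == "/ajax/after-submit":   # hypothetical $.get() endpoint
            start_response("200 OK", [
                ("Content-Type", "text/html"),
                # Ask search engines not to index this URL even if they discover it
                ("X-Robots-Tag", "noindex, nofollow"),
            ])
            return [b"<p>thanks, your form was submitted</p>"]
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()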

Preventing my PHP Web Crawler from Stalling

I'm using the PHPCrawl class and added some DOMDocument and DOMXPath code to pull specific data off web pages; however, the script stalls out before it gets even close to crawling the whole website. I have set_time_limit set to 100000000, so that shouldn't be the issue. Any ideas? Thank you, Nick <?php // It may take a while to crawl a...

PHP crawler for an ASP.NET site

I want to write a crawler to fetch data from an ASP.NET site which uses JavaScript to do the pagination ...

How to index a blog as a search engine?

I want to create a simple search engine for learning purposes, and I want to know how to index a simple blog site. A blog site has many pages, and on every page there is a blog post. But every page also has other elements in common (header, footer, category block and so on). In your opinion, how should I index this blog? The ...
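
A common starting point is to index only the part of the page that is unique to the post (for example, the post container div) and build an inverted index from its text, so the shared header, footer and category block never enter the index. A toy sketch, where the div id and the lxml dependency are assumptions:

    import re
    from collections import defaultdict
    from lxml import html  # third-party package, assumed available

    inverted_index = defaultdict(set)   # term -> set of page URLs containing it

    def index_page(url, raw_html):
        tree = html.fromstring(raw_html)
        # Only take the post body, so boilerplate shared by every page is skipped.
        posts = tree.xpath('//div[@id="post"]')    # hypothetical container id
        text = " ".join(p.text_content() for p in posts)
        for term in re.findall(r"\w+", text.lower()):
            inverted_index[term].add(url)

    def search(term):
        return inverted_index.get(term.lower(), set())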