web-crawler

How to obtain the second page with file_get_contents

The idea is to fetch the whole page with file_get_contents for a history record. When I do $original_file_div = file_get_contents("http://webpage.com/"); I get a webpage that asks for an email. If I open the webpage in any browser I see that page, but ... when I press refresh I get access to a new page. I tried to do: $original...
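
A guess at what is happening (the question is PHP, but the mechanism is not PHP-specific): the email page probably sets a session cookie on the first visit, and the "second page" only appears once that cookie is sent back; file_get_contents() keeps no cookies between calls, so every call looks like a first visit. A minimal sketch of that flow in Python with a cookie-keeping session, reusing the placeholder URL from the question:

    # Assumption: the site gates content behind a cookie set on the first request.
    import requests

    session = requests.Session()
    first = session.get("http://webpage.com/")    # "enter your email" page; cookies are stored
    second = session.get("http://webpage.com/")   # same URL, now sent with the cookie
    print(second.text[:200])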

Parse HTML pages and store the contents (title, text, etc.) in a database.

Hi, does anybody know of some open source tools to parse HTML pages, filter out the ads, JS, etc., and get the title and text? The front end of my application is based on LAMP, so I need to parse the HTML pages, store them in MySQL, and populate the front pages with that data. I know some tools: Heritrix, Nutch. But it seems that they are crawler...
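
Aside from full crawler frameworks like Heritrix or Nutch, the parse-and-store step itself is small: an HTML parser plus a MySQL client. A sketch in Python (the question's stack is LAMP, so this only shows the shape of the step); the pages table and its columns are made up:

    # Extract <title> and the visible text, then insert into MySQL.
    # Table and column names (pages, url, title, body) are hypothetical.
    import mysql.connector
    import requests
    from bs4 import BeautifulSoup

    url = "http://example.com/article"            # placeholder URL
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

    for tag in soup(["script", "style"]):         # drop JS and CSS blocks before taking the text
        tag.decompose()
    title = soup.title.get_text(strip=True) if soup.title else ""
    text = soup.get_text(" ", strip=True)

    conn = mysql.connector.connect(host="localhost", user="crawler",
                                   password="secret", database="crawl")
    cur = conn.cursor()
    cur.execute("INSERT INTO pages (url, title, body) VALUES (%s, %s, %s)",
                (url, title, text))
    conn.commit()
    conn.close()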

I want to get the HTTP response of a webpage after the AJAX response is loaded

Usually when we get the response, the HTML is without the AJAX response, because it is requested later. But I want the source of the page containing the AJAX request's response. ...
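
The usual route is to let a real browser engine run the page's JavaScript and then read the resulting DOM, rather than fetching the raw HTML. One common option is Selenium with headless Chrome; a minimal sketch (the URL and the fixed sleep are placeholders, and waiting for a specific element with WebDriverWait is the more robust way to know the AJAX content has arrived):

    # Let a headless browser execute the JavaScript, then read the final DOM.
    import time
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    driver.get("http://example.com/page-with-ajax")   # placeholder URL
    time.sleep(3)                                     # crude wait for the AJAX call to finish
    html_after_ajax = driver.page_source              # DOM including the AJAX-loaded content
    driver.quit()
    print(html_after_ajax[:200])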

Good .NET-based open source web crawler?

Hi, for my new project I need to implement a .NET-based web crawler. I searched for an open source option and found an entry here at SO that mentioned Arachnode.net as an open source solution. I visited arachnode.net and, to my surprise, the project is fully commercial and there is not even a free community edition (if it's really an op...

Download a link with GET vars

How can I download the page at this link: http://www.kayak.com/s/search/air?ai=kayaksample&do=y&ft=ow&ns=n&cb=e&pa=1&l1=ZAG&t1=a&df=dmy&d1=4/10/2010&depart_flex=exact&r1=y&l2=LON&t2=a&d2=11/10/2010&return_flex&r2=y The link changes to a short version (for example www.kayak.com/r/OcJd...
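
In a script, the two usual snags are keeping the long query string (all the & pairs) intact and following the site's redirect to the short /r/... URL. A sketch in Python using the URL from the question; if the site does answer with a redirect, the library follows it automatically:

    # Fetch the long search URL; requests follows redirects by default.
    import requests

    url = ("http://www.kayak.com/s/search/air?ai=kayaksample&do=y&ft=ow&ns=n&cb=e"
           "&pa=1&l1=ZAG&t1=a&df=dmy&d1=4/10/2010&depart_flex=exact&r1=y"
           "&l2=LON&t2=a&d2=11/10/2010&return_flex&r2=y")

    resp = requests.get(url, timeout=30)
    print(resp.url)          # final (possibly shortened) URL after any redirects
    print(len(resp.text))    # the downloaded page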

Twill - how do I choose multiple selects with the same name?

I am using twill and Python to write a web crawler. showforms() returns:

    Form name=customRatesForm (#1)
    ## ##  __Name__________________  __Type___  __ID________  __Value__________________
    10     originState               hidden     originState   TN
    11     destState                 hidden     destState     IL
    12     originZip                 text       ...
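
If I remember twill's behaviour right, formvalue (fv) accepts the field number that showforms() prints in place of the field name, which is the usual way around several fields sharing one name; worth verifying against your twill version. A short sketch from Python, with a placeholder URL and the field numbers from the output above:

    # Address fields by the number showforms() prints instead of their (shared) name.
    from twill.commands import go, showforms, formvalue, submit

    go("http://example.com/rates")        # placeholder for the page with customRatesForm
    showforms()                           # prints forms and numbered fields

    formvalue("1", "10", "TN")            # form #1, field #10 (originState)
    formvalue("1", "11", "IL")            # form #1, field #11 (destState)
    submit()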

Why can't I fetch www.google.com with Perl's LWP::Simple?

Hi, I can't seem to get this piece of code to work:

    $self->{_current_page} = $href;
    my $response = $ua->get($href);
    my $responseCode = $response->code;
    if( $responseCode ne "404" ) {
        my $content = LWP::Simple->get($href);
        die "get failed: " . $href if (!defined $content);
    }

It will return the error: get faile...

How can I make my Perl web crawler go faster?

I've been knocking together a little pet project over the last two days, which consists of making a crawler in Perl. I have no real experience in Perl (only what I have learned in the past two days). My script is as follows: ACTC.pm:

    #!/usr/bin/perl
    use strict;
    use URI;
    use URI::http;
    use File::Basename;
    use DBI;
    use HTML::Parser;
    use LWP::Simple...
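
Without the rest of the script it is hard to say where the time goes, but the first speed-up for most single-threaded crawlers is fetching pages in parallel rather than one at a time (in Perl, modules such as Parallel::ForkManager or LWP::Parallel are the usual route). The shape of the idea, sketched here in Python as a language-neutral illustration with a stand-in URL list:

    # Language-neutral illustration (the question is Perl): fetch several URLs
    # concurrently instead of serially. The URL list stands in for the crawler's queue.
    import concurrent.futures
    import urllib.request

    urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

    def fetch(url):
        with urllib.request.urlopen(url, timeout=10) as resp:
            return url, resp.read()

    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        for url, body in pool.map(fetch, urls):
            print(url, len(body))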

How can I handle JavaScript in a Perl web crawler?

I would like to crawl a website. The problem is that it's full of JavaScript things, such as buttons that, when pressed, do not change the URL, but the data on the page changes. Usually I use LWP / Mechanize etc. to crawl sites, but neither supports JavaScript. Any ideas? ...

How to use Scrapy

Hello, I would like to know how I can start a crawler based on Scrapy. I installed the tool via apt-get install and tried to run an example:

    /usr/share/doc/scrapy/examples/googledir/googledir$ scrapy list
    directory.google.com
    /usr/share/doc/scrapy/examples/googledir/googledir$ scrapy crawl

I hacked the code from spiders/google_di...
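
For reference, in current Scrapy releases a spider is a class with a name, start_urls, and a parse() callback; `scrapy list` prints the spider names in the project and `scrapy crawl <name>` runs one (the example shipped in that Debian package is for an older Scrapy, so its layout and commands may differ slightly). A minimal sketch with example selectors and domain:

    # spiders/example.py inside a Scrapy project -- run with: scrapy crawl example
    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"                              # the name used by `scrapy list` / `scrapy crawl`
        start_urls = ["http://directory.google.com/"]

        def parse(self, response):
            # yield one item per link found on the page
            for href in response.css("a::attr(href)").getall():
                yield {"link": response.urljoin(href)}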

Am I going in the right direction? [Web-Crawling]

Hi, I would like to find all the sites that have the keyword 'surfing waves' somewhere in their address, very simple! But without using ANY search engine, which means writing a pure web crawler. The problems I guess I will face are: it will, obviously, never stop running... It will come across lots of "garbage" sites before it eve...
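
For what it's worth, the skeleton of such a crawler is just a queue, a visited set, and a link extractor; everything hard is in the parts the question already lists (it never ends, and most of what it visits is irrelevant). A rough sketch with placeholder seeds; robots.txt handling, politeness delays, and any stop condition are deliberately left out:

    # Breadth-first crawl from seed URLs, printing any URL whose address contains the keyword.
    from collections import deque
    from urllib.parse import urljoin
    import re
    import requests

    seeds = ["http://example.com/"]          # hypothetical starting points
    patterns = ("surfing-waves", "surfingwaves", "surfing_waves")  # URL-friendly forms of the keyword

    queue, seen = deque(seeds), set(seeds)
    while queue:
        url = queue.popleft()
        if any(p in url.lower() for p in patterns):
            print("match:", url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for href in re.findall(r'href=["\']?([^"\' >]+)', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)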

HttpWebRequest and HttpWebResponse Error

I am trying to run simple web crawler code written on this page. Everything is fine: I tried the program on several sites and it works, but there is one site that, instead of returning the HTML content of its pages, generates a strange error: DotNetNuke Error: - Version 04.05.01 Return to main page and the html returned i...
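
One common reason a single site misbehaves for a crawler but not for a browser is that it inspects request headers (User-Agent, cookies); whether that is what this DotNetNuke install is doing is only a guess. The usual first experiment, shown in Python rather than C# purely to illustrate the shape, with a placeholder URL:

    # Experiment: request the page with browser-like headers and a cookie-keeping session,
    # then compare what comes back with what a browser sees.
    import requests

    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept": "text/html,application/xhtml+xml",
    })
    resp = session.get("http://example.com/somepage.aspx", timeout=15)
    print(resp.status_code)
    print(resp.text[:300])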

How to extract images from a webpage as Facebook does?

If I insert into my wall a link like this: http://blog.bonsai.tv/news/il-nuovo-vezzo-della-lega-nord-favorire-i-lombardi-alluniversita/ then Facebook extracts the right image for the post and not the first image in the webpage (not the logo image or other little images, for example)!! How does Facebook do that? ...
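
As far as is publicly documented, Facebook first honours explicit hints in the page (the og:image Open Graph meta tag, historically also link rel="image_src") and only then falls back to scanning the page's <img> tags with size and aspect-ratio heuristics, which is why logos and tiny images get skipped. A simplified sketch of that lookup order; the final fallback here just takes the first image and does not reproduce the size heuristics:

    # Prefer explicit preview hints (og:image, link rel=image_src) before any <img> scan.
    import requests
    from bs4 import BeautifulSoup

    def preview_image(url):
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        og = soup.find("meta", property="og:image")
        if og and og.get("content"):
            return og["content"]
        link = soup.find("link", rel="image_src")
        if link and link.get("href"):
            return link["href"]
        img = soup.find("img", src=True)
        return img["src"] if img else None

    print(preview_image("http://blog.bonsai.tv/news/il-nuovo-vezzo-della-lega-nord-favorire-i-lombardi-alluniversita/"))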

How to write this crawler in PHP?

I need to create a PHP script. The idea is very simple: when I send a link to a blog post to this PHP script, the webpage is crawled and the first image and the page title are saved on my server. What PHP functions do I have to use for this crawler? ...

Scrubyt "next_page" not working with relative links?

Hello all. I'm trying to scrape the Yellow Pages website. Specifically, this link: http://www.yellowpages.com/santa-barbara-ca/restaurants. My code works perfectly except for one small problem: because the "Next" link to go to the next page of restaurants is a relative link, Scrubyt's "next_page" function doesn't work... apparently...
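
Whatever the Scrubyt-specific fix turns out to be, the underlying operation is resolving the relative "Next" href against the page's own URL (Ruby has URI.join for the same job). A tiny illustration in Python with a made-up relative href:

    # Resolve a relative "Next" link against the page URL before fetching it.
    from urllib.parse import urljoin

    page_url = "http://www.yellowpages.com/santa-barbara-ca/restaurants"
    next_href = "?page=2"                 # hypothetical relative href from the "Next" link
    print(urljoin(page_url, next_href))
    # http://www.yellowpages.com/santa-barbara-ca/restaurants?page=2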

Web log file analysis software to measure search crawlers

I need to analyze the search engine crawling going on on my site. Is there a good tool for this? I've tried AWStats and Sawmill, but both of those give me very limited insight into the crawling. I need to know information like how many unique/distinct webpages in a section of my site were crawled by a specific crawler within a time period...
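
If none of the packaged tools give enough detail, a small script over the raw access log can answer exactly that kind of question ("how many distinct URLs under /section/ did Googlebot fetch"). A sketch assuming the common/combined Apache log format; the section prefix, crawler name, and log filename are just examples:

    # Count distinct paths under a section fetched by a given crawler.
    import re

    line_re = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*".*?"(?P<agent>[^"]*)"$')

    crawler = "Googlebot"
    section = "/docs/"
    unique_paths = set()

    with open("access.log") as log:
        for line in log:
            m = line_re.search(line)
            if m and crawler in m.group("agent") and m.group("path").startswith(section):
                unique_paths.add(m.group("path"))

    print(f"{crawler} crawled {len(unique_paths)} distinct URLs under {section}")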

BOT/Spider Trap Ideas

I have a client whose domain seems to be getting hit pretty hard by what appears to be a DDoS. In the logs it's normal-looking user agents with random IPs, but they're flipping through pages too fast to be human. They also don't appear to be requesting any images. I can't seem to find any pattern, and my suspicion is it's a fleet of Window...
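
One building block that comes up for this, independent of the stack (which the question doesn't name), is per-IP rate limiting: count hits per IP in a short window and challenge or block anything far above human paging speed; a hidden "honeypot" link that only crawlers would follow is another. The counting part, sketched in Python with invented thresholds; a real deployment would sit in the web server or reverse proxy rather than application code:

    # Sliding-window hit counter per IP; thresholds would need tuning for the site.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10
    MAX_HITS = 20                       # more page views than this per window looks non-human

    hits = defaultdict(deque)

    def looks_like_bot(ip, now=None):
        now = time.time() if now is None else now
        q = hits[ip]
        q.append(now)
        while q and now - q[0] > WINDOW_SECONDS:
            q.popleft()
        return len(q) > MAX_HITS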

What technologies would be best to use for my crawler development?

I already have a C# crawler using .NET technologies (a very old IE engine, a MySQL connector, and a lot of tools). My crawler stores data in and loads data from a MySQL database, where I have the following key tables (I have a lot of tables, but these are the ones that matter for this question): Site, SiteSearch, Data. Site: it has the HOME_URL and a lot...

Way to extract HTML and all download attachments from a website

I would like to be able to run a script (or something) that will "download" a certain webpage (HTML) and all of its attachments (Word docs) so that I can keep and operate a private collection. Here is the story... There is this site that I use a lot for research. On this site there are many HTML pages that contain text and download li...
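
If a ready-made tool is acceptable, wget can already do most of this (recursive download restricted to certain extensions, along the lines of wget -r -l 1 -A doc,html <url>); otherwise the script version is a page fetch plus a loop over its links. A Python sketch with a placeholder URL, output directory, and .doc filter:

    # Save the page's HTML, then download every linked .doc file next to it.
    import os
    from urllib.parse import urljoin, urlparse
    import requests
    from bs4 import BeautifulSoup

    page_url = "http://example.com/research/article.html"     # placeholder
    html = requests.get(page_url, timeout=15).text

    os.makedirs("archive", exist_ok=True)
    with open(os.path.join("archive", "page.html"), "w", encoding="utf-8") as f:
        f.write(html)

    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        link = urljoin(page_url, a["href"])
        if link.lower().endswith(".doc"):
            name = os.path.basename(urlparse(link).path) or "attachment.doc"
            with open(os.path.join("archive", name), "wb") as f:
                f.write(requests.get(link, timeout=30).content)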

Anemone: scraping only to a certain page depth

I don't understand how to use the Tentacle part of Anemone. If I am interpreting it right, I feel I could use it to only scrape to a certain page depth away from the root.

    Anemone.crawl(start_url) do |anemone|
      tentacle.new (i think, but not working)
      anemone.on_every_page do |page|
        puts page.depth
        puts page.url
      e...