I just wrote a script that grabs links from a website and saves them to a text file.

Now I'm working on my regexes so that it picks out, from the text file, the links that contain php?dl= in the URL:

E.g.: www.example.com/site/admin/a_files.php?dl=33931

It's pretty much the address you see when you hover over the download button on the site, which you can then click to download or right-click and save.

What I'm wondering is how to download the content at that address (it serves a *.txt file), all from within the script of course.
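For the filtering step described above, a minimal sketch in core Perl might look like the following. The helper name download_links and the sample lines are made up for illustration; the only piece taken from the question is the php?dl= pattern, with the ? escaped because it is a regex metacharacter.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the lines that contain "php?dl=".
sub download_links {
    my @lines = @_;
    my @found;
    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;    # trim surrounding whitespace and newlines
        push @found, $line if $line =~ /php\?dl=/;
    }
    return @found;
}

# Made-up stand-in for the saved text file, one link per line.
my @saved = (
    "www.example.com/site/index.php\n",
    "www.example.com/site/admin/a_files.php?dl=33931\n",
    "www.example.com/site/about.html\n",
);

print "$_\n" for download_links(@saved);
```

With a real file, the same helper could be fed the lines read from a filehandle opened on the saved text file instead of the hard-coded list.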

+2  A: 

Crawling in Perl - A Quick Tutorial

Hal
+4  A: 

Make WWW::Mechanize your new best friend.

Here's why:

  • It can identify links on a webpage that match a specific regex (/php\?dl=/ in this case)
  • It can follow those links through the follow_link method
  • It can get the targets of those links and save them to file

All this without needing to save your wanted links in an intermediate file! Life's sweet when you have the right tool for the job...


Example

use strict;
use warnings;
use WWW::Mechanize;

my $url  = 'http://www.example.com/';
my $mech = WWW::Mechanize->new();

$mech->get ( $url );

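# Match on the link's URL (not its visible text), since php?dl= is part of the href
my @linksOfInterest = $mech->find_all_links ( url_regex => qr/php\?dl=/ );

my $fileNumber = 1;

foreach my $link (@linksOfInterest) {

    # ':content_file' writes the response body straight to the named file
    $mech->get ( $link, ':content_file' => "file".($fileNumber++).".txt" );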
    $mech->back();
}
Zaid
Awesome! you stated all the things I have been looking for, for the past 2 hours lol. Thank you :D
eraldcoil
This helped a lot. Thank you very much :D. I still have so much to learn; thanks for pointing out this very helpful module :D
eraldcoil
I see no reason in this example to do the ->back() and ->reload().
Andy Lester
@Andy : I suppose it depends on the page in question. If it updates frequently, a `reload()` may be prudent.
Zaid
@Zaid: You're not doing anything with the reloaded page. @linksOfInterest doesn't change.
Andy Lester
@Andy : Good point. The `->reload()` is useless for the example in question.
Zaid
+2  A: 

You can download the file with LWP::UserAgent:

use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
my $response = $ua->get($url, ':content_file' => 'file.txt');

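Or, if you need a filehandle instead, fetch the page without ':content_file' (with that option the body goes to the file and the response object's content stays empty) and open an in-memory handle on the content:

my $response = $ua->get($url);
open my $fh, '<', $response->content_ref or die $!;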
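The filehandle form relies on Perl's ability to open a handle on a scalar reference, so it can be tried without any network access. A minimal sketch, assuming a hard-coded string ($body, a made-up stand-in) in place of the downloaded content:

```perl
use strict;
use warnings;

# Made-up stand-in for the downloaded body; in the real script this
# would come from the HTTP response rather than a literal string.
my $body = "first line\nsecond line\nthird line\n";

# Passing a scalar reference to open() yields a read-only in-memory handle.
open my $fh, '<', \$body or die "Cannot open in-memory handle: $!";

my @lines;
while ( my $line = <$fh> ) {
    chomp $line;
    push @lines, $line;
}
close $fh;

print scalar(@lines), " lines read\n";
```

Reading line by line this way is handy when the rest of the script already expects a filehandle rather than a string.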
eugene y
Ahh, I see, so that's how you use it. Thanks :D
eraldcoil
Or, just use `LWP::Simple::getstore($url, $file)`.
Sinan Ünür