views: 69

answers: 6
Can you point me to an idea of how to get all the HTML files in a subfolder of a website, and in all the folders inside it? For example: www.K.com/goo

I want all the HTML files that are in it: www.K.com/goo/1.html, ... n.html

Also, if there are subfolders, I want to get those too: www.K.com/goo/foo/1.html ... n.html

A: 

Read perldoc File::Find, then use File::Find.
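A minimal File::Find sketch along those lines (not the answerer's code), assuming a local directory tree you can read; the path /path/to/goo is a placeholder. As the comment below points out, this only covers a filesystem, not a remote site.

    use strict;
    use warnings;
    use File::Find;

    # Collect every .html file under a local directory tree.
    # /path/to/goo is a placeholder for the folder you want to scan.
    my @html_files;
    find(
        sub { push @html_files, $File::Find::name if -f && /\.html?$/i },
        '/path/to/goo',
    );
    print "$_\n" for @html_files;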

aschepler
He's asking about retrieving pages from a website, not from the filesystem.
cjm
A: 

I would suggest using the wget program to download the website rather than Perl; Perl isn't that well suited to the problem.

Mimisbrunnr
But I need to do it with Perl, not the shell. I'm familiar with wget and its abilities. Thanks,
soulSurfer2010
+1  A: 

Look at lwp-mirror and follow its lead.
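For reference, lwp-mirror is a small script that ships with libwww-perl and wraps LWP::Simple::mirror. A minimal sketch of the same call from Perl, using the example URL from the question (it fetches one URL per call, so you would still need a list of pages):

    use strict;
    use warnings;
    use LWP::Simple qw(mirror);

    # Fetch a single page, skipping the download if the local copy is
    # already up to date; mirror() returns the HTTP status code.
    my $url    = 'http://www.K.com/goo/1.html';
    my $file   = '1.html';
    my $status = mirror($url, $file);
    print "mirror() returned HTTP status $status for $url\n";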

Shlomi Fish
+2  A: 

Assuming you don't have access to the server's filesystem, you can't be guaranteed to achieve this unless each directory has an index of the files it contains.

The normal way would be to use a web crawler, and hope that all the files you want are linked to from pages you find.
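A hedged sketch of such a crawler (not part of this answer), using WWW::Mechanize from CPAN and the example URL from the question; it saves any linked page ending in .html that lives under the starting folder, and only finds pages that are linked from somewhere it can reach:

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $base = 'http://www.K.com/goo/';
    my $mech = WWW::Mechanize->new( autocheck => 0 );

    my %seen;
    my @queue = ($base);

    while ( my $url = shift @queue ) {
        next if $seen{$url}++;
        $mech->get($url);
        next unless $mech->success;

        # Save HTML pages to disk, named after the last part of the path.
        if ( $url =~ /\.html?$/i ) {
            my ($name) = $url =~ m{([^/]+)$};
            open my $fh, '>', $name or next;
            print {$fh} $mech->content;
            close $fh;
        }

        # Queue links that stay under the starting folder.
        for my $link ( $mech->links ) {
            my $abs = $link->url_abs->as_string;
            push @queue, $abs if index( $abs, $base ) == 0;
        }
    }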

ishnid
A: 

There are also a number of useful modules on CPAN with names like "Spider" or "Crawler". But ishnid is right: they will only find files that are linked from somewhere on the site. They won't find every file that's on the file system.

AmbroseChapel
A: 

You can also use curl to get all the files from a website folder. Look at the curl man page and go to the -o/--output section, which gives you a good idea of how to do that. I have used this a couple of times.

Raghuram
But I need to do it with Perl, not the shell. Thanks,
soulSurfer2010
I tried something like this from my Perl code and it worked fine:

    my @cmd = "curl --proxy proxy.net -D cmd_status.out -s -o $OUT_FILE_NAME \"file_list\"";
    print "@cmd\n";
    open( PCMD, join( "", @cmd ) . "|" );
    my $content = join( "", <PCMD> );
    close(PCMD);

After this you can check the status file (cmd_status.out) for the command status.
Raghuram