Hello. I'm trying to write my first Perl program. If you think that Perl is a bad language for the task at hand, tell me what language would solve it better.

The program tests connectivity between a given machine and a remote Apache server. First, the program requests the directory listing from the Apache server, then it parses the list and downloads all the files one by one. Should there be a problem with a file (the connection resets before reaching the specified Content-Length), this should be logged and the next file retrieved. There is no need to save the files or even check their integrity; I only need to log the time it takes to complete and all cases where the connection resets.

To retrieve the list of links from the Apache-generated directory index I plan to use a regexp similar to

/href=\"([^\"]+)\"/

Admittedly, the regexp is not debugged yet.

What is the "reference" way to do an HTTP request from Perl? I googled and found examples using many different libraries, some of them commercial. I need something that can detect disconnections (timeout or TCP reset) and handle them.

Another question: how do I store everything captured by my regexp when searching globally, as a list of strings, with minimal coding effort?
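
Ideally I would like to end up with something as short as this hypothetical sketch (untested, and the variable name is a placeholder):

    # collect every href value from the already-fetched index page
    my @links = $html =~ /href="([^"]+)"/g;   # //g in list context returns all captures
    print "$_\n" for @links;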

+9  A: 

You have many questions in one. The answer to the question in the title of your post is to use LWP::Simple.

Most of your other questions are answered in perlfaq9 with appropriate pointers to further information.
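
For instance, a minimal LWP::Simple sketch (the URL is a placeholder and error handling is deliberately thin):

    use strict;
    use warnings;
    use LWP::Simple;

    my $url = 'http://example.com/files/';   # placeholder
    my $content = get($url);                 # returns undef on failure
    die "Couldn't fetch $url\n" unless defined $content;
    print length($content), " bytes retrieved\n";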

Sinan Ünür
+3  A: 

As a more general answer: Perl is a perfectly fine language for doing HTTP requests, as are a host of other languages. If you're familiar with Perl, don't hesitate; there are many excellent libraries available to do what you need.

Robert P
+4  A: 

As for the parsing-markup-with-regular-expressions part of your question: DON'T!

http://htmlparsing.icenine.ca explains some of the reasons why you shouldn't do this. Even though what you're attempting to parse seems simple, use a proper parser.
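
For example, here is a sketch using HTML::LinkExtor from the HTML::Parser distribution (one option among several proper parsers; $html_of_index_page is assumed to already hold the listing):

    use strict;
    use warnings;
    use HTML::LinkExtor;

    my @hrefs;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
    });
    $parser->parse($html_of_index_page);
    $parser->eof;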

genio
Agree with the *DON'T* part. Don't agree with linking to a page that really doesn't explain anything.
Sinan Ünür
Thirded. There are many libraries out there to help you with this, often included in standard Perl distributions. Don't reinvent the wheel, especially when it's a tricky multi-part wheel with six dozen caveats about wheel length, circumference, axle size, and maximum rotation speed!
Robert P
The problem as I see it meets all the preconditions for using a regexp listed at the page you linked. I need to parse the directory listing of a known Apache version configured in a known way; also, I may need the ability to download only files with a certain extension and leave other files unchecked, which is easy with a regex. I already feel bad about using the worst coding practices, though.
Muxecoid
Although the general case calls for an HTML parser, in this case (a directory listing from Apache) a regex is not so bad. Just use a little perspective. However, my HTML::SimpleLinkExtor module is even easier than a regex. :)
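
Something along these lines is all it takes (a sketch; $html_of_index_page is assumed to already hold the listing):

    use HTML::SimpleLinkExtor;

    my $extor = HTML::SimpleLinkExtor->new;
    $extor->parse($html_of_index_page);
    my @hrefs = $extor->href;   # just the href attribute values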
brian d foy
@Muxecoid If those are your requirements, maybe you should use wget ( http://www.gnu.org/software/wget/ ) or cURL ( http://curl.haxx.se/ ).
Sinan Ünür
+9  A: 

As far as the whole problem description goes, I would use WWW::Mechanize. Mechanize is a subclass of LWP::UserAgent that adds stateful behavior and HTML parsing. With mech, you can just do $mech->get($url_of_index_page), and then use $mech->find_all_links(criteria) to select the links to follow.
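
A rough sketch of that approach (the index URL and the url_regex filter are placeholders; autocheck is turned off so a failed download is reported rather than fatal):

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new( autocheck => 0, timeout => 30 );
    $mech->get('http://example.com/files/');                               # placeholder index URL
    my @links = $mech->find_all_links( url_regex => qr/\.(?:log|txt)$/ );  # placeholder filter

    for my $link (@links) {
        my $start    = time;
        my $response = $mech->get( $link->url_abs );
        printf "%s: %s in %d s\n", $link->url,
            $response->is_success ? 'ok' : 'FAILED (' . $response->status_line . ')',
            time - $start;
    }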

hobbs
+1 for Mech. It's kewl.
friedo
This should solve the problem, thanks.
Muxecoid
Oops, it doesn't solve the problem after all. I want to minimize the use of modules that are not in the standard library. Is there any core function to do this? (Perl 5 build 5.002)
Muxecoid
Perl 5.002 is ancient (almost 15 years old at this point!). You're not going to find standard HTTP modules in its library, and I strongly suggest upgrading to a newer version of perl.
Oesor