I'm a Perl programmer with some nice scripts that fetch HTTP pages (from a text-file list of URLs) with cURL and save them to a folder.

However, the number of pages to get is in the tens of millions. Sometimes the script fails around number 170,000 and I have to restart it manually. It reads each URL, checks whether that page has already been downloaded, and skips it if so. But with a few hundred thousand pages done, it still takes a few hours to skip back up to where it left off. Obviously, this is not going to pan out in the end.

So, I'm thinking a solution is to build a Visual Basic program that opens the command prompts, collects console output, and restarts the script at the last missed number if needed.

I've never made a VB program, but I hear it's cake. Could I get a layman's explanation of how to do this (open prompts, send commands, capture output, restart prompts)? Or is there a better way to solve my problem?

+1  A: 

My suggestion would be to forget the VB and the cURL, and use either the LWP or WWW::Mechanize Perl modules to fetch your pages. You can then handle errors gracefully in your script without needing to resort to VB.
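As a sketch of that approach (the `fetch_url` helper and the commented-out driver loop are illustrative, not from the question), LWP::UserAgent lets you check each response and keep going on failure:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Returns the page content, or undef on any failure,
# so the calling loop can log the URL and move on.
sub fetch_url {
    my ($ua, $url) = @_;
    my $response = $ua->get($url);
    return $response->is_success ? $response->decoded_content : undef;
}

my $ua = LWP::UserAgent->new( timeout => 30 );

# Hypothetical driver loop over a text-file list of URLs:
# open my $fh, '<', 'urls.txt' or die "Can't open URL list: $!";
# while ( my $url = <$fh> ) {
#     chomp $url;
#     my $content = fetch_url( $ua, $url );
#     warn "failed: $url\n" unless defined $content;
# }
```

Because a failed fetch returns undef instead of killing the process, the script itself decides what a failure means, which removes the need for an external watchdog.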

Auctionitis
Sometimes it simply fails because it can't make a folder for some reason. Is there a way to encapsulate the entire script to catch all minor errors and keep it running?
Sho Minamimoto
@Sho: sure, don't `die` on errors you don't want to die from :) And wrap `try {}` blocks around things that might die on their own, check the error and decide whether to continue.
Ether
Ah, I've been trying try{} or do {} blocks, thanks for that tip.
Sho Minamimoto
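A minimal sketch of that wrap-and-continue pattern using core Perl's `eval {}` block (Try::Tiny gives the same shape with `try`/`catch`); the failing step here is just a placeholder, not code from the actual script:

```perl
use strict;
use warnings;

# Wrap a step that might die, then decide whether to continue.
# Returns 1 on success, 0 on a recoverable failure.
sub guarded_step {
    my ($should_fail) = @_;
    my $ok = eval {
        die "directory creation failed\n" if $should_fail;
        1;    # the real work would go here
    };
    if ( !$ok ) {
        warn "recoverable error, skipping: $@";
        return 0;    # caller logs it and moves on to the next URL
    }
    return 1;
}
```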
+2  A: 

Change how you are doing things. Maintain the queue of pages to check outside of the script. When you check one, mark it as viewed and record the date that you checked it.

When you restart your script, rebuild the queue from just the pages that haven't been checked within the time window.

A database might come in handy here.

Fix the problem and you don't have to build a lot of junk around the problem.
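A sketch of that queue using DBI with SQLite (the table and column names are made up for illustration): each page gets a row, checking a page stamps it with a time, and a restart simply selects the rows not yet checked.

```perl
use strict;
use warnings;
use DBI;

# In-memory database for the sketch; use a filename (e.g. 'queue.db')
# so the queue survives restarts.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

$dbh->do(q{
    CREATE TABLE IF NOT EXISTS pages (
        url        TEXT PRIMARY KEY,
        checked_at INTEGER          -- epoch seconds; NULL if never fetched
    )
});

# Load the queue (this would normally come from the URL list file).
my $insert = $dbh->prepare('INSERT OR IGNORE INTO pages (url) VALUES (?)');
$insert->execute($_) for qw( http://example.com/1 http://example.com/2 );

# Mark one page as checked, recording when.
$dbh->do( 'UPDATE pages SET checked_at = ? WHERE url = ?',
    undef, time(), 'http://example.com/1' );

# On restart: only the unchecked pages come back.
my $todo = $dbh->selectcol_arrayref(
    'SELECT url FROM pages WHERE checked_at IS NULL');
```

With an on-disk file instead of `:memory:`, a crash at page 170,000 costs nothing: the next run's SELECT skips straight to the unchecked rows instead of re-reading hundreds of thousands of files.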

You say that sometimes you can't create a directory. That should be an easy problem to catch. However, that doesn't mean you can ignore it in your script. Not all errors are recoverable, but at least you can log the problem so you can investigate. How are you creating directories?

brian d foy
I'm creating directories based on a person's first name. Say the name 'Adam' is given. In a master directory, I make the 'a' folder. Inside that 'a' folder, I make 'ad'. Inside 'ad', I make 'ada'. Then I save the file in that last one, and I always have a unique file name. The thing is, I don't know a damn thing about databases. All I know is to make a text file and save things on a line, separated by some arbitrary delimiter.
Sho Minamimoto
Well, time to learn about databases then, isn't it? SQLite is a good place to start.
brian d foy
To create a multi-level path, use File::Path's make_path. It creates the whole structure at once so you don't have to do it stepwise yourself. It's like `mkdir -p`.
brian d foy
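For reference, a minimal make_path call building the nested 'a'/'ad'/'ada' layout described above (the `dir_for_name` helper and the 'master' base directory are placeholders, not from the actual script):

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;

# Map a name like 'Adam' to master/a/ad/ada, per the scheme above.
sub dir_for_name {
    my ($base, $name) = @_;
    my $lc = lc $name;
    my @parts = map { substr( $lc, 0, $_ ) } 1 .. 3;
    return File::Spec->catdir( $base, @parts );
}

my $dir = dir_for_name( 'master', 'Adam' );
make_path($dir);    # creates master, master/a, master/a/ad, master/a/ad/ada
```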
Yes, I already use make_path; it makes things easy. Do you have any good modules for Perl to interact with databases? I just need something as simple and quick as it gets; I'm not going to link databases or construct fancy calls. Just one row per page and columns for data, and it has to be fast to traverse.
Sho Minamimoto