tags:
views: 190
answers: 7

I have a bash file that contains wget commands to download over 100,000 files, totaling around 20 GB of data.

The bash file looks something like:

wget http://something.com/path/to/file.data

wget http://something.com/path/to/file2.data

wget http://something.com/path/to/file3.data

wget http://something.com/path/to/file4.data

And there are exactly 114,770 rows of this. How reliable would it be to ssh into a server I have an account on and run this? Would my ssh session time out eventually? Would I have to stay ssh'd in the entire time? What if my local computer crashed or got shut down?

Also, does anyone know how many resources this would take? Am I crazy to want to do this on a shared server?

I know this is a weird question, just wondering if anyone has any ideas. Thanks!

+1  A: 

It depends on the reliability of the communication medium, the hardware, etc.!

You can use screen to keep it running while you disconnect from the remote computer.
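A minimal sketch of that workflow (the session name and script name here are just placeholders):

$ screen -S downloads        # start a named screen session
$ ./download.sh              # run the script inside it
# press Ctrl-A then D to detach; the script keeps running
$ screen -r downloads        # reattach later to check on it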

Mehrdad Afshari
A: 

Start it with

nohup ./scriptname &

and you should be fine. I would also recommend logging the progress so that you can find out where it stopped, if it does.

wget url >> logfile.log 2>&1

could be enough (the 2>&1 is needed because wget writes its progress output to stderr).

To monitor progress live you could:

tail -f logfile.log
Jonas Elfström
Thanks, totally forgot.
Jonas Elfström
A: 

You want to disconnect the script from your shell and have it run in the background (using nohup), so that it continues running when you log out.

You also want some kind of progress indicator, such as a log file that records every file that was downloaded, along with any error messages. By default, nohup appends stdout and stderr to nohup.out unless you redirect them yourself. With such a file, you can pick up broken downloads and aborted runs later on.

Give it a test-run first with a small set of files to see if you got the command down and like the output.
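For example, one rough way to find and resume the failures afterwards (the file names are just placeholders, and the exact error strings depend on your wget version):

$ grep -E "ERROR|failed" nohup.out                  # list the downloads that went wrong
$ wget -c http://something.com/path/to/file.data    # -c resumes a partial download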

Thilo
A: 

I suggest you detach it from your shell with nohup.

$ nohup myLongRunningScript.sh > script.stdout 2>script.stderr &
$ exit

The script will run to completion - you don't need to be logged in throughout.

Do check for any options you can give wget to make it retry on failure.
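For instance (a sketch; check your wget version's man page for the exact flags):

$ wget --tries=5 --waitretry=10 -c http://something.com/path/to/file.data
# --tries      number of retry attempts per file
# --waitretry  back off between retries, up to this many seconds
# -c           resume partially downloaded files
# -nc (skip files that already exist) is also worth a look if you have to re-run the whole script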

slim
+4  A: 

Use

# nohup ./scriptname &> logname.log &

This will ensure

  • The process will continue even if ssh session is interrupted
  • You can monitor it, as it is in action

I would also recommend having the script print a progress marker at regular intervals; that will be good for log analysis, e.g. echo "1000 files copied".
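A rough sketch of that, assuming you turn the script into a plain list of URLs (urls.txt is a made-up name):

count=0
while read -r url; do
    wget "$url" >> logfile.log 2>&1
    count=$((count + 1))
    if [ $((count % 1000)) -eq 0 ]; then
        echo "$count files copied"
    fi
done < urls.txt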


As far as resource utilisation is concerned, it depends entirely on the system, and mostly on the network characteristics. In theory you can calculate the time from just the data size and your bandwidth, but in real life delays, latency, and data loss come into the picture.

So make some assumptions, do some maths, and you'll get the answer :)
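For example, ignoring all of that: 20 GB is roughly 160,000 megabits, so on a sustained 10 Mbit/s link it would take about 16,000 seconds, i.e. around 4.5 hours; at 1 Mbit/s it would be more like 44 hours. The real figure will be worse once the per-connection overhead of 114,770 separate requests is added.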

Mohit Nanda
A: 

If it is possible, generate MD5 checksums for all of the files and use them to check that they all transferred correctly.
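A rough sketch, assuming you can run commands on the source server (the file and path names are made up):

# on the source server, from the directory that holds the files:
$ find . -name '*.data' -exec md5sum {} + > checksums.md5
# on your server, from the directory you downloaded into:
$ md5sum -c checksums.md5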

schnaader
How do you do that without having the files first? If he can calculate MD5 checksums on the server he downloads from, he probably does not need to resort to wget/HTTP to move them around.
Thilo
A: 

It may be worth looking at an alternative tool, like rsync. I've used it on many projects and it works very, very well.
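A rough sketch of what that might look like (the hostname and paths are made up, and it assumes you have shell/rsync access to the source machine rather than just HTTP):

$ rsync -avP user@something.com:/path/to/data/ ./data/
# -a  archive mode (recursive, preserves timestamps and permissions)
# -v  verbose
# -P  show progress and keep partial files, so an interrupted transfer can be resumed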

Joe Casadonte