A:

[Edited to reflect the fact that the original poster can't scp into his local computer from the server; I assume it's behind NAT or something of the sort]

[Edit 2: I'm keeping the current tunnel-based answer, for reference; but, since the original poster is unable to ssh back into his local machine, I'll assume something else is blocking the tunnel. See the suggestion at the end].

OK, you'll need to open a tunnel between the server and your home computer. So, ssh from your local computer (I assume it's Unix-based; you mentioned it's a Mac, so that's fine) into the server with this command:

ssh -R 10022:localhost:22 your_server_address

In brief, this will forward the server's port 10022 (it's a high (> 1024) port, so it's likely to be available) to your local computer's port 22 (which is where ssh usually listens). That is, once you've done that, if you ssh into the server's 10022 port, you're actually sshing into your local computer. If you want to test it, from the server, do:

ssh -p 10022 localhost

Log in with your local computer's username and password, and you should see its shell prompt. If you do this test, remember to log out afterwards, so as not to confuse yourself.
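To recap how the argument to -R in the tunnel command is read, here is a small sketch that splits it into its three fields. The variable names are mine, chosen only for illustration; the values are the ones used in this answer.

```shell
# The -R argument has the form remote_port:target_host:target_port.
spec="10022:localhost:22"

# Split the spec on ":" into its three fields.
IFS=: read -r remote_port target_host target_port <<EOF
$spec
EOF

# remote_port is opened on the server; target_host and target_port are
# resolved from the local computer's side of the connection.
echo "server port $remote_port -> $target_host:$target_port (as seen from the local computer)"
```

So "localhost" here means the local computer itself, not the server.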

Once you've opened the tunnel, keep that connection open. You may use it to run the bash command line that downloads the PDF etc, but that's not necessary.

Then, on the server, try the following command line:

while read line; do python python_script.py -l "$line"; scp -P 10022 *.pdf localhost:path/to/put/files/; rm *.pdf; done < pdfURLs.txt

A few things to keep in mind:

  • This waits until scp has finished, and only then will the python script download the next PDF. You mentioned you effectively wanted this, so as not to keep the PDF files on the server for long.
  • This copies all PDF files from the current directory to your local computer (and then erases them), so preferably run this from a previously empty directory.
  • I assume you can scp without having to type a password (using shared key authentication, for instance), otherwise it might get a bit annoying, having to retype your password all the time.
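One caveat with the one-liner: because the commands are separated by ";", the rm runs even if scp failed, so a dropped tunnel would lose the file. A minimal sketch of a safer variant follows, assuming the same port and destination path as above. The function name ship_pdfs and the overridable SCP variable are my own additions (hypothetical), there only so the logic can be tried without a real tunnel.

```shell
# Copy every PDF in the current directory through the tunnel, and delete
# the local copies only if the copy succeeded.
# SCP defaults to the real scp; setting SCP=true or SCP=false allows a dry run.
ship_pdfs() {
    ${SCP:-scp} -P 10022 ./*.pdf "localhost:path/to/put/files/" && rm -f ./*.pdf
}
```

In the loop, this would replace the scp and rm steps: while read -r line; do python python_script.py -l "$line"; ship_pdfs; done < pdfURLs.txt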

That should do it.

[Edited to add this alternative, for when the tunnel doesn't work]

If that fails, I can only assume something else is blocking your ssh/scp from the server to your local machine. In that case, you may try something different: from your local machine, do

while read line; do ssh -n server_address "cd tmp_download_directory && rm -f *.pdf && python python_script.py -l '$line'" && scp server_address:tmp_download_directory/*.pdf /local/path/to/put/files/; done < pdfURLs.txt; ssh server_address "rm -f tmp_download_directory/*.pdf"

(The "-n" switch to ssh is necessary: it redirects ssh's standard input from /dev/null, so that ssh doesn't swallow the remaining lines of pdfURLs.txt that the loop still has to read.)
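If you want to see the effect without involving ssh at all, here is a small local demonstration. The function eats_stdin is a hypothetical stand-in for any command that reads standard input, as ssh does without -n; redirecting from /dev/null mimics what -n does.

```shell
# Stand-in for a command that consumes whatever is left on stdin.
eats_stdin() { cat > /dev/null; }

# Without protection: the first iteration swallows the remaining lines.
count_without_n=0
while read -r line; do
    count_without_n=$((count_without_n + 1))
    eats_stdin
done <<'EOF'
url1
url2
url3
EOF

# With stdin redirected from /dev/null: every line is processed.
count_with_n=0
while read -r line; do
    count_with_n=$((count_with_n + 1))
    eats_stdin < /dev/null
done <<'EOF'
url1
url2
url3
EOF

echo "$count_without_n $count_with_n"   # prints "1 3"
```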

rbp
Let me elaborate a little. The python script is what does the downloading: it goes to the URL, downloads a PDF from the page, and then assigns it a new file name using some metadata. So a value for $line would be baseURL.com/content/hashcode, and on the remote server the file is saved as pdf_content_title.pdf (not the value that was passed in $line). I guess I could modify the python script to name the file after the URL.
Jordan
@rbp: The subshell is probably not necessary.
Dennis Williamson
@Jordan: I see what you mean. Does the python script currently print anything? Because it could print the saved filename, and then scp could use that. Otherwise, you might need something a bit more complex.
rbp
@rbp: when you say "...until the scp has finished", do you mean that the python script won't run again until the file is fully transferred? Because for my purposes that might be a good thing: the reason I need to transfer the files to my computer is that I don't have enough space on the server for all the PDFs.
Jordan
@Jordan: yes, that's what I meant. OK, that'd just mean removing the parentheses and ampersand from the command. But that still raises the question of whether you can scp *from* the server into your home computer (and is your home computer running Linux/Unix as well?)
rbp
@rbp: the python script does print something: it confirms that the download worked and says where the file was saved (including the full path). How might I pass that into the bash script? Alternatively, the python script has a variable holding the full path of the new PDF, so I might be able to modify the script to save the path to another file that can be fed into the scp command. EDIT: @rbp: I am running OS X, so I do have a Unix shell.
Jordan
@rbp: It should work just fine without the subshell. Just use `;`: `while ... python ...; scp ... done ...`
Dennis Williamson
@Dennis: Indeed :)
rbp
@Jordan: actually, if you're going to wait for scp to finish before downloading the next pdf anyway, and assuming there are no pdf files on the current directory that you don't want to transfer to your computer, you might change 'scp "$line" remote_computer:path/to/put/files; rm "$line"' to 'scp *.pdf remote_computer:path/to/put/files; rm *.pdf'. Does that work for your situation?
rbp
@rbp: I gave your suggestion a try, and now I am having trouble with the scp step. Basically, when the server tried to connect to my local computer, it got stuck trying to access port 22. I am not sure what to do: I don't have a firewall, and the remote login and file sharing options are enabled on my Mac. Any thoughts?
Jordan
OK, first: can you ssh from the server into your local machine? What address did you try to scp to? (If your personal computer is at home, most ISPs will give you an address that can't be reached from the outside.) If you can't access your personal computer from the server, we'll need to modify this solution (it can be done, it's just slightly more complicated). Also, just to make sure: can you ssh from your personal computer into itself (ssh 127.0.0.1)?
rbp
I have been unsuccessful using ssh from the server to my local machine. I tried to scp to the address my Mac gives me in the Sharing pane of System Preferences, namely "[email protected]". I can ssh into my own machine, though. Also note that I tried ssh/scp on different ports in addition to the standard one; I used the port scan in Network Utility on my Mac to find what I thought would be open ports.
Jordan
Right. Your personal computer is probably using an unreachable IP (starting with something like 192.168.something, or 10.something). You can probably still get around that; I'll test it and get back to you.
rbp
Ok, I've edited the answer to deal with that. Please try it and see if it works. I'm simply scping (and removing) "*.pdf", instead of trying to recover the exact name of each downloaded file. So please do this from an empty directory :)
rbp
Unfortunately, I am still having issues. I tried the edited answer, but no luck. I tried different ports once again, and still no luck. What I was able to do is use scp from my local computer to the remote server, so I was able to copy a file from the server to my computer (without first using ssh to get into the server and then using scp to copy from the server to the local computer).
Jordan
Try the solution I've added to the end of my answer (I've kept the original one for archiving purposes, since it's the more versatile one when the ssh tunnel is allowed). Please note that the bash loop now runs on the *local* machine.
rbp
IT WORKED!!!! Thank you for all your patience and help!
Jordan
Cool, I'm glad to know :) Please, if this indeed worked for you, don't forget to mark this as the accepted answer (I assume you're new here, so this is just a reminder, don't feel pressured if you're not satisfied)
rbp