I created a program that iterates over a bunch of files and invokes for some of them:

scp <file> user@host:<remotefile>

However, in my case there may be thousands of small files that need to be transferred, and scp opens a new ssh connection for each of them, which adds quite a bit of overhead.

I was wondering whether there is a solution where I keep one process running that maintains the connection, and to which I can send "requests" to copy over single files.

Ideally, I'm looking for a combination of some sender and receiver program, such that I can start a single process (1) at the beginning:

ssh user@host receiverprogram

And for each file, I invoke a command (2):

senderprogram <file> <remotefile>

and pipe the output of (2) to the input of (1), and this would cause the file to be transferred. In the end, I can just send process (1) some signal to terminate.

Preferably the sender and receiver programs are open source C programs for Unix. They may communicate using a socket instead of a pipe, or any other creative solution.

However, it is an important constraint that each file gets transferred at the moment I iterate over it: it is not acceptable to collect a list of files and then invoke one instance of scp to transfer all the files at once at the end. Also, I have only simple shell access to the receiving host.

Update: I found a solution for the problem of the connection overhead using the multiplexing features of ssh, see my own answer below. Yet, I'm starting a bounty because I'm curious to find if there exists a sender/receiver program as I describe here. It seems there should exist something that can be used, e.g. xmodem/ymodem/zmodem?

+1  A: 

Seems like a job for tar? Pipe its output to ssh, and on the other side pipe the ssh output back to tar.
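For example, something along these lines (an untested sketch; the file names and remote directory are placeholders):

tar -cf - file1 file2 | ssh user@host 'cd /remote/dir && tar -xf -'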

MSalters
No, tar won't provide his "transfer each file as I iterate over it" requirement.
Darron
Perhaps, if you keep a "tar -x" reading from input, and can send it requests by piping the output of multiple "tar -A" invocations to that input.
Bruno De Fraine
+4  A: 

It might work to use sftp instead of scp, and to place it into batch mode. Make the batch command file a pipe or UNIX domain socket and feed commands to it as you want them executed.
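For example, a rough sketch of the idea (untested; it assumes sftp will happily read its batch commands from a FIFO, and that authentication is non-interactive):

mkfifo /tmp/sftp-batch
sftp -b /tmp/sftp-batch user@host &

# keep one writer open so sftp does not see end-of-file after the first command
exec 3> /tmp/sftp-batch

# for each file:
echo "put <file> <remotefile>" >&3

# when finished:
echo "quit" >&3
exec 3>&-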

Security on this might be a little tricky at the client end.

Darron
+2  A: 

Use rsync over ssh if you can collect all the files to send in a single directory (or hierarchy of directories).
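For example, something along these lines (a sketch; the paths are placeholders):

rsync -az -e ssh /local/dir/ user@host:/remote/dir/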

If you don't have all the files in a single place, please give some more information as to what you want to achieve and why you can't pack all the files into an archive and send that over. Why is it so vital that each file is sent immediately? Would it be OK if the file was sent with a short delay (like when 4K worth of data has accumulated)?

Aaron Digulla
It's OK to buffer data while copying, but when the 'senderprogram' invocation completes successfully, I want to be certain that the file has been stored on the remote host. The reason is that I want to mark the copy status in a table; the table influences the behavior when iterating over the next files.
Bruno De Fraine
+20  A: 

I found a solution from another angle. Since version 3.9, OpenSSH supports session multiplexing: a single connection can carry multiple login or file transfer sessions. This avoids the set-up cost per connection.

For the case of the question, I can first open a connection that sets up a control master (-M) with a socket (-S) in a specific location. I don't need to run a remote command (-N).

ssh user@host -M -S /tmp/%r@%h:%p -N

Next, I can invoke scp for each file and instruct it to use the same socket:

scp -o 'ControlPath /tmp/%r@%h:%p' <file> user@host:<remotefile>

This command starts copying almost instantaneously!

You can also use the control socket for normal ssh connections, which will then open immediately:

ssh user@host -S /tmp/%r@%h:%p

If the control socket is no longer available (e.g. because you killed the master), this falls back to a normal connection. More information is available in this article.
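Putting it together, a per-file loop might look like this (a sketch; the directories are placeholders, and ssh -O exit only exists in newer OpenSSH releases, so with older versions simply kill the master process instead):

# reuse the master connection opened above for every file
for f in /local/dir/*; do
    scp -o 'ControlPath /tmp/%r@%h:%p' "$f" user@host:/remote/dir/
done

# tear down the master when finished
ssh -S /tmp/%r@%h:%p -O exit user@host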

Bruno De Fraine
The easiest way is still to use sshfs if you use Linux or any other OS that has FUSE support.
ypnos
You can also set this option in the .ssh/config of the user running the job. There you can match on host and provide ControlMaster and ControlPath.
olle
Very handy to know this. I've speeded up a batch script enormously using this technique. Thanks.
Jonathan
+1  A: 

I think that the GNOME desktop uses a single SSH connection when accessing a share through SFTP (SSH). I'm guessing that this is what's happening because I see a single SSH process when I access a remote share this way. So if this is true you should be able to use the same program for this purpose.

Newer versions of GNOME use GVFS through GIO to perform all kinds of I/O through different backends. The Ubuntu package gvfs-bin provides various command line utilities that let you manipulate the backends from the command line.

First you will need to mount your SSH folder:

gvfs-mount sftp://user@host/

And then you can use gvfs-copy to copy your files. I think that all file transfers will be performed through a single SSH process. You can even use ps to see which process is being used.
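For example, a single transfer might look like this (a sketch; the exact URI form is an assumption):

gvfs-copy <file> sftp://user@host/<remotefile>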

If you feel more adventurous you can even write your own program in C or in some other high level language that provides an API to GIO.

potyl
+2  A: 

It's a nice little problem. I'm not aware of a prepackaged solution, but you could do a lot with simple shell scripts. I'd try this at the receiver:

#!/bin/ksh
# this is receiverprogram

while true
do
  typeset -i size
  read filename  # read filename sent by sender below
  read size      # read size of file sent
  read -N $size contents  # read all the bytes of the file
  print -rn -- "$contents" > "$filename"  # write the bytes verbatim (raw, no newline)
done

On the sender side I would create a named pipe and have ssh read from the pipe, e.g.,

mkfifo $HOME/my-connection
ssh remotehost receiver-script < $HOME/my-connection

Then to send a file I'd try this script

#!/bin/ksh
# this is senderprogram

FIFO=$HOME/my-connection

localname="$1"
remotename="$2"
print "$remotename" > $FIFO
size=$(stat -c %s "$localname")
print "$size" > $FIFO
cat "$localname" > $FIFO

If the file size is large you probably don't want to read it in one go, so something on the order of

BUFSIZ=8192

rm -f "$filename"
while ((size >= BUFSIZ)); do
  read -N $BUFSIZ buffer
  print -rn -- "$buffer" >> "$filename"
  size=$((size - BUFSIZ))
done
read -N $size buffer
print -rn -- "$buffer" >> "$filename"

Eventually you'll want to extend the script so you can pass through chmod and chgrp commands. Since you trust the sending code, it's probably easiest to structure the thing so that the receiver simply calls shell eval on each line, then send stuff like

print filename='"'"$remotename"'"' > $FIFO
print "read_and_copy_bytes " '$filename' "$size" > $FIFO

and then define a local function read_and_copy_bytes. Getting the quoting right is a bear, but otherwise it should be straightforward.
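Such a receiver-side function might look roughly like this (only a sketch of the idea, in the same ksh style):

function read_and_copy_bytes {
  typeset filename="$1"
  typeset -i size="$2"
  typeset -i BUFSIZ=8192
  rm -f "$filename"
  while ((size >= BUFSIZ)); do
    read -N $BUFSIZ buffer
    print -rn -- "$buffer" >> "$filename"
    size=$((size - BUFSIZ))
  done
  ((size > 0)) && read -N $size buffer && print -rn -- "$buffer" >> "$filename"
}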

Of course, none of this has been tested! But I hope it gives you some useful ideas.

Norman Ramsey
Thanks for the ideas! I tried the simplest version, which after some small fixes can transfer one file, but the receiver goes into an endless loop after that: as far as I can tell, `read` seems to return immediately although there is nothing to read yet...
Bruno De Fraine
Sorry about the typos. Once I get the post into my web browser it's impossibly slow and I miss stuff. It's strange that `read` doesn't block. Maybe read -t 864000, which waits ten days to time out?
Norman Ramsey
The timeout doesn't help. I think `read` immediately returns an empty string when the file descriptor from which it reads is closed. Try this: ksh -xc 'while true; do read l; print -R $l; done' < /dev/null
Bruno De Fraine
Hmmm. I thought the semantics of the *named* FIFO was that it should stay open, but I guess I need to go look into that... Sorry!
Norman Ramsey
A: 
rsync -avlzp user@remotemachine:/path/to/files /path/to/this/folder

This will use SSH to transfer the files, and it won't be slow.

Paul Betts
A: 

Keep it simple, write a little wrapper script that does something like this.

  1. tar the files
  2. send the tar-file
  3. untar on the other side

Something like this:

  1. tar -cvzf test.tgz files ....
  2. scp test.tgz user@host:.
  3. ssh user@host tar -xzvf test.tgz
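Put together as a tiny wrapper script, it might look something like this (untested; the host and paths are placeholders):

#!/bin/sh
# tar up the given files, copy the archive over, and unpack it remotely
set -e
tar -czf /tmp/test.tgz "$@"
scp /tmp/test.tgz user@host:.
ssh user@host tar -xzf test.tgz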

/Johan

Johan
I'm all for keeping it simple, but as explained, I can't transfer the files all at once. Also, this seems a less optimal version of `scp -rp files user@host`
Bruno De Fraine
The only difference is that you will transfer your data packed with tar and compressed with gzip, so the total amount of data sent over the network will be less.
Johan
All right, you mean the compression. Note that ssh connections can be compressed as well with the `-C` flag or the `Compression` option.
Bruno De Fraine
Interesting with -C flag, live and learn.
Johan
+3  A: 

Maybe you are looking for this: zssh

zssh (Zmodem SSH) is a program for interactively transferring files to a remote machine while using the secure shell (ssh). It is intended to be a convenient alternative to scp, allowing you to transfer files without having to open another session and re-authenticate yourself.

Luis Melgratti
+1  A: 

One option is Conch, an SSH client and server implementation written in Python using the Twisted framework. You could use it to write a tool which accepts requests via some other protocol (HTTP, Unix domain sockets, FTP, SSH or whatever) and triggers file transfers over a long-running SSH connection. In fact, I have several programs in production which use this technique to avoid multiple SSH connection setups.

Dickon Reed
+4  A: 

Have you tried sshfs? You could:

sshfs remote_user@remote_host:/remote_dir /mnt/local_dir

Where

  • /remote_dir is the directory you want to send files to on the system you are sshing into
  • /mnt/local_dir is the local mount point

With this setup you can just cp a file into local_dir and it will be sent over sftp to remote_host, into its remote_dir.

Note that there is a single connection, so there is little in the way of overhead

You may need to use the flag -o ServerAliveInterval=15 to maintain an indefinite connection

You will need to have fuse installed locally and an SSH server supporting (and configured for) sftp
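For example (a sketch; the mount point and file names are placeholders):

sshfs -o ServerAliveInterval=15 remote_user@remote_host:/remote_dir /mnt/local_dir
cp <file> /mnt/local_dir/<remotefile>   # travels over the existing sftp connection
fusermount -u /mnt/local_dir            # unmount when finished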

alif
+2  A: 

This would work, and this general approach is more or less right for other things as well.

(
iterate over file list
  for each matching file
   echo filename
) | cpio -H newc -o | ssh remotehost cd location \&\& cpio -H newc -imud
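A concrete version of that sketch might look like this (untested; the paths are placeholders):

find /local/dir -type f -print |
  cpio -H newc -o |
  ssh remotehost 'cd /remote/dir && cpio -H newc -imud'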
Joshua
Hi Joshua, cpio looks like a very useful building block! However, if I try for example `cpio -H newc -o | { cd out-dir; cpio -H newc -im; }` and then enter a few names on stdin, I notice the previous file is not copied until the next one is entered, since cpio waits to read an entire block?
Bruno De Fraine
Yeah, but if you're not doing it keyboard-interactive it doesn't matter.
Joshua
+1  A: 

There was a very similar question here a couple of weeks ago. The accepted answer proposed to open a tunnel when ssh'ing to the remote machine and to use that tunnel for scp transfers.

innaM
+1  A: 

Perhaps CurlFTPFS might be a valid solution for you.

It looks like it just mounts an external computer's folder to your computer via SFTP. Once that's done, you should be able to use your regular cp commands and everything will be done securely.

Unfortunately I was not able to test it out myself, but let me know if it works for ya!

Edit 1: I have been able to download and test it. As I feared, it does require that the remote machine run an FTP server. However, I have found another program which has exactly the same concept as what you are looking for. sshfs allows you to connect to the remote computer without needing any special server beyond SSH. Once you have mounted one of its folders, you can use your normal cp commands to move whatever files you need to move. Once you are done, it should then be a simple matter of umount /path/to/mounted/folder. Let me know how this works out!

Mike