I need to transfer lots of small files to a remote computer from within my Java program, and it has to be really fast. Could somebody suggest the best way to do this? Should I use an existing protocol implementation, maybe FTP?

One important thing is that most files will stay the same from one transfer to the next, or the differences will be minor, so I was thinking of using Git for that purpose. Does anyone have experience with something like this?

A: 

How do you feel about compressing those files and then using FTP? Are you able to decompress them on the receiver's side?

Git is a version control system. There's no need to add Git's own files on top of yours if you are not going to check the files out later. I'd rather use FTP.

Here's a nice article about Java FTP libraries (or you can use a system call to a console FTP client, but I don't like this idea)

Eedoh
FTP is an option, but since I expect to transfer a very large number of small files, I expect it to be slow. And the speed at which those files are transferred is absolutely crucial.
markovuksanovic
Well yes, that's why I recommended compressing the files first. It's not hard to implement zip compression in Java. You will transfer one big file (the zip archive) rather than a lot of small ones, so there's much less overhead. The FTP client will have to do a login, a logout and one transfer command, and that's all. Speed will not suffer this way. P.S. If those files are plain text, the compression ratio will be high as well, so you get multiple advantages.
Eedoh
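For what it's worth, the "one archive instead of many files" idea above can be sketched with the JDK's own java.util.zip classes. This is only a minimal sketch: the class name ZipBundler and the method zipDirectory are made up for illustration, and it assumes the archive is written outside the directory being packed.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of any library mentioned in this thread.
public class ZipBundler {
    /** Packs every regular file under sourceDir into one ZIP archive; returns the entry count. */
    public static int zipDirectory(Path sourceDir, Path zipFile) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(sourceDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : files) {
                // ZIP entry names use forward slashes regardless of platform.
                String name = sourceDir.relativize(file).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(name));
                Files.copy(file, zip);
                zip.closeEntry();
            }
        }
        return files.size();
    }
}
```

The resulting single file can then be sent with whatever transport is chosen (FTP or otherwise) and unpacked on the receiving side.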
And what about the fact that most of those files will be the same (most likely more than 90%)? Transferring them is, IMHO, just overkill.
markovuksanovic
A: 

Who is receiving the files that you send? Another application? You could use messaging software such as ActiveMQ,

or stick with the java.net APIs for FTP.

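// Requires: import java.io.*; import java.net.*;
URL url = new URL("ftp://user:password@server/filename;type=i");
URLConnection urlc = url.openConnection();
InputStream is = urlc.getInputStream(); // To download
// An FTP URL connection handles one direction only, so upload on a separate connection.
OutputStream os = url.openConnection().getOutputStream(); // To upload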

I'm wondering why you want to involve Git. Does it provide any API to find deltas etc.? I don't think so. Git is a version control system as far as I know.

ring bearer
The files need to be copied to another machine so that they can be processed by another application. As I already mentioned, I expect to have a large number of small files. Most of those files will be the same or have only a small diff. There will be very few files that are new and need to be transferred completely. Something like a version control system seems like a reasonable option.
markovuksanovic
No, you would be misusing a version control system for something trivial. So what if there is a large number of files? Why can't you use JMS or plain FTP? After all, the files are small!
ring bearer
And what if I had 5000 files, of which only a few hundred need to be transferred (either because they are modified or new)? How would JMS perform in that situation?
markovuksanovic
Note that JMS and FTP only handle the data transfer. They will not identify which of the 5000 files need to be transferred. You will have to build that kind of logic yourself.
ring bearer
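One possible shape for "that kind of logic" is a manifest of content hashes: record a hash per file after each transfer, then send only files whose hash is new or different. The sketch below is just one way to do it under stated assumptions. ChangeDetector, manifest and changed are invented names, SHA-256 is an arbitrary choice, and how the previous manifest is stored between runs is left open.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for deciding which files to send; names are illustrative only.
public class ChangeDetector {
    /** Maps each file's path (relative to root) to the hex SHA-256 of its content. */
    public static Map<String, String> manifest(Path root)
            throws IOException, NoSuchAlgorithmException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, String> result = new TreeMap<>();
        for (Path file : files) {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            // The files are small, so reading each one fully into memory is acceptable here.
            for (byte b : sha.digest(Files.readAllBytes(file))) {
                hex.append(String.format("%02x", b));
            }
            result.put(root.relativize(file).toString().replace('\\', '/'), hex.toString());
        }
        return result;
    }

    /** Returns paths that are new or whose hash differs from the previous manifest. */
    public static List<String> changed(Map<String, String> previous, Map<String, String> current) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : current.entrySet()) {
            if (!e.getValue().equals(previous.get(e.getKey()))) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

Only the paths returned by changed(...) would then be handed to JMS, FTP, or whatever transport is in use. Files deleted on the sender's side are not reported by this sketch.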
A: 

The most efficient way to transfer lots of small files is as an archive, e.g. ZIP or TAR. If your network is relatively slow, compressing the archive before transmission will make a big difference. But if the network is really fast, compression may actually make the total transfer time longer. The other factor that makes a big difference is the rate at which the file system can read and (especially) create files.

The Git protocol can be really fast, but it achieves this by only sending files that have changed, and (where possible) sending differences instead of complete files. This approach cannot be used for regular file transfer. Rdist and rsync are older UNIX / Linux tools that take the same (differential) approach to transferring files as Git and other version control systems. They won't help you for the same reasons as Git won't ... in general.

Stephen C
Well, I actually expect most of the files to be the same and very few of them to have differences. I expect even fewer files to be new.
markovuksanovic
A: 

The Apache VFS project is a Java library that you can use from your program to copy files between file systems (e.g. copy local files to FTP/SCP/HTTP).

Copying can be configured so that only files in the source that are newer than the destination are copied, reducing the amount of data sent.

Links

  1. Apache VFS
  2. the file systems supported.
mdma
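To make the "only copy files that are newer than the destination" rule concrete, here is a rough sketch using plain java.nio between two local directories. It is not the Apache VFS API; NewerOnlyCopy and copyNewer are made-up names, and comparing timestamps at millisecond granularity is an assumption chosen to tolerate file systems that store modification times with different precision.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrates the "newer than destination" rule only; this is not Apache VFS code.
public class NewerOnlyCopy {
    /** Copies files that are missing from targetDir or newer in sourceDir; returns what was copied. */
    public static List<String> copyNewer(Path sourceDir, Path targetDir) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(sourceDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<String> copied = new ArrayList<>();
        for (Path src : files) {
            Path rel = sourceDir.relativize(src);
            Path dst = targetDir.resolve(rel);
            boolean stale = !Files.exists(dst)
                    || Files.getLastModifiedTime(src).toMillis()
                       > Files.getLastModifiedTime(dst).toMillis();
            if (stale) {
                Files.createDirectories(dst.getParent());
                // COPY_ATTRIBUTES carries the modification time over, so an
                // unchanged file is skipped on the next run.
                Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING,
                        StandardCopyOption.COPY_ATTRIBUTES);
                copied.add(rel.toString().replace('\\', '/'));
            }
        }
        return copied;
    }
}
```

A timestamp check is cheap but can miss edits that keep the old modification time, so it trades some accuracy for speed compared with hashing file contents.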
A: 

From your description, rsync is an absolutely perfect fit for your requirements, much superior to the alternatives that have been offered.

Michael Borgwardt
I'm starting to think so as well. I've found a Windows implementation (http://www.itefix.no/i2/node/10650), but I will have to check how it performs before I put it to use.
markovuksanovic
Can rsync be configured so that, after the initial copy, it deletes a remote file once it has been deleted locally? I have tried it a little and it seems to leave those files in the remote folder.
markovuksanovic
Oh, --delete does the job :)
markovuksanovic
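Since the question was about doing this from a Java program, one way to drive rsync is to launch it as an external process. The following is only a sketch: RsyncMirror, buildCommand and run are hypothetical names, the example target is a placeholder, and it assumes an rsync binary is on the PATH with remote access (e.g. over SSH) already set up. The flags used are standard rsync options: -a (archive mode), -z (compress during transfer) and --delete (remove remote files that no longer exist locally).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the rsync command line; assumes rsync is installed.
public class RsyncMirror {
    /** Builds an rsync command that mirrors the contents of localDir to remoteTarget. */
    public static List<String> buildCommand(String localDir, String remoteTarget,
                                            boolean deleteRemoved) {
        List<String> cmd = new ArrayList<>();
        cmd.add("rsync");
        cmd.add("-az");
        if (deleteRemoved) {
            cmd.add("--delete");
        }
        // A trailing slash tells rsync to copy the directory's contents, not the directory itself.
        cmd.add(localDir.endsWith("/") ? localDir : localDir + "/");
        cmd.add(remoteTarget);
        return cmd;
    }

    /** Runs the command, forwarding rsync's output to this process; returns its exit code. */
    public static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

For example, RsyncMirror.run(RsyncMirror.buildCommand("out", "user@host:/data/in", true)) would mirror the local out directory to the placeholder remote path and return rsync's exit code, where 0 indicates success.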