views: 2897
answers: 13

I have roughly 5 million small (5-30 KB) files in a single directory that I would like to copy to another machine on the same gigabit network. I tried using rsync, but it would slow down to a crawl after a few hours of running, I assume because rsync has to check the source and destination for each file?

My second thought was to use scp, but I wanted to get an outside opinion to see if there was a better way. Thanks!

+7  A: 

I'm sure the fact that you have all FIVE MILLION files in a single directory will throw many tools into a tizzy. I'm not surprised that rsync didn't handle this gracefully - it's quite a "unique" situation. If you could figure out a way to structure the files into some sort of directory structure, I'm sure the standard sync tools such as rsync would be much more responsive.
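
If restructuring is an option, a minimal sketch of the bucketing idea, assuming bash and a hypothetical path /data/bigdir, would be to spread the files across subdirectories keyed by the first two characters of each name:

  cd /data/bigdir
  for f in *; do                      # glob is expanded once, inside the shell
      prefix=${f:0:2}                 # first two characters of the filename
      mkdir -p "buckets/$prefix"      # one subdirectory per prefix
      mv -- "$f" "buckets/$prefix/"
  done

Because the loop iterates inside the shell rather than passing all the names to an external command, it avoids the argument-length limits, though expanding five million names will still use a fair amount of memory.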

However, just to give some actual advice - perhaps one solution would be to move the drive physically into the destination machine temporarily so you can copy the files on the actual server (not over the network). Then move the drive back and use rsync to keep things up to date.

Marc Novakowski
+1 for moving drive physically, it's way faster this way
Robert Gould
It sure beats copying everything on a jump drive and going back and forth...
VirtuosiMedia
+1  A: 

I'd see how a zip->copy->unzip performs

or whatever your favorite compression/archive system is.
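
For instance, a rough sketch of the zip route (paths, archive name, and host are placeholders):

  zip -q -r files.zip /data/bigdir                 # bundle everything into one archive
  scp files.zip host2:/data/                       # one large transfer instead of millions of tiny ones
  ssh host2 "cd /data && unzip -q files.zip"       # unpack on the far side

Note that this needs enough spare disk space on both machines to hold the archive, which the streamed tar-over-ssh approach below avoids.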

Keith Nicholas
yeah compressing them into one file would be a good idea too
Robert Gould
even just a tarball
Joel Coehoorn
+1  A: 

Pack them into a single file before you copy it, then unpack them again after it's copied.

ChrisW
+3  A: 

Robocopy is great for things like this. It will retry after network timeouts, and it also lets you set an inter-packet gap delay so it doesn't swamp the pipe.

[Edit]

Note that this is a Windows only application.
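
Something along these lines, for example (source and destination paths are placeholders):

  robocopy D:\bigdir \\host2\share\bigdir /E /R:3 /W:5 /IPG:50

/E copies subdirectories (including empty ones), /R and /W control the retry count and the wait between retries, and /IPG inserts an inter-packet gap in milliseconds so the transfer doesn't saturate the link.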

Scott Muc
Assuming you are on Windows, of course. The nice thing about Robocopy is that the app is responsible for iterating over the files; the problem with Unix utilities is that you might run out of shell space expanding the names.
Martin Beckett
+17  A: 

Something like this should work well:

tar cf - some/dir | gzip - | ssh host2 "tar xzf -"

Maybe also omit gzip and the "z" flag for extraction, since you are on a gigabit network.

sth
Is it necessary to gzip it, or does ssh compress the stream anyway? Or can be made to do it?
Thilo
ssh will compress the stream if you pass "-C". Over a lan I wouldn't bother with compressing the stream; over the Internet I probably would, unless it were already compressed.
Commodore Jaeger
Yes, you're right, probably no compression necessary. Edited.
sth
+1 since tarring/untarring will also provide an implicit integrity check
Ates Goral
Over the lan you probably don't need ssh encryption either. Rsh isn't very common anymore though.
Mark James
Personally I would leave gzip on: even over gigabit ethernet the bottleneck is very unlikely to be the CPU.
Benji XVI
A: 

You can try the following, maybe in batches of files (a rough sketch follows the list):

  • tar the batch of files
  • gzip them
  • copy using scp if possible
  • gunzip
  • untar the files
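
A minimal sketch of that pipeline, assuming bash and placeholder paths and host names (batching by subsets of files is left out for brevity):

  tar cf batch.tar /data/bigdir                                      # 1. tar the files
  gzip batch.tar                                                     # 2. gzip -> batch.tar.gz
  scp batch.tar.gz host2:/data/                                      # 3. copy with scp
  ssh host2 "cd /data && gunzip batch.tar.gz && tar xf batch.tar"    # 4./5. gunzip and untar on the far side

On a gigabit LAN the separate gzip/gunzip steps may not buy much; the streamed tar-over-ssh variant above also avoids writing the intermediate archive to disk at all.
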
kal
A: 

As suggested by sth you could try tar over ssh.

If you do not require encryption (originally you used rsync, but didn't mention it was rsync+ssh), you could try tar over netcat to avoid the ssh overhead.

Of course, you can also shorten the time it takes by using gzip or another compression method.
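
A rough sketch of the netcat variant (port number, paths, and host are arbitrary placeholders, and nc option syntax differs between implementations - some want -p for the listen port, some don't):

  # on host2 (receiver), start listening first:
  nc -l -p 7000 | tar xf - -C /data/target
  # on the source machine:
  tar cf - -C /data/bigdir . | nc host2 7000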

mgv
+2  A: 

You know, I plus-1'd the tar solution, but - depending on the environment - there's one other idea that occurs. You might think about using dd(1). The speed issue with something like this is that it takes many head motions to open and close a file, which you'll be doing five million times. If you could ensure that the files are laid out contiguously, you could dd them instead, which would cut the number of head motions by a factor of 5 or more.
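
Taken to the extreme, that means copying at the block-device level instead of per file. A very rough sketch, assuming the data sits on its own partition (/dev/sdb1 here is hypothetical), the target partition is at least as large, and neither filesystem is mounted during the copy:

  dd if=/dev/sdb1 bs=64M | ssh host2 "dd of=/dev/sdc1 bs=64M"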

Charlie Martin
+1  A: 

I know this may be stupid - but have you thought of just copying them onto an external disk and carrying it over to the other server? It may actually be the most efficient and simple solution.

Elijah
A: 

Already tons of good suggestions, but wanted to throw in Beyond Compare. I recently transferred about 750,000 files between 5KB and 20MB from one server to another over a gigabit switch. It didn't even hiccup at all. Granted it took a while, but I'd expect that with so much data.

DavGarcia
A: 

In a similar situation, I tried using tar to batch up the files. I wrote a tiny script to pipe the output of the tar command across to the target machine, directly into a receiving tar process which unbundled the files.

The tar approach almost doubled the rate of transfer compared to scp or rsync (YMMV).

Here are the tar commands. Note that you’ll need to enable r-commands by creating .rhosts files in the home directories on each machine (remove these after the copy is complete - they are notorious security problems). Note also that, as usual, HP-UX is awkward - whereas the rest of the world uses ‘rsh’ for the remote-shell command, HP-UX uses ‘remsh’. ‘rsh’ is some kind of restricted shell in HP parlance.

box1> cd source_directory; tar cf - . | remsh box2 "cd target_directory; tar xf - "

The first tar command creates a file called ‘-’, which is a special token meaning ‘standard output’ in this case. The archive created contains all the files in the current directory (.) plus all subdirectories (tar is recursive by default). This archive is piped into the remsh command, which sends it to the box2 machine. On box2 I first change to the proper receiving directory, then I extract the incoming files from ‘-’, or ‘standard input’.

I had 6 of these tar commands running simultaneously to ensure the network link was saturated with data, although I suspect that disk access may have been the limiting factor.
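
A sketch of how several parallel streams might be launched, assuming the source happens to be split across top-level subdirectories (dir1 through dir6 are placeholders):

  cd source_directory
  for d in dir1 dir2 dir3 dir4 dir5 dir6; do
      tar cf - "$d" | remsh box2 "cd target_directory; tar xf -" &    # one stream per subdirectory
  done
  wait    # block until all background transfers have finished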

dr-jan
A: 

Super Flexible may work for you as well.

bbqchickenrobot
+1  A: 

We are investigating this issue currently. We need to transfer about 18 million small files - about 200GB total. We achieved the best performance using plain old XCopy, but it still took a LONG time - about 3 days from one server to another, and about 2 weeks to an external drive!

As part of another project, we needed to duplicate the server. This was done with Acronis. It took about 3 hours!!!

We will be investigating this some more. The dd suggestion above would probably provide similar results.

Ruz