views:

796

answers:

13

As a maintenance issue I need to routinely (3-5 times per year) copy a repository that now has over 20 million files and exceeds 1.5 terabytes in total disk space. I am currently using RICHCOPY, but have tried others. RICHCOPY seems the fastest, but I do not believe I am getting close to the limits of the capabilities of my XP machine.

I am toying around with using what I have read in The Art of Assembly Language to write a program to copy my files. My other thought is to start learning how to multi-thread in Python to do the copies.

I am toying around with the idea of doing this in Assembly because it seems interesting, but while my time is not incredibly precious it is precious enough that I am trying to get a sense of whether or not I will see significant enough gains in copy speed. I am assuming that I would, but I only started really learning to program 18 months ago and it is still more or less a hobby. Thus I may be missing some fundamental concept of what happens with interpreted languages.

Any observations or experiences would be appreciated. Note, I am not looking for any code. I have already written a basic copy program in Python 2.6 that is no slower than RICHCOPY. I am looking for observations on which approach will give me more speed. Right now it takes me over 50 hours to make a copy from a disk to a Drobo and then back from the Drobo to a disk. I have a LogicCube for when I am simply duplicating a disk, but sometimes I need to go from a disk to the Drobo or the reverse. I am thinking that given that I can sector-copy a 3/4-full 2 terabyte drive using the LogicCube in under seven hours, I should be able to get close to that using Assembly, but I don't know enough to know if this is valid. (Yes, sometimes ignorance is bliss.)

The reason I need to speed it up is I have had two or three cycles where something has happened during copy (fifty hours is a long time to expect the world to hold still) that has caused me to have to trash the copy and start over. For example, last week the water main broke under our building and shorted out the power.

Thanks for the early responses, but I don't think it is I/O limitations. I am not going over a network; the drive is plugged into my motherboard with a SATA connection and my Drobo is plugged into a FireWire port. My thinking is that both connections should allow faster transfer.

Actually I can't use a sector copy except going from a single disk to the Drobo. It won't work the other way since the Drobo file structure is a mystery. My unscientific observation is that the copy from one internal disk to another is no faster than a copy to or from the Drobo to an internal disk.

I am bound by the hardware, I can't afford 10K rpm 2 terabyte drives (if they even make them).

A number of you are suggesting a file-syncing solution. But that does not solve my problem. First off, the file-syncing solutions I have played with build a map (for want of a better term) of the data first; I have too many little files, so they choke. One of the reasons I use RICHCOPY is that it starts copying immediately; it does not use memory to build a map. Second, I had one of my three Drobo backups fail a couple of weeks ago. My rule is that if I have a backup failure, the other two have to stay offline until the new one is built. So I need to copy from one of the three single-drive backup copies I have that I use with the LogicCube.

At the end of the day I have to have a good copy on a single drive because that is what I deliver to my clients. Because my clients have diverse systems I deliver to them on SATA drives.

I rent some cloud space from someone where my data is also stored as the deepest backup, but it is expensive to pull it off of there.

+39  A: 

Copying files is an I/O bound process. It is unlikely that you will see any speed up from rewriting it in assembly, and even multithreading may just cause things to go slower as different threads requesting different files at the same time will result in more disk seeks.

Using a standard tool is probably the best way to go here. If there is anything to optimize, you might want to consider changing your file system or your hardware.

Mark Byers
+1 to minimising disk seeks
gnibbler
+4  A: 

I don't believe it will make a discernible difference which language you use for this purpose. The bottleneck here is not your application but the disk performance.

Just because a language is interpreted, it doesn't mean that every single operation in it is slow. As an example, it's a fairly safe bet that the lower-level code in Python will call assembly (or compiled) code to do copying.

Similarly, when you do stuff with collections and other libraries in Java, that's mostly compiled C, not interpreted Java.

There are a couple of things you can do to possibly speed up the process.

  • Buy faster hard disks (10K RPM rather than 7.5K, lower latency, larger caches and so forth).
  • Copying between two physical disks may be faster than copying on a single disk (due to the head movement).
  • If you're copying across the network, stage it. In other words, copy it fast to another local disk, then slow from there across the net.
  • You can also stage it in a different way. If you run a nightly (or even weekly) process to keep the copy up to date (only copying changed files) rather than three times a year, you won't find yourself in a situation where you have to copy a massive amount (a rough sketch of this idea follows the list).
  • Also if you're using the network, run it on the box where the repository is. You don't want to copy all the data from a remote disk to another PC then back to yet another remote disk.
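
For what it's worth, here is a minimal sketch of the "only copy what changed" idea in Python. The paths are illustrative, it compares only size and modification time, and a real tool (rsync, robocopy /MIR, and so on) would also handle deletions, permissions, and error recovery:

    # Sketch: walk the source tree and copy a file only if the destination
    # copy is missing or its size/mtime differ.
    import os
    import shutil

    def sync_tree(src_root, dst_root):
        for dirpath, dirnames, filenames in os.walk(src_root):
            rel = os.path.relpath(dirpath, src_root)
            dst_dir = os.path.join(dst_root, rel)
            if not os.path.isdir(dst_dir):
                os.makedirs(dst_dir)
            for name in filenames:
                src = os.path.join(dirpath, name)
                dst = os.path.join(dst_dir, name)
                s = os.stat(src)
                if (not os.path.exists(dst)
                        or os.path.getsize(dst) != s.st_size
                        or int(os.path.getmtime(dst)) != int(s.st_mtime)):
                    shutil.copy2(src, dst)   # copy2 also preserves timestamps

    sync_tree(r"D:\repository", r"E:\repository_copy")

Note that this still stats every one of the 20 million files on each run, so it does not remove the per-file overhead; it only avoids re-copying data that has not changed.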

You may also want to be careful with Python. I may be mistaken (and no doubt the Pythonistas will set me straight if I'm mistaken on this count) but I have a vague recollection that its threading may not fully utilise multi-core CPUs. In that case, you'd be better off with another solution.

You may well be better off sticking with your current solution. I suspect a specialised copy program will already be optimised as much as possible since that's what they do.

paxdiablo
@paxdiablo, CPython's threading (due to the Global Interpreter Lock aka GIL) doesn't fully utilize multicore CPUs _in CPU-bound tasks_ , but I/O primitives release the GIL, so I/O-bound tasks don't suffer this limitation (for CPU-bound tasks, use the `multiprocessing` module instead of the `threading` module, and all your cores will be fully exploited). None of these issues apply to non-CPython implementations, like IronPython on .NET or Jython on the JVM.
Alex Martelli
+5  A: 

I don't think writing it in assembly will help you. Writing a routine in assembly could help you if you are processor-bound and think you can do something smarter than your compiler. But in a network copy, you will be IO bound, so shaving a cycle here or there almost certainly will not make a difference.

I think the general rule here is that it's always best to profile your process to see where you are spending the time before thinking about optimizations.

dsolimano
+7  A: 

There are 2 places for slowdown:

  • Per-file copy is MUCH slower than a disk copy (where you literally clone 100% of each sector's data). Especially for 20mm files. You can't fix that one with the most tuned assembly, unless you switch from cloning files to cloning raw disk data. In the latter case, yes, Assembly is indeed your ticket (or C).

  • Simply storing 20mm files and recursively finding them may be less efficient in Python. But that's more likely a function of finding a better algorithm and is not likely to be significantly improved by Assembly. Plus, that will NOT be the main contributor to the 50 hrs.

In summary - Assembly WILL help if you do raw disk sector copy, but will NOT help if you do filesystem level copy.

DVK
How would assembly help if he does a raw sector copy? As with a "normal" copy operation, it would be just as fast when written in any other language.
jalf
@jalf I think it would help because there would not be any seeks: you start reading either at the outside or the inside (?) and just move the head across the disk.
PyNEwbie
@jalf - can you do raw sector copy in Python in the first place? If so, then yes, even in that case Assembly won't offer performance benefits.
DVK
@PyNEwbie - are you suggesting that you want an equivalent to `dd` to do a raw copy of the disk? Random access requires a lot of seeking... sequential access would bring you closer to sequential access on disk - but that's not even guaranteed...
Jeremy Brown
@DVK - not batteries included - but you could use ctypes to use the OS SCSI subsystem or Disk/Storage APIs. There would definitely be a scalar penalty though
Jeremy Brown
@DVK: I've no clue if Python can do it, but then you're conflating two entirely different things. If Python lacks a capability you need, then that is not a "performance issue". It is not that the operation is faster in assembly, but that Python does not enable you to perform the operation in the first place. And then assembly isn't the solution. C or C++ would be a much more reasonable choice then.
jalf
+2  A: 

There's no reason at all to write a copy program in assembly. The problem is with the amount of IO involved, not CPU. Also, the copy function in Python is already written in C by experts, and you won't eke out any more speed writing one yourself in assembler.

Lastly, threading won't help either, especially in Python. Go with either Twisted or just use the new multiprocessing module in Python 2.6 and kick off a pool of processes to do the copies. Save yourself a lot of torment while getting the job done.
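
For illustration only, a minimal sketch of that pool-of-processes idea with the 2.6 multiprocessing module; the file list and worker count are placeholders, and whether it helps at all depends on the disks, since more workers generally just means more seeking:

    # Sketch: hand (src, dst) pairs to a small pool of worker processes.
    import shutil
    from multiprocessing import Pool

    def copy_one(pair):
        src, dst = pair
        shutil.copy2(src, dst)
        return dst

    if __name__ == "__main__":
        jobs = [(r"D:\repo\a.dat", r"E:\copy\a.dat"),
                (r"D:\repo\b.dat", r"E:\copy\b.dat")]
        pool = Pool(processes=2)   # keep this small; disks hate parallel seeks
        for _ in pool.imap_unordered(copy_one, jobs):
            pass
        pool.close()
        pool.join()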

Khorkrak
+1  A: 

Right, here the bottleneck is not in the execution of the copying software itself but rather the disk access.

Going lower level does not mean that you will have better performance. Take the simple example of the open() and fopen() APIs: open() is much lower level and more direct, while fopen() is a library wrapper around the system open() function.

But in reality fopen() often has better performance, because it adds buffering and optimizes a lot of things that the raw open() call does not.
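
The same effect is easy to demonstrate from Python: the buffer size fed to the copy loop usually matters far more than how "low level" the code is. A rough sketch (the file names and the 16 MB figure are just examples to tune):

    # Sketch: identical copy loops; only the buffer size differs.  On big
    # files the large-buffer call is typically much faster because it makes
    # far fewer read/write calls and streams more sequentially.
    import shutil

    def copy_with_buffer(src, dst, bufsize):
        fsrc = open(src, "rb")
        fdst = open(dst, "wb")
        try:
            shutil.copyfileobj(fsrc, fdst, bufsize)
        finally:
            fsrc.close()
            fdst.close()

    copy_with_buffer(r"D:\big.bin", r"E:\copy_a.bin", 4 * 1024)           # 4 KB chunks
    copy_with_buffer(r"D:\big.bin", r"E:\copy_b.bin", 16 * 1024 * 1024)   # 16 MB chunks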

Implementing such optimizations at the assembly level is much harder, and much less productive, than doing it in Python.

Karim
+2  A: 

Before you question the copying app, you should most likely question the data path. What are the theoretical limits and what are you achieving? What are the potential bottlenecks? If there is a single data path, you are probably not going to get a significant boost by parallelizing storage tasks. You may even exacerbate it. Most of the benefits you'll get with asynchronous I/O come at the block level - a level lower than the file system.

One thing you could do to boost I/O is decouple the fetch from source and store to destination portions. Assuming that the source and destination are separate entities, you could theoretically halve the amount of time for the process. But are the standard tools already doing this??
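
A minimal sketch of that decoupling in Python: one thread reads from the source and one writes to the destination, connected by a bounded queue. File names and chunk size are illustrative; since the read and write calls release the GIL, plain threads are enough for this:

    # Sketch: overlap reading from the source with writing to the destination.
    # The bounded queue caps memory use; None marks end of stream.
    import threading
    import Queue   # the module is named "queue" on Python 3

    CHUNK = 8 * 1024 * 1024

    def reader(src_path, q):
        f = open(src_path, "rb")
        try:
            while True:
                block = f.read(CHUNK)
                if not block:
                    break
                q.put(block)
        finally:
            f.close()
            q.put(None)

    def writer(dst_path, q):
        f = open(dst_path, "wb")
        try:
            while True:
                block = q.get()
                if block is None:
                    break
                f.write(block)
        finally:
            f.close()

    q = Queue.Queue(maxsize=4)
    t_read = threading.Thread(target=reader, args=(r"D:\big.bin", q))
    t_write = threading.Thread(target=writer, args=(r"E:\big.bin", q))
    t_read.start(); t_write.start()
    t_read.join(); t_write.join()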

Oh - and on Python and the GIL - with I/O-bound execution, the GIL is really not quite that bad of a penalty.

Jeremy Brown
+6  A: 

As the other answers mention (+1 to Mark), when copying files, disk I/O is the bottleneck. The language you use won't make much of a difference. How you've laid out your files will make a difference; how you're transferring data will make a difference.

You mentioned copying to a DROBO. How is your DROBO connected? Check out this graph of connection speeds.

Let's look at the max copy rates you can get over certain wire types (the arithmetic is sketched in code after the list):

  • USB = 97 days (1.5 TB / 1.5 Mbps). Lame, at least your performance is not this bad.
  • USB2.0 = ~7hrs (1.5 TB / 480 Mbps). Maybe LogicCube?
  • Fast SCSI = ~40hrs (1.5 TB / 80 Mbps). Maybe your hard drive speed?
  • 100 Mbps ethernet = 1.4 days (1.5 TB / 100 Mbps).
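
Here is the arithmetic behind those estimates, if you want to plug in your own numbers; it is only a rough sketch that ignores protocol overhead and uses decimal terabytes:

    # Rough transfer-time estimate: terabytes over a link of a given Mbit/s.
    def transfer_hours(tb, mbps):
        bits = tb * 1e12 * 8          # decimal TB -> bits
        return bits / (mbps * 1e6) / 3600.0

    print(transfer_hours(1.5, 480))   # USB 2.0   -> ~7 hours
    print(transfer_hours(1.5, 80))    # Fast SCSI -> ~42 hours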

So, depending on the constraints of your problem, it's possible you can't do better. But you may want to start doing a raw disk copy (like Unix's dd), which should be much faster than a file-system level copy (it's faster because there are no random disk seeks for directory walks or fragmented files).

To use dd, you could live-boot Linux onto your machine (or maybe use Cygwin?). See this page for reference, or this one about backing up from Windows using a live boot of Ubuntu.

If you were to organize your 1.5 TB data on a RAID, you could probably speed up the copy (because the disks will be reading in parallel), and (depending on the configuration) it'll have the added benefit of protecting you from drive failures.

Stephen
Not many organizations like to take front-line storage offline for copying. `dd` would probably be a stretch :-/ Time for some requirements gathering! hehe
Jeremy Brown
@Jeremy - True :) At any rate, I'm sure the guys at serverfault will have plenty of suggestions!
Stephen
+1  A: 

1.5 TB in approximately 50 hours gives a throughput of (1.5 * 1024^2) MB / (50 * 60^2) s = 8.7 MB/s. A theoretical 100 Mbit/s bandwidth should give you 12.5 MB/s. It seems to me that your FireWire connection is a problem. You should look at upgrading drivers, or upgrading to a better FireWire/eSATA/USB interface.

That said, rather than the Python/assembly question, you should look at acquiring a file-syncing solution. It shouldn't be necessary to copy that data over and over again.

k_b
File sync won't help if the problem is to make fresh copies on fresh hard drives to deliver to clients.
Norman Ramsey
+2  A: 

RICHCOPY is already copying files in parallel, and I expect the only way to beat it is to get in bed with the filesystem so that you minimize disk I/O, especially seeking. I suggest you try ntfsclone to see if it meets your needs. If not, my next suggestion would be to parallelize ntfsclone.

In any case, working directly with filesystem layout on disk is going to be easiest in C, not Python and certainly not assembly. Especially since you can get started by using the C code from the NTFS 3G project. This code is designed for reliability and ease of porting, not performance, but it's still probably the easiest way to get started.

"My time is precious enough that I am trying to get a sense of whether or not I will see significant enough gains in copy speed."

No. Or more accurately, at your current level of mastery of systems programming, achieving significant improvements in speed will be prohibitively expensive. What you're asking for requires very specialized expertise. Although I myself have prior experience in implementing filesystems (much simpler ones than NTFS, XFS, or ext2), I would not tackle this job; I would hire it done.


Footnote: if you have access to a Linux box, find out what raw write bandwidth you can get to the target drive:

time dd if=/dev/zero of=/dev/sdc bs=1024k count=100

will give you the time to write 100MB sequentially in the fastest possible way. That will give you an absolute limit on what is possible with your hardware. Don't try this without understanding the man page for dd! dd stands for "destroy data". (Actually it stands for "copy and convert", but cc was taken.)

A Windows programmer can probably point you to an equivalent test for Windows.
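
In the meantime, a rough stand-in can be scripted in Python on the Windows box itself. This is only a sketch: it times a buffered 100 MB sequential write to a scratch file on the drive under test, so it measures filesystem write speed rather than raw device bandwidth the way dd on the raw device does.

    # Sketch: time a 100 MB sequential write to the target drive.
    import os
    import time

    path = r"E:\throughput_test.bin"     # scratch file on the drive under test
    block = b"\0" * (1024 * 1024)        # 1 MB of zeros
    count = 100

    start = time.time()
    f = open(path, "wb")
    for _ in range(count):
        f.write(block)
    f.flush()
    os.fsync(f.fileno())                 # push the data out to the drive
    f.close()
    elapsed = time.time() - start

    print("%.1f MB/s" % (count / elapsed))
    os.remove(path)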

Norman Ramsey
A: 

As already said, it is not the language here that makes the difference; assembly could be cool or fast for computations, but when the processor has to "speak" to peripherals, the limit is set by those peripherals. In this case the speed is set by your hard disk speed, which is a limit you can hardly change without changing your hd and waiting for better hds in the future, but also by the way data are organized on the disk, i.e. by the filesystem. AFAIK, the most-used filesystems are not optimized to handle tons of "small" files quickly; rather, they are optimized to hold "few" huge files.

So, changing the filesystem you're using could increase your copy speed, as long as it is more suitable to your case (and of course hd limits still apply!). If you want to "taste" the real limit of your hd, you should try a copy "sector by sector", replicating the exact image of your source hd to the dest hd. (But this option has some points to be aware of.)

ShinTakezou
A: 

Since I posted the question I have been playing around with some things, and I think, first off (not to be argumentative), that those of you who have been posting the response that I am I/O bound are only partially correct. It is seek time that is the constraint. Long story short: to test various options I built a new machine with an i7 processor and a reasonably powerful/functional motherboard, and then, using the same two drives I was working with before, I noted a fairly significant increase in speed. I also noted that when I am moving big files (one gigabyte or so) I get sustained transfer speeds in excess of 50 MB/s, and the speed drops significantly when moving small files. I think the speed difference is due to an unordered disk relative to the way the copy program reads the directory structure to determine the files to copy.

What I think needs to be done is:

  1. Read the MFT and sort by sector, working from the outside to the inside of the platter (it means I have to figure out how multi-platter disks work).

  2. Analyze and separate all contiguous versus non-contiguous files. I would handle the contiguous files first and go back to handle the non-contiguous files.

  3. Start copying the contiguous files from the outside to the inside.

  4. When finished, copy the non-contiguous files; by default they will end up on the inner rings of the platter(s) and they will be contiguous. (I want to note that I do regularly defragment and less than 1% of my files/directories are fragmented, but 1% of 20 million is still 200K.)

Why is this better than just running a copy program?

  1. When running a copy program, the program is going to use some internal ordering mechanism to determine the copy order. Windows uses alphabetical order (more or less); I imagine others do something similar, but that order may not (in my case probably does not) conform to the way the files were initially laid out on the disk, which is what I believe is the biggest factor that affects copy speed.

  2. The problem with a sector copy is that it does not fix anything, and so when I migrate across disk sizes and add data I end up with new problems to handle.

  3. If I do this right I should be able to check file headers and the eof record and do some housekeeping. CHKDSK is a great program but kind of dumb. When I do get file/folder corruption it is really hard to identify what was lost, by building my own copy program I could include a maintenance cycle that I could invoke when I want to run some tests on the files during copying. This might slow it up some but I don't think very much because the CPU is going to move the files much faster than they can get pulled or written. And even if it slows it up some when being run, at least I get some control (maybe understanding is a better word) of the problems that will invariably crop up in an imperfect world.

I may not have to do this in Assembly; I have been looking around for ways to play with (read) the MFT, and there are even Python tools for this; see http://www.integriography.com
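
To make the ordering idea concrete, here is a rough sketch of only the scheduling part. first_lcn() and is_fragmented() are hypothetical helpers: in practice they would come from one of those MFT-parsing tools, or from DeviceIoControl with FSCTL_GET_RETRIEVAL_POINTERS via ctypes; nothing here actually reads the MFT.

    # Sketch of the copy ordering only: contiguous files first, sorted by
    # their starting cluster so the source head sweeps outside-to-inside,
    # fragmented files afterwards.
    import os
    import shutil

    def ordered_copy(all_files, src_root, dst_root, first_lcn, is_fragmented):
        contiguous = [p for p in all_files if not is_fragmented(p)]
        fragmented = [p for p in all_files if is_fragmented(p)]
        contiguous.sort(key=first_lcn)       # on-disk order, not alphabetical

        for src in contiguous + fragmented:
            dst = os.path.join(dst_root, os.path.relpath(src, src_root))
            d = os.path.dirname(dst)
            if not os.path.isdir(d):
                os.makedirs(d)
            shutil.copy2(src, dst)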

PyNEwbie
A: 

Neither. If you want to take advantage of OS features to speed up I/O, you'll need to use some specialized system calls that are most easily accessed in C (or C++). You don't need to know a lot of C to write such a program, but you really need to know the system call interfaces.

In all likelihood, you can solve the problem without writing any code by using an existing tool or tuning the operating system, but if you really do need to write a tool, C is the most straightforward way to do it.
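
To give a flavour of what those system-call interfaces look like, here is a minimal sketch that hands a single copy to the Win32 CopyFileW call. It is reachable even from Python through ctypes, though a serious tool written in C would move on to CopyFileEx or unbuffered CreateFile I/O with real error handling; the paths are illustrative.

    # Sketch: let Windows do the copy via the Win32 API (Windows only).
    # CopyFileW(existing, new, fail_if_exists) returns nonzero on success.
    import ctypes

    def win_copy(src, dst):
        # unicode() is the Python 2 spelling; on Python 3 plain str is already Unicode.
        ok = ctypes.windll.kernel32.CopyFileW(unicode(src), unicode(dst), False)
        if not ok:
            raise ctypes.WinError()          # includes GetLastError() details

    win_copy(r"D:\repo\file.bin", r"E:\copy\file.bin")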

Chris