views:

796

answers:

13

As a maintenance issue I need to routinely (3-5 times per year) copy a repository that now has over 20 million files and exceeds 1.5 terabytes in total disk space. I am currently using RICHCOPY, but have tried others. RICHCOPY seems the fastest, but I do not believe I am getting close to the limits of the capabilities of my XP machine.

I am toying around with using what I have read in The Art of Assembly Language to write a program to copy my files. My other thought is to start learning how to multi-thread in Python to do the copies.

I am toying around with the idea of doing this in Assembly because it seems interesting, but while my time is not incredibly precious it is precious enough that I am trying to get a sense of whether or not I will see significant enough gains in copy speed. I am assuming that I would, but I only started really learning to program 18 months ago and it is still more or less a hobby. Thus I may be missing some fundamental concept of what happens with interpreted languages.

Any observations or experiences would be appreciated. Note, I am not looking for any code. I have already written a basic copy program in Python 2.6 that is no slower than RICHCOPY. I am looking for observations on which approach will give me more speed. Right now it takes me over 50 hours to make a copy from a disk to a Drobo and then back from the Drobo to a disk. I have a LogicCube for when I am simply duplicating a disk, but sometimes I need to go from a disk to the Drobo or the reverse. I am thinking that given that I can sector-copy a 3/4-full 2 terabyte drive using the LogicCube in under seven hours, I should be able to get close to that using Assembly, but I don't know enough to know if this is valid. (Yes, sometimes ignorance is bliss.)

The reason I need to speed it up is I have had two or three cycles where something has happened during copy (fifty hours is a long time to expect the world to hold still) that has caused me to have to trash the copy and start over. For example, last week the water main broke under our building and shorted out the power.

Thanks for the early responses, but I don't think it is I/O limitations. I am not going over a network; the drive is plugged into my motherboard with a SATA connection and my Drobo is plugged into a FireWire port. My thinking is that both connections should allow faster transfer.

Actually I can't use a sector copy except going from a single disk to the Drobo. It won't work the other way since the Drobo file structure is a mystery. My unscientific observation is that the copy from one internal disk to another is no faster than a copy to or from the Drobo to an internal disk.

I am bound by the hardware, I can't afford 10K rpm 2 terabyte drives (if they even make them).

A number of you are suggesting a file-syncing solution. But that does not solve my problem. First off, the file-syncing solutions I have played with build a map (for want of a better term) of the data first; I have too many little files, so they choke. One of the reasons I use RICHCOPY is that it starts copying immediately; it does not use memory to build a map. Second, I had one of my three Drobo backups fail a couple of weeks ago. My rule is that if I have a backup failure, the other two have to stay offline until the new one is built. So I need to copy from one of the three single-drive backup copies I have that I use with the LogicCube.

At the end of the day I have to have a good copy on a single drive because that is what I deliver to my clients. Because my clients have diverse systems I deliver to them on SATA drives.

I rent some cloud space from someone where my data is also stored as the deepest backup, but it is expensive to pull it off of there.

+39  A: 

Copying files is an I/O bound process. It is unlikely that you will see any speed up from rewriting it in assembly, and even multithreading may just cause things to go slower as different threads requesting different files at the same time will result in more disk seeks.

Using a standard tool is probably the best way to go here. If there is anything to optimize, you might want to consider changing your file system or your hardware.

Mark Byers
+1 to minimising disk seeks
gnibbler
+4  A: 

I don't believe it will make a discernible difference which language you use for this purpose. The bottleneck here is not your application but the disk performance.

Just because a language is interpreted, it doesn't mean that every single operation in it is slow. As an example, it's a fairly safe bet that the lower-level code in Python will call assembly (or compiled) code to do copying.

Similarly, when you do stuff with collections and other libraries in Java, that's mostly compiled C, not interpreted Java.

There are a couple of things you can do to possibly speed up the process.

  • Buy faster hard disks (10K RPM rather than 7.5K, lower latency, larger caches and so forth).
  • Copying between two physical disks may be faster than copying on a single disk (due to the head movement).
  • If you're copying across the network, stage it. In other words, copy it fast to another local disk, then slow from there across the net.
  • You can also stage it in a different way. If you run a nightly (or even weekly) process to keep the copy up to date (only copying changed files) rather than three times a year, you won't find yourself in a situation where you have to copy a massive amount (a rough sketch of this idea follows the list).
  • Also if you're using the network, run it on the box where the repository is. You don't want to copy all the data from a remote disk to another PC then back to yet another remote disk.
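
For what it's worth, here is a minimal sketch of the "only copy what changed" idea in Python. The paths are illustrative, it compares only size and modification time, and a real tool (rsync, robocopy /MIR, and so on) would also handle deletions, permissions, and error recovery:

    # Sketch: walk the source tree and copy a file only if the destination
    # copy is missing or its size/mtime differ.
    import os
    import shutil

    def sync_tree(src_root, dst_root):
        for dirpath, dirnames, filenames in os.walk(src_root):
            rel = os.path.relpath(dirpath, src_root)
            dst_dir = os.path.join(dst_root, rel)
            if not os.path.isdir(dst_dir):
                os.makedirs(dst_dir)
            for name in filenames:
                src = os.path.join(dirpath, name)
                dst = os.path.join(dst_dir, name)
                s = os.stat(src)
                if (not os.path.exists(dst)
                        or os.path.getsize(dst) != s.st_size
                        or int(os.path.getmtime(dst)) != int(s.st_mtime)):
                    shutil.copy2(src, dst)   # copy2 also preserves timestamps

    sync_tree(r"D:\repository", r"E:\repository_copy")

Note that this still stats every one of the 20 million files on each run, so it does not remove the per-file overhead; it only avoids re-copying data that has not changed.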

You may also want to be careful with Python. I may be mistaken (and no doubt the Pythonistas will set me straight if I'm mistaken on this count) but I have a vague recollection that its threading may not fully utilise multi-core CPUs. In that case, you'd be better off with another solution.

You may well be better off sticking with your current solution. I suspect a specialised copy program will already be optimised as much as possible since that's what they do.

paxdiablo
@paxdiablo, CPython's threading (due to the Global Interpreter Lock aka GIL) doesn't fully utilize multicore CPUs _in CPU-bound tasks_ , but I/O primitives release the GIL, so I/O-bound tasks don't suffer this limitation (for CPU-bound tasks, use the `multiprocessing` module instead of the `threading` module, and all your cores will be fully exploited). None of these issues apply to non-CPython implementations, like IronPython on .NET or Jython on the JVM.
Alex Martelli
+5  A: 

I don't think writing it in assembly will help you. Writing a routine in assembly could help you if you are processor-bound and think you can do something smarter than your compiler. But in a network copy, you will be IO bound, so shaving a cycle here or there almost certainly will not make a difference.

I think the general rule here is that it's always best to profile your process to see where you are spending the time before thinking about optimizations.

dsolimano
+7  A: 

There are 2 places for slowdown:

  • Per-file copy is MUCH slower than a disk copy (where you literally clone 100% of each sector's data). Especially for 20mm files. You can't fix that one with the most tuned assembly, unless you switch from cloning files to cloning raw disk data. In the latter case, yes, Assembly is indeed your ticket (or C).

  • Simply storing 20mm files and recursively finding them may be less efficient in Python. But that's more likely a function of finding a better algorithm and is not likely to be significantly improved by Assembly. Plus, that will NOT be the main contributor to the 50 hrs.

In summary - Assembly WILL help if you do raw disk sector copy, but will NOT help if you do filesystem level copy.

DVK
How would assembly help if he does a raw sector copy? As with a "normal" copy operation, it would be just as fast when written in any other language.
jalf
@jalf I think it would help because there would not be any seeks: you start reading either at the outside or the inside (?) and just move the head across the disk.
PyNEwbie
@jalf - can you do raw sector copy in Python in the first place? If so, then yes, even in that case Assembly won't offer performance benefits.
DVK
@PyNEwbie - are you suggesting that you want an equivalent to `dd` to do a raw copy of the disk? Random access requires a lot of seeking... sequential access would bring you closer to sequential access on disk - but that's not even guaranteed...
Jeremy Brown
@DVK - not batteries included - but you could use ctypes to use the OS SCSI subsystem or Disk/Storage APIs. There would definitely be a scalar penalty though
Jeremy Brown
@DVK: I've no clue if Python can do it, but then you're conflating two entirely different things. If Python lacks a capability you need, then that is not a "performance issue". It is not that the operation is faster in assembly, but that Python does not enable you to perform the operation in the first place. And then assembly isn't the solution. C or C++ would be a much more reasonable choice then.
jalf
+2  A: 

There's no reason at all to write a copy program in assembly. The problem is with the amount of IO involved, not CPU. Also, the copy function in Python is already written in C by experts, and you won't eke out any more speed writing one yourself in assembler.

Lastly, threading won't help either, especially in Python. Go with either Twisted or just use the new multiprocessing module in Python 2.6 and kick off a pool of processes to do the copies. Save yourself a lot of torment while getting the job done.
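
For illustration only, a minimal sketch of that pool-of-processes idea with the 2.6 multiprocessing module; the file list and worker count are placeholders, and whether it helps at all depends on the disks, since more workers generally just means more seeking:

    # Sketch: hand (src, dst) pairs to a small pool of worker processes.
    import shutil
    from multiprocessing import Pool

    def copy_one(pair):
        src, dst = pair
        shutil.copy2(src, dst)
        return dst

    if __name__ == "__main__":
        jobs = [(r"D:\repo\a.dat", r"E:\copy\a.dat"),
                (r"D:\repo\b.dat", r"E:\copy\b.dat")]
        pool = Pool(processes=2)   # keep this small; disks hate parallel seeks
        for _ in pool.imap_unordered(copy_one, jobs):
            pass
        pool.close()
        pool.join()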

Khorkrak
+1  A: 

Right, here the bottleneck is not in the execution of the copying software itself but rather the disk access.

Going lower level does not mean that you will have better performance. Take the simple example of the open() and fopen() APIs: open() is much lower level and more direct, while fopen() is a library wrapper around the system open() function.

But in reality fopen() often has better performance, because it adds buffering and optimizes a lot of things that the raw open() call does not.
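
The same effect is easy to demonstrate from Python: the buffer size fed to the copy loop usually matters far more than how "low level" the code is. A rough sketch (the file names and the 16 MB figure are just examples to tune):

    # Sketch: identical copy loops; only the buffer size differs.  On big
    # files the large-buffer call is typically much faster because it makes
    # far fewer read/write calls and streams more sequentially.
    import shutil

    def copy_with_buffer(src, dst, bufsize):
        fsrc = open(src, "rb")
        fdst = open(dst, "wb")
        try:
            shutil.copyfileobj(fsrc, fdst, bufsize)
        finally:
            fsrc.close()
            fdst.close()

    copy_with_buffer(r"D:\big.bin", r"E:\copy_a.bin", 4 * 1024)           # 4 KB chunks
    copy_with_buffer(r"D:\big.bin", r"E:\copy_b.bin", 16 * 1024 * 1024)   # 16 MB chunks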

Implementing such optimizations at the assembly level is much harder, and much less productive, than doing it in Python.

Karim
+2  A: 

Before you question the copying app, you should most likely question the data path. What are the theoretical limits and what are you achieving? What are the potential bottlenecks? If there is a single data path, you are probably not going to get a significant boost by parallelizing storage tasks. You may even exacerbate it. Most of the benefits you'll get with asynchronous I/O come at the block level - a level lower than the file system.

One thing you could do to boost I/O is decouple the fetch from source and store to destination portions. Assuming that the source and destination are separate entities, you could theoretically halve the amount of time for the process. But are the standard tools already doing this??
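
A minimal sketch of that decoupling in Python: one thread reads from the source and one writes to the destination, connected by a bounded queue. File names and chunk size are illustrative; since the read and write calls release the GIL, plain threads are enough for this:

    # Sketch: overlap reading from the source with writing to the destination.
    # The bounded queue caps memory use; None marks end of stream.
    import threading
    import Queue   # the module is named "queue" on Python 3

    CHUNK = 8 * 1024 * 1024

    def reader(src_path, q):
        f = open(src_path, "rb")
        try:
            while True:
                block = f.read(CHUNK)
                if not block:
                    break
                q.put(block)
        finally:
            f.close()
            q.put(None)

    def writer(dst_path, q):
        f = open(dst_path, "wb")
        try:
            while True:
                block = q.get()
                if block is None:
                    break
                f.write(block)
        finally:
            f.close()

    q = Queue.Queue(maxsize=4)
    t_read = threading.Thread(target=reader, args=(r"D:\big.bin", q))
    t_write = threading.Thread(target=writer, args=(r"E:\big.bin", q))
    t_read.start(); t_write.start()
    t_read.join(); t_write.join()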

Oh - and on Python and the GIL - with I/O-bound execution, the GIL is really not quite that bad of a penalty.

Jeremy Brown
+6  A: 

As the other answers mention (+1 to Mark), when copying files, disk I/O is the bottleneck. The language you use won't make much of a difference. How you've laid out your files will make a difference; how you're transferring data will make a difference.

You mentioned copying to a DROBO. How is your DROBO connected? Check out this graph of connection speeds.

Let's look at the max copy rates you can get over certain wire types (the arithmetic is sketched in code after the list):

  • USB = 97 days (1.5 TB / 1.5 Mbps). Lame, at least your performance is not this bad.
  • USB2.0 = ~7hrs (1.5 TB / 480 Mbps). Maybe LogicCube?
  • Fast SCSI = ~40hrs (1.5 TB / 80 Mbps). Maybe your hard drive speed?
  • 100 Mbps ethernet = 1.4 days (1.5 TB / 100 Mbps).
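
Here is the arithmetic behind those estimates, if you want to plug in your own numbers; it is only a rough sketch that ignores protocol overhead and uses decimal terabytes:

    # Rough transfer-time estimate: terabytes over a link of a given Mbit/s.
    def transfer_hours(tb, mbps):
        bits = tb * 1e12 * 8          # decimal TB -> bits
        return bits / (mbps * 1e6) / 3600.0

    print(transfer_hours(1.5, 480))   # USB 2.0   -> ~7 hours
    print(transfer_hours(1.5, 80))    # Fast SCSI -> ~42 hours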

So, depending on the constraints of your problem, it's possible you can't do better. But you may want to start doing a raw disk copy (like Unix's dd), which should be much faster than a file-system level copy (it's faster because there are no random disk seeks for directory walks or fragmented files).

To use dd, you could live-boot Linux onto your machine (or maybe use Cygwin?). See this page for reference, or this one about backing up from Windows using a live boot of Ubuntu.

If you were to organize your 1.5 TB data on a RAID, you could probably speed up the copy (because the disks will be reading in parallel), and (depending on the configuration) it'll have the added benefit of protecting you from drive failures.

Stephen
Not many organizations like to take front-line storage offline for copying. `dd` would probably be a stretch :-/ Time for some requirements gathering! hehe
Jeremy Brown
@Jeremy - True :) At any rate, I'm sure the guys at serverfault will have plenty of suggestions!
Stephen
+1  A: 

1.5 TB in approximately 50 hours gives a throughput of (1.5 * 1024^2) MB / (50 * 60^2) s = 8.7 MB/s. A theoretical 100 Mbit/s bandwidth should give you 12.5 MB/s. It seems to me that your FireWire connection is a problem. You should look at upgrading drivers, or upgrading to a better FireWire/eSATA/USB interface.

That said, rather than the Python/assembly question, you should look at acquiring a file-syncing solution. It shouldn't be necessary to copy that data over and over again.

k_b
File sync won't help if the problem is to make fresh copies on fresh hard drives to deliver to clients.
Norman Ramsey
+2  A: 

RICHCOPY is already copying files in parallel, and I expect the only way to beat it is to get in bed with the filesystem so that you minimize disk I/O, especially seeking. I suggest you try ntfsclone to see if it meets your needs. If not, my next suggestion would be to parallelize ntfsclone.

In any case, working directly with filesystem layout on disk is going to be easiest in C, not Python and certainly not assembly. Especially since you can get started by using the C code from the NTFS 3G project. This code is designed for reliability and ease of porting, not performance, but it's still probably the easiest way to get started.

"My time is precious enough that I am trying to get a sense of whether or not I will see significant enough gains in copy speed."

No. Or more accurately, at your current level of mastery of systems programming, achieving significant improvements in speed will be prohibitively expensive. What you're asking for requires very specialized expertise. Although I myself have prior experience in implementing filesystems (much simpler ones than NTFS, XFS, or ext2), I would not tackle this job; I would hire it done.


Footnote: if you have access to a Linux box, find out what raw write bandwidth you can get to the target drive:

time dd if=/dev/zero of=/dev/sdc bs=1024k count=100

will give you the time to write 100MB sequentially in the fastest possible way. That will give you an absolute limit on what is possible with your hardware. Don't try this without understanding the man page for dd! dd stands for "destroy data". (Actually it stands for "copy and convert", but cc was taken.)

A Windows programmer can probably point you to an equivalent test for Windows.
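
In the meantime, a rough stand-in can be scripted in Python on the Windows box itself. This is only a sketch: it times a buffered 100 MB sequential write to a scratch file on the drive under test, so it measures filesystem write speed rather than raw device bandwidth the way dd on the raw device does.

    # Sketch: time a 100 MB sequential write to the target drive.
    import os
    import time

    path = r"E:\throughput_test.bin"     # scratch file on the drive under test
    block = b"\0" * (1024 * 1024)        # 1 MB of zeros
    count = 100

    start = time.time()
    f = open(path, "wb")
    for _ in range(count):
        f.write(block)
    f.flush()
    os.fsync(f.fileno())                 # push the data out to the drive
    f.close()
    elapsed = time.time() - start

    print("%.1f MB/s" % (count / elapsed))
    os.remove(path)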

Norman Ramsey
A: 

As already said, it is not the language here that makes the difference; assembly could be cool or fast for computations, but when the processor has to "speak" to peripherals, the limit is set by those peripherals. In this case the speed is set by your hard disk speed, which is a limit you can hardly change without changing your hd and waiting for better hds in the future, but also by the way data are organized on the disk, i.e. by the filesystem. AFAIK, the most-used filesystems are not optimized to handle tons of "small" files quickly; rather, they are optimized to hold "few" huge files.

So, changing the filesystem you're using could increase your copy speed, as long as it is more suitable to your case (and of course hd limits still apply!). If you want to "taste" the real limit of your hd, you should try a copy "sector by sector", replicating the exact image of your source hd to the dest hd. (But this option has some points to be aware of.)

ShinTakezou
A: 

Since I posted the question I have been playing around with some things, and I think, first off (not to be argumentative), that those of you who have been posting the response that I am I/O bound are only partially correct. It is seek time that is the constraint. Long story short: to test various options I built a new machine with an i7 processor and a reasonably powerful/functional motherboard, and then, using the same two drives I was working with before, I noted a fairly significant increase in speed. I also noted that when I am moving big files (one gigabyte or so) I get sustained transfer speeds in excess of 50 MB/s, and the speed drops significantly when moving small files. I think the speed difference is due to an unordered disk relative to the way the copy program reads the directory structure to determine the files to copy.

What I think needs to be done is:

  1. Read the MFT and sort by sector, working from the outside to the inside of the platter (it means I have to figure out how multi-platter disks work).

  2. Analyze and separate all contiguous versus non-contiguous files. I would handle the contiguous files first and go back to handle the non-contiguous files.

  3. Start copying the contiguous files from the outside to the inside.

  4. When finished, copy the non-contiguous files; by default they will end up on the inner rings of the platter(s) and they will be contiguous. (I want to note that I do regularly defragment and less than 1% of my files/directories are fragmented, but 1% of 20 million is still 200K.)

Why is this better than just running a copy program?

  1. When running a copy program, the program is going to use some internal ordering mechanism to determine the copy order. Windows uses alphabetical order (more or less); I imagine others do something similar, but that order may not (in my case probably does not) conform to the way the files were initially laid out on the disk, which is what I believe is the biggest factor that affects copy speed.

  2. The problem with a sector copy is that it does not fix anything, and so when I migrate across disk sizes and add data I end up with new problems to handle.

  3. If I do this right I should be able to check file headers and the eof record and do some housekeeping. CHKDSK is a great program but kind of dumb. When I do get file/folder corruption it is really hard to identify what was lost, by building my own copy program I could include a maintenance cycle that I could invoke when I want to run some tests on the files during copying. This might slow it up some but I don't think very much because the CPU is going to move the files much faster than they can get pulled or written. And even if it slows it up some when being run, at least I get some control (maybe understanding is a better word) of the problems that will invariably crop up in an imperfect world.

I may not have to do this in Assembly; I have been looking around for ways to play with (read) the MFT, and there are even Python tools for this; see http://www.integriography.com
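
To make the ordering idea concrete, here is a rough sketch of only the scheduling part. first_lcn() and is_fragmented() are hypothetical helpers: in practice they would come from one of those MFT-parsing tools, or from DeviceIoControl with FSCTL_GET_RETRIEVAL_POINTERS via ctypes; nothing here actually reads the MFT.

    # Sketch of the copy ordering only: contiguous files first, sorted by
    # their starting cluster so the source head sweeps outside-to-inside,
    # fragmented files afterwards.
    import os
    import shutil

    def ordered_copy(all_files, src_root, dst_root, first_lcn, is_fragmented):
        contiguous = [p for p in all_files if not is_fragmented(p)]
        fragmented = [p for p in all_files if is_fragmented(p)]
        contiguous.sort(key=first_lcn)       # on-disk order, not alphabetical

        for src in contiguous + fragmented:
            dst = os.path.join(dst_root, os.path.relpath(src, src_root))
            d = os.path.dirname(dst)
            if not os.path.isdir(d):
                os.makedirs(d)
            shutil.copy2(src, dst)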

PyNEwbie
A: 

Neither. If you want to take advantage of OS features to speed up I/O, you'll need to use some specialized system calls that are most easily accessed in C (or C++). You don't need to know a lot of C to write such a program, but you really need to know the system call interfaces.

In all likelihood, you can solve the problem without writing any code by using an existing tool or tuning the operating system, but if you really do need to write a tool, C is the most straightforward way to do it.
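
To give a flavour of what those system-call interfaces look like, here is a minimal sketch that hands a single copy to the Win32 CopyFileW call. It is reachable even from Python through ctypes, though a serious tool written in C would move on to CopyFileEx or unbuffered CreateFile I/O with real error handling; the paths are illustrative.

    # Sketch: let Windows do the copy via the Win32 API (Windows only).
    # CopyFileW(existing, new, fail_if_exists) returns nonzero on success.
    import ctypes

    def win_copy(src, dst):
        # unicode() is the Python 2 spelling; on Python 3 plain str is already Unicode.
        ok = ctypes.windll.kernel32.CopyFileW(unicode(src), unicode(dst), False)
        if not ok:
            raise ctypes.WinError()          # includes GetLastError() details

    win_copy(r"D:\repo\file.bin", r"E:\copy\file.bin")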

Chris