Question: Are there Windows API calls (perhaps NTFS only) that allow one to split a very large file into many others without actually copying any data (in other words, specifying the logical breakpoints between joined files, with file names and sizes)?

Examples: SetFileValidData, NtSetInformationFile

Scenario: I need to programmatically distribute/copy 10 GB of files from a non-local drive (including network, USB and DVD drives). This is made up of over 100,000 individual files with a median size of about 16 KB, joined into ~2 GB chunks.

However, using simple FileStream APIs (with a 64 KB buffer) to extract the files from the chunks on non-local drives into individual files on a local hard drive seems to be limited on my machine to about 4 MB/s, whereas copying the entire chunks using Explorer runs at over 80 MB/s!

It seems logical to copy entire chunks, but give Windows enough information to split the files logically (which in theory should be able to happen very, very fast).

Doesn't the Vista install do something like this?

A: 

Is there a reason you can't invoke the OS's copy routines to do the copying? That should do the same thing that Explorer does. It negates the need for your weird splitting thing, which I don't think exists.
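
For reference, a minimal sketch of handing the copy to the OS through pywin32 (the same library the script further down uses; the paths here are invented for illustration, and a Delphi or C++ program would call the CopyFile/CopyFileEx API directly):

import win32file

def os_copy(src, dst):
    # Let the OS perform the copy, as Explorer does; the final
    # argument of 1 means "fail if the destination already exists".
    win32file.CopyFile(src, dst, 1)

os_copy(r'\\server\share\chunk-001.bin', r'C:\staging\chunk-001.bin')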

rmeador
Direct OS CopyFile routines are slightly faster than my own routines at copying the 100,000 files, yet performance is still horrible (an order of magnitude slower) compared to copying the files merged together. Hence the desire to copy them merged, but split them after the copy.
tikinoa
+3  A: 

Although there are Volume Shadow Copies, these are an all-or-nothing approach - you can't cut out just part of a file. They are also only temporary. Likewise, hard links share all of their content, without exception. Unfortunately, cutting out just parts of a file is not supported on Windows, although some experimental Linux filesystems such as btrfs support it.

bdonlan
A: 
  • If you copy a large number of small files, it usually helps to temporarily disable your virus scanner until your copy operation is done.

  • There's a tool called TeraCopy that claims to solve your problem (seems to be made with Delphi): http://blog.codesector.com/category/code-sector-software/teracopy/ http://www.codesector.com/teracopy.php

  • There's another tool called FastCopy that might also work. This one can be called from the commandline, which could be an advantage if you need to integrate this with a Delphi program: http://www.ipmsg.org/tools/fastcopy.html.en

  • I've noticed that the direction in which you copy a large number of files sometimes makes a huge difference. It's a lot faster to push the files from the source machine than it is to initiate the copy from the destination machine.

  • If all else fails you could always write your own client-server application that streams files using your own optimized protocol: one app with read-from-disk/send-over-network threads, and one with receive-from-network/write-to-disk threads (a rough sketch follows).
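
For illustration, a rough sketch of the read-from-disk/send-over-network half in Python; the length-prefixed protocol, names and port are all invented, and the receiving side would mirror this with receive/write threads:

import socket
import struct
import threading
from Queue import Queue  # Python 2.x, matching the script further down

def read_from_disk(paths, q):
    # Producer thread: load the (small) files into memory as they are read.
    for path in paths:
        f = open(path, 'rb')
        try:
            q.put((path, f.read()))
        finally:
            f.close()
    q.put(None)  # sentinel: no more files

def send_over_network(sock, q):
    # Consumer: write length-prefixed (name, data) records to the socket.
    while True:
        item = q.get()
        if item is None:
            break
        name, data = item
        sock.sendall(struct.pack('!II', len(name), len(data)))
        sock.sendall(name)
        sock.sendall(data)

def send_files(paths, host, port):
    sock = socket.create_connection((host, port))
    q = Queue(maxsize=256)  # bounded, so memory use stays modest
    t = threading.Thread(target=read_from_disk, args=(paths, q))
    t.start()
    send_over_network(sock, q)
    t.join()
    sock.close()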

Wouter van Nifterick
So, is there a way to split a large file without copying it?
Rob Kennedy
I don't think there is, not even at the low disk/NTFS level - what you would need is a "companion file" full of pointers to parts of the "Giga-file", but with an average file size of circa 16 KB the pointer file would be quite large too!
Despatcher
@Rob: Others answered that already. I read the question and realised that the OP doesn't especially want to split a file without copying it. He wants to copy a large number of files fast. I thought it would be helpful to assist him in getting that job done.
Wouter van Nifterick
+2  A: 

You can't, in practice. The data has to physically move whenever a new boundary does not coincide with an existing cluster boundary.

For a high-speed copy, read the input file in asynchronously, break it up into your 16 KB segments, post those to an in-memory queue, and set up a thread pool to empty the queue by writing out the segments. At those sizes, the writes can probably be synchronous. Considering the speed of local and remote I/O, and the fact that you have multiple writer threads, the chance of your queue overflowing should be quite low.
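
A rough sketch of that pipeline (names and output layout invented; plain blocking reads stand in for truly asynchronous I/O):

import os
import threading
from Queue import Queue

SEGMENT = 16 * 1024  # the 16 KB pieces mentioned above

def read_segments(chunk_path, q, nwriters):
    # Reader: post (index, data) segments to the in-memory queue.
    src = open(chunk_path, 'rb')
    try:
        index = 0
        while True:
            data = src.read(SEGMENT)
            if not data:
                break
            q.put((index, data))
            index += 1
    finally:
        src.close()
    for _ in range(nwriters):
        q.put(None)  # one sentinel per writer thread

def write_segments(q, out_dir):
    # Writer: drain the queue with plain synchronous writes.
    while True:
        item = q.get()
        if item is None:
            break
        index, data = item
        out = open(os.path.join(out_dir, 'part-%06d' % index), 'wb')
        try:
            out.write(data)
        finally:
            out.close()

def split_chunk(chunk_path, out_dir, nwriters=4):
    q = Queue(maxsize=1024)  # bounded, so it cannot overflow memory
    pool = [threading.Thread(target=write_segments, args=(q, out_dir))
            for _ in range(nwriters)]
    for t in pool:
        t.start()
    read_segments(chunk_path, q, nwriters)
    for t in pool:
        t.join()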

MSalters
A: 

A thought on this: is there enough space to copy the large chunk to a local drive and then work on it as a memory-mapped file? I remember a discussion somewhere, sometime, that these files are very much faster since they use the Windows file/page cache and are easy to set up.

From Wikipedia and from StackOverflow
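
For illustration, a minimal sketch of slicing files out of a locally copied chunk through a memory mapping (the names and offsets are invented):

import mmap

def extract(chunk_path, start, length, out_path):
    f = open(chunk_path, 'rb')
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    out = open(out_path, 'wb')
    out.write(m[start:start + length])  # served from the Windows page cache
    out.close()
    m.close()
    f.close()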

Despatcher
A: 

Perhaps this technique would work for you: copy the large chunks (using the already-established efficient method), then use something like the following script to split the large chunks into smaller files locally.

from __future__ import division
import sys
from win32file import CreateFile, SetEndOfFile, GetFileSize, SetFilePointer, ReadFile, WriteFile
import win32con
from itertools import tee, izip, imap

def xfrange(start, stop=None, step=None):
    """
    Like xrange(), but yields floats instead

    All numbers are generated on demand using generators
    """
    if stop is None:
        stop = float(start)
        start = 0.0

    if step is None:
        step = 1.0

    cur = float(start)

    while cur < stop:
        yield cur
        cur += step


# from Python 2.6 docs
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def get_one_hundred_pieces(size):
    """
    Return start and stop extents for a file of the given size
    that will break the file into 100 pieces of approximately
    the same length.

    >>> res = list(get_one_hundred_pieces(205))
    >>> len(res)
    100
    >>> res[:3]
    [(0, 2), (2, 4), (4, 6)]
    >>> res[-3:]
    [(199, 201), (201, 203), (203, 205)]
    """
    step = size / 100
    cap = lambda pos: min(pos, size)
    approx_partitions = xfrange(0, size + step, step)
    int_partitions = imap(lambda n: int(round(n)), approx_partitions)
    partitions = imap(cap, int_partitions)
    return pairwise(partitions)

def save_file_bytes(handle, length, filename):
    hr, data = ReadFile(handle, length)
    assert len(data) == length, "%s != %s" % (len(data), length)
    h_dest = CreateFile(
        filename,
        win32con.GENERIC_WRITE,
        0,
        None,
        win32con.CREATE_NEW,
        0,
        None,
        )
    code, wbytes = WriteFile(h_dest, data)
    assert code == 0
    assert wbytes == len(data), '%s != %s' % (wbytes, len(data))
    h_dest.Close()

def handle_command_line():
    filename = sys.argv[1]
    h = CreateFile(
        filename,
        win32con.GENERIC_WRITE | win32con.GENERIC_READ,
        0,
        None,
        win32con.OPEN_EXISTING,
        0,
        None,
        )
    size = GetFileSize(h)
    extents = get_one_hundred_pieces(size)
    # Work from the end of the file backward, so the source can be
    # truncated as each piece is saved off.
    for start, end in reversed(tuple(extents)):
        length = end - start
        SetFilePointer(h, start, win32con.FILE_BEGIN)
        target_filename = '%s-%d' % (filename, start)
        save_file_bytes(h, length, target_filename)
        SetFilePointer(h, start, win32con.FILE_BEGIN)
        SetEndOfFile(h)

if __name__ == '__main__':
    handle_command_line()

This is a Python 2.6 script that uses pywin32 to call the Windows APIs. The same technique could easily be implemented in Delphi or C++.

The main routine is handle_command_line. It takes a filename and splits that file into chunks based on the get_one_hundred_pieces function. Your application would substitute a more appropriate function to determine the appropriate extents.

It then copies each chunk into its own file and calls SetEndOfFile to shrink the larger file (since the content is now in its own file).

I have tested this against a 1 GB file broken into 100 pieces and it ran in less than 30 seconds. Furthermore, it should theoretically run in a space-efficient manner (never consuming more than the total file size plus the largest chunk size at any given time). I suspect there are performance improvements to be had, but this is mostly a proof of concept.

Jason R. Coombs
A: 

You can copy the second half of the file into a new file and then truncate the original file. With this approach you copy only half of the file.
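
A minimal sketch of that idea with plain Python file operations (the function and its arguments are invented; it assumes no other process has the file open):

def split_in_two(path, split_at, tail_path):
    src = open(path, 'r+b')
    # Copy everything after the split point into the new file.
    src.seek(split_at)
    dst = open(tail_path, 'wb')
    while True:
        buf = src.read(64 * 1024)  # 64 KB buffer
        if not buf:
            break
        dst.write(buf)
    dst.close()
    # Shrink the original in place; the first half is never copied.
    src.truncate(split_at)
    src.close()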

denisenkom