views: 892
answers: 6

Hello! This looks like a simple question, but I didn't find anything similar here.

Since there is no file copy function in C, we have to implement file copying ourselves. I don't like reinventing the wheel even for trivial stuff like that, though, so I'd like to ask the crowd:

  1. What code would you recommend for file copying using fopen()/fread()/fwrite()?
  2. What code would you recommend for file copying using open()/read()/write()?

This code should be portable (Windows/Mac/Linux/BSD/QNX/you name it), stable, time-tested, fast, and memory efficient. Getting into a specific system's internals to squeeze out more performance is welcome (such as finding the filesystem's cluster size).

This seems like a trivial question, but, for example, the source code for the cp command isn't 10 lines of C.

+1  A: 

Here is a very easy and clear example: Copy a file. Since it is written in ANSI C without any platform-specific function calls, it should be pretty much portable.
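
For reference, the character-at-a-time style looks roughly like this (just a sketch of the general shape, not the linked code itself; copy_chars is a made-up name, and both streams are assumed to be open in binary mode):

#include <stdio.h>

/* Copy everything remaining on `in` to `out`, one character at a time.
   Returns 0 on success, -1 on a read or write error. */
int copy_chars(FILE *in, FILE *out)
{
    int c;
    while ((c = fgetc(in)) != EOF) {
        if (fputc(c, out) == EOF)
            return -1;              /* write error */
    }
    return ferror(in) ? -1 : 0;     /* distinguish end-of-file from a read error */
}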

merkuro
Sadly, it uses fgetc, which is quite inefficient.
David Schmitt
Good point! Although it's very clear and portable, it definitely lacks performance.
merkuro
@David: Is fgetc() inefficient? Stdio will do its own buffering using a buffer of size BUFSIZ (8192 bytes on my system). If you're using MSVC++, #define _CRT_DISABLE_PERFCRIT_LOCKS in single-threaded programs.
j_random_hacker
getc() may be marginally faster than fgetc(), since it can be implemented as a macro; however, I doubt CPU branching will be the bottleneck -- reading from disk will be.
j_random_hacker
@j_random: in my experience, in the cases where I've tested it, 8k buffers have never achieved optimal I/O performance. Sometimes 16k buffers have been literally twice as fast. Obviously MS don't want every file handle to carry a massive memory overhead, so they've compromised, but when copying a file you might want a different compromise. Also, the book-keeping overhead alone (checking bounds, updating file pointer positions and the like) might be significant if you're doing two file-handle operations per character. The only way to know is to write some code and see, of course.
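
If you want to experiment along these lines without abandoning stdio, setvbuf() lets you swap in a bigger buffer. A minimal sketch (open_buffered is a made-up name; setvbuf() must be called before any other operation on the stream, and the buffer must stay valid until the stream is closed):

#include <stdio.h>

/* Open `path` for reading with a caller-supplied stdio buffer of size
   `bufsize` (e.g. 64 * 1024) instead of the default BUFSIZ buffer. */
FILE *open_buffered(const char *path, char *iobuf, size_t bufsize)
{
    FILE *fp = fopen(path, "rb");
    if (fp != NULL) {
        /* If setvbuf fails, stdio silently keeps its default BUFSIZ
           buffer, which is still perfectly usable. */
        setvbuf(fp, iobuf, _IOFBF, bufsize);
    }
    return fp;
}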
Steve Jessop
Also of course one's intuition about file I/O depends on whether you think of files as being "generally a few k" or "generally a few gig"...
Steve Jessop
Just found another example which claims to be very fast: http://webscripts.softpedia.com/scriptDownload/Wb-Fcopy-C-Download-26527.html In any case, I would suggest starting with a really simple, readable construct encapsulated in a function, and only spending time on tuning if speed actually turns out to be a problem.
merkuro
@merkuro: the java2s code has some problems. For instance, it doesn't close the handles on error, and it doesn't handle EINTR very effectively. If you have to spawn a process to copy a file, then you're straying outside portability, and if this code is meant to be callable then it needs fixing.
Steve Jessop
@David: I've found OS-supplied buffering insufficient. See my answer below.
T.E.D.
+1  A: 

Depending on what you mean by copying a file, it is certainly far from trivial. If you mean copying the content only, then there is almost nothing to do. But generally, you need to copy the metadata of the file, and that's surely platform dependent. I don't know of any C library which does what you want in a portable manner. Just handling the filename by itself is no trivial matter if you care about portability.

In C++, there is the Boost Filesystem library.
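
To give a flavour of the platform dependence: even copying just the permission bits needs OS-specific calls. A minimal POSIX-only sketch (copy_permissions is a made-up name; ownership, timestamps, extended attributes, ACLs, resource forks and so on would each need further, even less portable, work):

#include <sys/stat.h>

/* Copy the permission bits from the file open on fdin to the file
   open on fdout.  POSIX-only; the Windows equivalent is entirely
   different (security descriptors, file attributes, ...). */
int copy_permissions(int fdin, int fdout)
{
    struct stat st;
    if (fstat(fdin, &st) == -1)
        return -1;
    return fchmod(fdout, st.st_mode & 07777);
}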

David Cournapeau
+1  A: 

One thing I found when implementing my own file copy, and it seems obvious but it isn't: I/O operations are slow. The speed of your copy is pretty much determined by how many of them you do, so clearly you want to do as few as possible.

The best results I found were when I got myself a ginormous buffer, read the entire source file into it in one I/O, then wrote the entire buffer back out in one I/O. Even having to do it in 10 batches was far slower. Trying to read and write each byte individually, as a naive coder might try first, was just painful.
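
A rough sketch of that approach with POSIX calls, just to show the shape (copy_whole_file is a made-up name, error handling is pared down, and the obvious caveat is that the whole file has to fit in memory -- for huge files you'd fall back to a looped, smaller buffer):

#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy an entire regular file in (ideally) one read and one write. */
int copy_whole_file(const char *src, const char *dst)
{
    int result = -1;
    int fdin = open(src, O_RDONLY);
    if (fdin == -1)
        return -1;

    struct stat st;
    if (fstat(fdin, &st) == 0) {
        char *buf = malloc(st.st_size);
        if (buf != NULL) {
            ssize_t n = read(fdin, buf, st.st_size);
            if (n == st.st_size) {
                int fdout = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0666);
                if (fdout != -1) {
                    if (write(fdout, buf, n) == n)
                        result = 0;   /* everything copied in one go */
                    close(fdout);
                }
            }
            free(buf);
        }
    }
    close(fdin);
    return result;
}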

T.E.D.
+2  A: 

As far as the actual I/O goes, the code I've written a million times in various guises for copying data from one stream to another goes something like this. It returns 0 on success, or -1 with errno set on error (in which case any number of bytes might have been copied).

Note that for copying regular files, you can skip the EAGAIN stuff, since regular files are always blocking I/O. But inevitably if you write this code, someone will use it on other types of file descriptors, so consider it a freebie.

There's a file-specific optimisation that GNU cp does, which I haven't bothered with here: for long blocks of zero bytes, instead of writing them you just extend the output file by seeking past the end, leaving a hole (a sparse file).

#include <errno.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

void block(int fd, int event) {
    struct pollfd topoll;
    topoll.fd = fd;
    topoll.events = event;
    poll(&topoll, 1, -1);
    // no need to check errors - if the stream is bust then the
    // next read/write will tell us
}

int copy_data_buffer(int fdin, int fdout, void *buf, size_t bufsize) {
    for(;;) {
       char *pos;   // plain char pointer, so the pointer arithmetic below is standard C
       // read data to buffer
       ssize_t bytestowrite = read(fdin, buf, bufsize);
       if (bytestowrite == 0) break; // end of input
       if (bytestowrite == -1) {
           if (errno == EINTR) continue; // signal handled
           if (errno == EAGAIN) {
               block(fdin, POLLIN);
               continue;
           }
           return -1; // error
       }

       // write data from buffer
       pos = buf;
       while (bytestowrite > 0) {
           ssize_t bytes_written = write(fdout, pos, bytestowrite);
           if (bytes_written == -1) {
               if (errno == EINTR) continue; // signal handled
               if (errno == EAGAIN) {
                   block(fdout, POLLOUT);
                   continue;
               }
               return -1; // error
           }
           bytestowrite -= bytes_written;
           pos += bytes_written;
       }
    }
    return 0; // success
}

// Default value. I think it will get close to maximum speed on most
// systems, short of using mmap etc. But porters / integrators
// might want to set it smaller, if the system is very memory
// constrained and they don't want this routine to starve
// concurrent ops of memory. And they might want to set it larger
// if I'm completely wrong and larger buffers improve performance.
// It's worth trying several MB at least once, although with huge
// allocations you have to watch for Linux's overcommit behaviour:
// malloc() succeeds, and the process is killed when the memory is
// touched, rather than malloc() returning NULL.
#ifndef FILECOPY_BUFFER_SIZE
    #define FILECOPY_BUFFER_SIZE (64*1024)
#endif

int copy_data(int fdin, int fdout) {
    // optional exercise for reader: take the file size as a parameter,
    // and don't use a buffer any bigger than that. This prevents 
    // memory-hogging if FILECOPY_BUFFER_SIZE is very large and the file
    // is small.
    for (size_t bufsize = FILECOPY_BUFFER_SIZE; bufsize >= 256; bufsize /= 2) {
        void *buffer = malloc(bufsize);
        if (buffer != NULL) {
            int result = copy_data_buffer(fdin, fdout, buffer, bufsize);
            free(buffer);
            return result;
        }
    }
    // could use a stack buffer here instead of failing, if desired.
    // 128 bytes ought to fit on any stack worth having, but again
    // this could be made configurable.
    return -1; // errno is ENOMEM
}

To open the input file:

// O_BINARY matters only on Windows; #define it to 0 on platforms that lack it
int fdin = open(infile, O_RDONLY|O_BINARY, 0);
if (fdin == -1) return -1;

Opening the output file is tricksy. As a basis, you want:

int fdout = open(outfile, O_WRONLY|O_BINARY|O_CREAT|O_TRUNC, 0x1ff);
if (fdout == -1) {
    close(fdin);
    return -1;
}

But there are confounding factors:

  • you need to special-case when the files are the same, and I can't remember how to do that portably (a rough POSIX-only sketch follows below).
  • if the output filename is a directory, you might want to copy the file into the directory.
  • if the output file already exists (open with O_EXCL to determine this and check for EEXIST on error), you might want to do something different, as cp -i does.
  • you might want the permissions of the output file to reflect those of the input file.
  • you might want other platform-specific meta-data to be copied.
  • you may or may not wish to unlink the output file on error.

Obviously the answers to all these questions could be "do the same as cp". In which case the answer to the original question is "ignore everything I or anyone else has said, and use the source of cp".
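
For the first bullet, the usual POSIX-only check (so not truly portable, and it won't catch every way two names can reach the same file) is to compare device and inode numbers. A sketch (same_file is a made-up name):

#include <sys/stat.h>

/* Return 1 if the two paths refer to the same underlying file,
   0 if they don't, -1 if either path can't be stat()ed. */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;
    if (stat(a, &sa) == -1 || stat(b, &sb) == -1)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}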

Btw, getting the filesystem's cluster size is next to useless. You'll almost always see speed increasing with buffer size long after you've passed the size of a disk block.
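
And, purely for illustration, a sketch of the zero-block trick mentioned above (both function names are made up; EINTR/EAGAIN handling and short writes are omitted, and the caller would need to ftruncate() the output to its final size at the end so that a trailing hole actually extends the file):

#include <string.h>
#include <unistd.h>

/* True if the block contains only zero bytes. */
static int is_all_zero(const char *buf, size_t len)
{
    return len > 0 && buf[0] == 0 && memcmp(buf, buf + 1, len - 1) == 0;
}

/* Write one block, leaving a hole in the output instead of writing zeros. */
static int write_block_sparse(int fdout, const char *buf, size_t len)
{
    if (is_all_zero(buf, len))
        return lseek(fdout, (off_t)len, SEEK_CUR) == (off_t)-1 ? -1 : 0;
    return write(fdout, buf, len) == (ssize_t)len ? 0 : -1;
}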

Steve Jessop
Your sample fails to offset buf by the amount already written, which will cause incomplete writes to restart from the top.
Hasturkun
Thanks. There's always one bug.
Steve Jessop
+1  A: 

This is the function I use when I need to copy from one file to another - with test harness:

/*
@(#)File:           $RCSfile: fcopy.c,v $
@(#)Version:        $Revision: 1.11 $
@(#)Last changed:   $Date: 2008/02/11 07:28:06 $
@(#)Purpose:        Copy the rest of file1 to file2
@(#)Author:         J Leffler
@(#)Copyright:      (C) JLSS 1991,1997,2000,2003,2005,2008
@(#)Product:        :PRODUCT:
*/

/*TABSTOP=4*/

#include "jlss.h"
#include "stderr.h"

#ifndef lint
/* Prevent over-aggressive optimizers from eliminating ID string */
const char jlss_id_fcopy_c[] = "@(#)$Id: fcopy.c,v 1.11 2008/02/11 07:28:06 jleffler Exp $";
#endif /* lint */

void fcopy(FILE *f1, FILE *f2)
{
    char            buffer[BUFSIZ];
    size_t          n;

    while ((n = fread(buffer, sizeof(char), sizeof(buffer), f1)) > 0)
    {
        if (fwrite(buffer, sizeof(char), n, f2) != n)
            err_syserr("write failed\n");
    }
}

#ifdef TEST

int main(int argc, char **argv)
{
    FILE *fp1;
    FILE *fp2;

    err_setarg0(argv[0]);
    if (argc != 3)
        err_usage("from to");
    if ((fp1 = fopen(argv[1], "rb")) == 0)
        err_syserr("cannot open file %s for reading\n", argv[1]);
    if ((fp2 = fopen(argv[2], "wb")) == 0)
        err_syserr("cannot open file %s for writing\n", argv[2]);
    fcopy(fp1, fp2);
    return(0);
}

#endif /* TEST */

Clearly, this version uses file pointers from standard I/O and not file descriptors, but it is reasonably efficient and about as portable as it can be.


Well, except for the error function - that's peculiar to my code. As long as you handle errors cleanly, you should be OK. The "jlss.h" header declares fcopy(); the "stderr.h" header declares err_syserr() amongst many other similar error-reporting functions. A simple version of the function follows - the real one adds the program name and does some other stuff.

#include "stderr.h"
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

void err_syserr(const char *fmt, ...)
{
    int errnum = errno;
    va_list args;
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    if (errnum != 0)
        fprintf(stderr, "(%d: %s)\n", errnum, strerror(errnum));
    exit(1);
}
Jonathan Leffler
I like it: simple, clean, works. I used 4096 as my buffer size, but I assume that any multiple of 512 should perform well.
John Scipione
+1  A: 

The size of each read should be a multiple of 512 (the sector size); 4096 is a good choice.

Arabcoder