I am trying to remove a large number of files from a location (by large I mean over 100,000), where the action is initiated from a web page. Obviously I could just use

string[] files = System.IO.Directory.GetFiles("path with files to delete");
foreach (var file in files) {
    System.IO.File.Delete(file);
}

Directory.GetFiles http://msdn.microsoft.com/en-us/library/wz42302f.aspx

This method has already been posted a few times: http://stackoverflow.com/questions/1288718/c-how-to-delete-all-files-and-folders-in-a-directory and http://stackoverflow.com/questions/1620366/c-delete-files-from-directory-if-filename-contains-a-certain-word

But the problem with this method is that if you have, say, a hundred thousand files, it becomes a performance issue, because GetFiles has to build the complete array of file paths before the loop can start.

Added to this, if a web page is waiting for a response from a method that is doing this, as you can imagine it will look a bit rubbish!

One thought I had was to wrap this up in an asynchronous web service call that, when it completes, fires back a response to the web page to say the files have been removed. Maybe put the delete method in a separate thread? Or maybe even use a separate batch process to perform the delete?

I have a similar issue when trying to count the number of files in a directory - if it contains a large number of files.

I was wondering if this is all a bit overkill? I.e. is there a simpler method to deal with this? Any help would be appreciated.

+1  A: 

Do it in a separate thread, or post a message to a queue (maybe MSMQ?) where another application (maybe a Windows service) is subscribed to that queue and performs the commands (i.e. "Delete e:\dir*.txt") in its own process.

The message should probably just include the folder name. If you use something like NServiceBus and transactional queues, then you can post your message and return immediately as long as the message was posted successfully. If there is a problem actually processing the message, then it'll retry and eventually go on an error queue that you can watch and perform maintenance on.

Neil Barnwell
Yep definitely separate thread!! :) I like your idea about using MSMQ! Will investigate and reply back!
Aim Kai
No, I don't recommend using another thread on an IIS app pool, I recommend a totally separate process, where you use something like MSMQ (i.e. with NServiceBus) to send that process a message telling it to perform the deletion. If you use NSB and transactional MSMQ queues, then you have safety all the way through that the message has been processed.
Neil Barnwell
Sorry I misunderstood you.. :)!
Aim Kai
A: 

Boot the work out to a worker thread and then return your response to the user.

I'd flag up an application variable to say that you are doing "the big delete job", to stop multiple threads running the same work. You could then poll another page that gives a progress update on the number of files removed so far, if you wanted to.
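A minimal sketch of that flag-plus-progress idea, assuming .NET 4.5 or later. The class name `BigDeleteJob` and its members are made up for illustration, and a static field stands in for the "application variable"; `Directory.EnumerateFiles` is used here only to keep the sketch short.

```csharp
using System;
using System.IO;
using System.Threading;

// Hypothetical helper (not from the thread): allows one delete job at a time
// and exposes a counter that a polling page could read.
public static class BigDeleteJob
{
    private static int _running;      // 0 = idle, 1 = a job is in progress
    private static int _deletedCount; // files removed by the current or last job

    public static bool IsRunning { get { return Volatile.Read(ref _running) == 1; } }
    public static int DeletedCount { get { return Volatile.Read(ref _deletedCount); } }

    // Returns false without starting anything if a job is already running.
    public static bool TryStart(string directory, string pattern)
    {
        if (Interlocked.CompareExchange(ref _running, 1, 0) != 0)
            return false;

        Interlocked.Exchange(ref _deletedCount, 0);
        var worker = new Thread(() =>
        {
            try
            {
                foreach (string file in Directory.EnumerateFiles(directory, pattern))
                {
                    File.Delete(file);
                    Interlocked.Increment(ref _deletedCount);
                }
            }
            catch (Exception)
            {
                // Swallowed only to keep the sketch short; real code should log this.
            }
            finally
            {
                // Clear the flag even if the job failed, so it can be retried.
                Interlocked.Exchange(ref _running, 0);
            }
        });
        worker.IsBackground = true;
        worker.Start();
        return true;
    }
}
```

The page that starts the job would call `BigDeleteJob.TryStart(...)` and return straight away; the polled page would report `IsRunning` and `DeletedCount`.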

Just a query but why so many files?

Pete Duncanson
100k files is not much. I currently work on an application that shuffles around 2-3 million files that are (by spec) split into directories of 100k-150k files. rsync requires 60 minutes for a dry run.
dbemerlin
It's a lot to be doing via a link/button on a website, is all I meant :)
Pete Duncanson
+3  A: 
  1. GetFiles is extremely slow on large directories, because it builds the whole array of paths before returning.
  2. If you are invoking it from a website, you might just start a new Thread to do the work.
  3. An ASP.NET AJAX call that returns whether there are still matching files can be used to do basic progress updates.

Below is an implementation of a fast Win32 wrapper to use in place of GetFiles. Use it in combination with a new Thread and an AJAX function such as: GetFilesUnmanaged(@"C:\myDir", "*.txt").GetEnumerator().MoveNext().

Usage

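// Requires: using System.IO; using System.Threading;
Thread workerThread = new Thread(() =>
{
    // This runs outside the request. An unhandled exception on this thread
    // terminates the process, so add try/catch and logging in real code.
    foreach (var file in GetFilesUnmanaged(@"C:\myDir", "*.txt"))
        File.Delete(file);
});
workerThread.Start();
//just go on with your normal requests, the directory will be cleaned while the user can just surf around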

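// Put this method in the same class as the code that starts the worker thread.
public static IEnumerable<string> GetFilesUnmanaged(string directory, string filter)
{
    return new FilesFinder(Path.Combine(directory, filter))
        .Where(f => (f.Attributes & FileAttributes.Normal) == FileAttributes.Normal
            || (f.Attributes & FileAttributes.Archive) == FileAttributes.Archive)
        // WIN32_FIND_DATA.cFileName holds only the file name, so prepend the
        // directory to return a path that File.Delete can use.
        .Select(f => Path.Combine(directory, f.FileName));
}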


// Namespaces needed by the code in this answer
using System;
using System.Collections;
using System.Collections.Generic;
using System.ComponentModel;
using System.IO;
using System.Linq;
using System.Runtime.InteropServices;
using System.Threading;
using Microsoft.Win32.SafeHandles;
using FILETIME = System.Runtime.InteropServices.ComTypes.FILETIME;

public class FilesEnumerator : IEnumerator<FoundFileData>
{
    #region Interop imports

    private const int ERROR_FILE_NOT_FOUND = 2;
    private const int ERROR_NO_MORE_FILES = 18;

    // Returning SafeFindFileHandle lets the marshaler wrap the raw handle, so it
    // is always released with FindClose() and never with CloseHandle().
    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Auto)]
    private static extern SafeFindFileHandle FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Auto)]
    private static extern bool FindNextFile(SafeFindFileHandle hFindFile, out WIN32_FIND_DATA lpFindFileData);

    #endregion

    #region Data Members

    private readonly string _fileName;
    private SafeFindFileHandle _findHandle;
    private WIN32_FIND_DATA _win32FindData;

    #endregion

    public FilesEnumerator(string fileName)
    {
        _fileName = fileName;
        _findHandle = null;
        _win32FindData = new WIN32_FIND_DATA();
    }

    #region IEnumerator<FoundFileData> Members

    public FoundFileData Current
    {
        get
        {
            if (_findHandle == null)
                throw new InvalidOperationException("MoveNext() must be called first");

            return new FoundFileData(ref _win32FindData);
        }
    }

    object IEnumerator.Current
    {
        get { return Current; }
    }

    public bool MoveNext()
    {
        if (_findHandle == null)
        {
            // First call: start the search.
            _findHandle = FindFirstFile(_fileName, out _win32FindData);
            if (_findHandle.IsInvalid)
            {
                int lastError = Marshal.GetLastWin32Error();
                if (lastError == ERROR_FILE_NOT_FOUND)
                    return false;

                throw new Win32Exception(lastError);
            }
        }
        else
        {
            // The search never produced a match, so there is nothing to continue.
            if (_findHandle.IsInvalid)
                return false;

            if (!FindNextFile(_findHandle, out _win32FindData))
            {
                int lastError = Marshal.GetLastWin32Error();
                if (lastError == ERROR_NO_MORE_FILES)
                    return false;

                throw new Win32Exception(lastError);
            }
        }

        return true;
    }

    public void Reset()
    {
        // Close any open search; the next MoveNext() starts again with FindFirstFile().
        if (_findHandle != null)
        {
            _findHandle.Dispose();
            _findHandle = null;
        }
    }

    public void Dispose()
    {
        // _findHandle is still null if MoveNext() was never called.
        if (_findHandle != null)
            _findHandle.Dispose();
    }

    #endregion
}

public class FilesFinder : IEnumerable<FoundFileData>
{
    readonly string _fileName;
    public FilesFinder(string fileName)
    {
        _fileName = fileName;
    }

    public IEnumerator<FoundFileData> GetEnumerator()
    {
        return new FilesEnumerator(_fileName);
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

public class FoundFileData
{
    public string AlternateFileName;
    public FileAttributes Attributes;
    public DateTime CreationTime;
    public string FileName;
    public DateTime LastAccessTime;
    public DateTime LastWriteTime;
    public UInt64 Size;

    internal FoundFileData(ref WIN32_FIND_DATA win32FindData)
    {
        Attributes = (FileAttributes)win32FindData.dwFileAttributes;
        CreationTime = ToDateTime(win32FindData.ftCreationTime);
        LastAccessTime = ToDateTime(win32FindData.ftLastAccessTime);
        LastWriteTime = ToDateTime(win32FindData.ftLastWriteTime);

        Size = ((UInt64)win32FindData.nFileSizeHigh << 32) + win32FindData.nFileSizeLow;
        FileName = win32FindData.cFileName;
        AlternateFileName = win32FindData.cAlternateFileName;
    }

    private static DateTime ToDateTime(FILETIME time)
    {
        // FILETIME fields are signed ints; go through uint so a low DWORD with
        // its top bit set is not sign-extended into the high half.
        long fileTime = unchecked(((long)(uint)time.dwHighDateTime << 32) | (uint)time.dwLowDateTime);
        return DateTime.FromFileTime(fileTime);
    }
}

/// <summary>
/// Safely wraps handles that need to be closed via FindClose() WIN32 method (obtained by FindFirstFile())
/// </summary>
public sealed class SafeFindFileHandle : SafeHandleZeroOrMinusOneIsInvalid
{
    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool FindClose(IntPtr hFindFile);

    // Parameterless constructor used by the P/Invoke marshaler.
    public SafeFindFileHandle()
        : base(true)
    {
    }

    protected override bool ReleaseHandle()
    {
        // Pass the raw handle: this SafeHandle is already being closed here.
        return FindClose(handle);
    }
}

// The CharSet must match the CharSet of the corresponding PInvoke signature
[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
public struct WIN32_FIND_DATA
{
    public uint dwFileAttributes;
    public FILETIME ftCreationTime;
    public FILETIME ftLastAccessTime;
    public FILETIME ftLastWriteTime;
    public uint nFileSizeHigh;
    public uint nFileSizeLow;
    public uint dwReserved0;
    public uint dwReserved1;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
    public string cFileName;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
    public string cAlternateFileName;
}
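If the site runs on .NET 4 or later, `Directory.EnumerateFiles` streams results lazily in much the same way, so the "anything left?" check and the file count from the question can be written without P/Invoke. This is a minimal sketch under that assumption; the class and method names are made up for illustration.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper (not from the thread) built on Directory.EnumerateFiles.
public static class LazyFileQueries
{
    // True while at least one matching file remains; stops at the first match.
    public static bool AnyFilesLeft(string directory, string pattern)
    {
        return Directory.EnumerateFiles(directory, pattern).Any();
    }

    // Counts matches by streaming them, without building an array of all paths.
    public static int CountFiles(string directory, string pattern)
    {
        return Directory.EnumerateFiles(directory, pattern).Count();
    }

    // Deletes matches as they are found and returns how many were removed.
    public static int DeleteFiles(string directory, string pattern)
    {
        int deleted = 0;
        foreach (string file in Directory.EnumerateFiles(directory, pattern))
        {
            File.Delete(file);
            deleted++;
        }
        return deleted;
    }
}
```

`AnyFilesLeft` plays the same role as the `GetEnumerator().MoveNext()` check, and `DeleteFiles` would still need to run on a worker thread or in a separate process.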
Jan Jongboom
If it takes a long time, the HTTP request can still timeout, though.
Neil Barnwell
I think the idea of wrapping an unmanaged chunk of code is certainly one way to go. But I still have a problem with working out if the process has finished or not. I guess I could put this in a web service call! Thanks for the response though Jan - I'll have a look at this code.. :)
Aim Kai
You can determine whether the process is finished using an ASP.NET AJAX call that calls `GetFilesUnmanaged(@"C:\myDir", "*.txt").GetEnumerator().MoveNext()`. It's a very cheap call in contrast to the default `GetFiles`, and if it returns true, the process hasn't finished yet :-).
Jan Jongboom
Great thanks for that :)
Aim Kai
A: 

You could create a simple ajax webmethod in your aspx code behind and call it with javascript.

Steve Danner
If it takes a long time, the HTTP request can still timeout, though.
Neil Barnwell
Yes I did think of an ajax webmethod - but this isn't the solution if I use the GetFiles method..
Aim Kai
A: 

The best choice (imho) would be to create a separate process to delete/count the files and check on the progress by polling; otherwise you might get problems with browser timeouts.

dbemerlin
A: 

Wow. I think you are definitely on the right track with having some other service or entity take care of the delete. In doing so you could also provide methods for tracking the progress of the delete and showing the result to the user using asynchronous JavaScript.

As others have said, putting this in another process is a great idea. You do not want IIS hogging resources on such long-running jobs. Another reason for doing so is security. You might not want to give your worker process the ability to delete files from the disk.

smaclell
+3  A: 

Can you put all your files in the same directory?

If so, why don't you just call Directory.Delete(string,bool) on the subdir you want to delete?

If you've already got a list of file paths you want to get rid of, you might actually get better results by moving them to a temp dir then deleting them rather than deleting each file manually.
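A rough sketch of the move-then-delete idea, assuming the staging directory is on the same volume as the files (a move across volumes turns into a copy plus delete). The class name, method name and `stagingRoot` parameter are made up for illustration, and whether this really beats deleting file by file is worth measuring on your own disk.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not from the thread).
public static class StagedDelete
{
    // Moves each file into a fresh scratch directory under stagingRoot,
    // then removes that directory and everything in it with one call.
    public static void MoveThenDelete(IEnumerable<string> files, string stagingRoot)
    {
        string staging = Path.Combine(stagingRoot, "delete-" + Guid.NewGuid().ToString("N"));
        Directory.CreateDirectory(staging);

        int index = 0;
        foreach (string file in files)
        {
            // Prefix a counter so files with the same name from different
            // source folders do not collide in the scratch directory.
            string target = Path.Combine(staging, index + "_" + Path.GetFileName(file));
            File.Move(file, target);
            index++;
        }

        // recursive: true removes the directory together with its contents.
        Directory.Delete(staging, true);
    }
}
```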

Cheers, Florian

Florian Doyon
Would I have to use the System.IO.Directory.GetFiles() method to get all the files I have to move, as in the following examples? http://msdn.microsoft.com/en-us/library/cc148994.aspx or http://www.eggheadcafe.com/community/aspnet/2/63950/moving-files-from-one-fol.aspx This would just cause the same performance issue I was talking about above, wouldn't it? I guess alternatively I could use a script such as rmdir <dirname> /q /s - might be worth looking into?
Aim Kai
I think it would cause a perf slowdown, but not as drastic as deleting all the files one by one. Moving a file is very cheap, deleting it not so, so you should still gain some performance by moving the files to the directory that you will then delete. The best approach would be to actually create the files in the same directory in the first place, if you can find any way to group the files according to the way they're going to be deleted when you get them.
Florian Doyon
Yep agree with you on the performance difference between moving and deleting files. Unfortunately the creation of the files is not directly under my control at the moment..
Aim Kai
A: 

Having more than about 1000 files in a single directory can become a serious performance problem.

If you are in the development stages now, you should consider putting in an algorithm that places each file into a random folder (inside your root folder), with enough folders that the number of files in each one stays under 1024.

Something like

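using System;

public class UserVolumeGenerator
{
    // Random is not thread-safe; add locking if this is called concurrently.
    private static readonly Random random = new Random();

    public int NumVolumes { get; set; }
    public int NumSubVolumes { get; set; }
    public string VolumesRoot { get; set; }

    public UserVolumeGenerator()
    {
        NumVolumes = 100;
        NumSubVolumes = 1000;
        VolumesRoot = "/var/myproj/volumes";
    }

    // Returns a relative folder such as "42/917" to create the next file in.
    public string GenerateVolume()
    {
        int volume = random.Next(NumVolumes);
        int subVolume = random.Next(NumSubVolumes);

        return volume + "/" + subVolume;
    }
}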

While doing this, also make sure that each time you create a file, you add its path to a HashMap or list at the same time. Periodically serialize this to the filesystem using something like JSON.net (for integrity's sake, so that even if your service fails, you can get the file list back from the serialized form).

When you want to clean up the files or query among them, first do a lookup in this HashMap or list and then act on the file. This is better than calling System.IO.Directory.GetFiles.

Cherian