tags:

views:

225

answers:

4

I'm writing a program that processes hundreds of files when run. Right now each file and folder is stored in an object I created (it contains the filepath, filetype, filesize, a pointer to an offset in the file, and if it is a directory), and those objects are placed in a NSMutableArray. A big problem with this is at the end of processing all the files, I need to get statistics for all the files in each folder. I am doing this using 2 nested for loops, and the performance is terrible.

My question is this: Is there a more efficient way to store a list of files and folders in cocoa (other than NSMutableArray, sets, etc) so I can quickly access all folders and all objects inside those folders? Is there some structure that will create an array of folders and and array of files and folders located in that parent folder?

+1  A: 

You could use C arrays instead of Cocoa collections. This could be faster.
Some links with performance comparisons:

weichsel
C arrays won't help if he's got O(n²) algorithm.
porneL
+3  A: 

Right now each file and folder is stored in an object I created (it contains the filepath, filetype, filesize, a pointer to an offset in the file, and if it is a directory), and those objects are placed in a NSMutableArray.

That's the correct solution. C arrays are trickier, since you'd have to handle size management yourself and you don't get bounds-checking.

A big problem with this is at the end of processing all the files, I need to get statistics for all the files in each folder. I am doing this using 2 nested for loops, and the performance is terrible.

Have you profiled using Shark and/or Instruments? That's the first thing you should check, if you haven't already. The bottleneck may not be where you think it is. Stop reading this answer (and any other answers) until you have profiled.


If you're currently blocking the main thread with this task, consider using NSOperationQueue instead. For each item in the top level, if it's a file, add an operation that examines the file, and if it's a directory, add an operation that will do the same iteration upon the directory's contents. If you can require Snow Leopard, you'll find blocks handy here, since you won't have to explicitly tell the directory-inventory operation which queue to add the examine-file operations to.

You should probably put a lid on the number of operations the queue will run at once, lest you end up running too many them. Mike Ash has details (that post is about GCD, but as of Snow Leopard, NSOperationQueue is based on GCD).

Assuming that you're displaying a running total in your UI, you can use the main queue to hold (possibly block-based) operations that add new information to the totals. If you're supporting Leopard, you can make your own “main” queue, but you'll have to make the operations run on the main thread yourself.

By the way, if you're totaling file sizes, you should consider whether want to uniquify on inode. If I hard-link a 200 MiB file into three other places, you'll see four files, but they're all really the same file, so they only take up 200 MiB, not 800.

Peter Hosey
loop-in-a-loop with linear search, whether over C array and with help of multicore, still sucks. That's O(n²) algorithm. He needs to build a tree or index (hash) for the array.
porneL
It's O(n²) if you examine the entire list of folders again in the inner loop for every item in the outer loop. That's not what he's doing: For each folder he encounters in the outer loop, he examines each item *in that folder* in the inner loop. Presumably, he is adding them to some sort of queue in order to process all of the files; I suggested turning this into an NSOperationQueue. The performance problem could, as a guess, be paging hell from objects piling up, hence my suggestions of (1) profile and (2) use NSOperationQueue (which implies one autorelease pool per operation).
Peter Hosey
A map table would help with inode uniquification, and a tree would guard against having a folder and one of its ancestors both in the list, but I can't see how either would help in the general case of discrete directory trees and no hard links.
Peter Hosey
+2  A: 

You might also want to consider a tree-like structure. You have a root node that corresponds to the filepath "/". Then root has many children, each for "/System", "/etc", "/Library", "/Users", etc.

When you add a node into this tree, you can have it percolate up and add the filesize of the new node to the parents (so that the tree will always have the correct volume size in the root node). Or you can have it compute the size as you need (recursively, most likely) and return that.

As for retrieving the paths in the first place, you've probably found NSFileManager. You should also take a look at NSDirectoryEnumerator and the lower-level FSGetCatalogInfoBulk.

Dave DeLong
A: 

Use NSMutableDictionary with file's directory as a key, and NSMutableArray of files as object. You'll be able to iterate directories quickly.

You can also split directory using [NSString pathComponents] and use dictionary of dictionaries to hold each part of the path (a tree). You can even mix files and dictionaries in the tree and use [foo isKindOfClass:[NSDictionary class]] to tell them apart.

Here's JSON version of what I'm talking about (which translates nicely to Cocoa classes):

/foo/bar/bazfile & /foo/quzfile =

{"foo": {
   "bar": {
      "bazfile": fileinfo
   },
   "quzfile": fileinfo
}
porneL