Hey all. I'm creating an application that is going to be generating and storing millions of images. Before I start on this, I'm wondering if anyone knows if it's better to generate more folders and only keep a few files in each, or should I use a few folders and fill them up with lots of files?

The generator will be written in C++ and the files will be accessed directly via GET requests.

Thanks, Steve

A: 

In terms of speed, manageability, etc.: go with more folders. If you examine a few big applications, they generally split their files across many folders. Most applications and file systems don't like too many files in one folder. From a programmer's point of view, it doesn't matter.

Onkelborg
A: 

As ever, you need to run some tests with various scenarios on your particular deployment platform. Note that you've not mentioned which OS/filesystem etc. that you're running on.

I would generally implement some balance between a deeply nested hierarchy (fast, but possibly difficult to manage) and a flat hierarchy with everything stored in one directory. The latter has caused me performance problems on most platforms in the past. How much data you need to store and how performant your solution needs to be will dictate how you structure your directories, and some experimentation will give you pointers here.

Brian Agnew
A: 

Things that come to mind:

Pro "fewer folders"

  • Every folder to be navigated means another click for the user, and another lag while the page loads.
  • If the user is going to navigate all (or a large portion) of the tree, all those extra files are just that many more bytes to be sent. This is trivial compared to the total unless you take the "many folders" strategy to an extreme, but it suggests there is a bound somewhere.

Pro "more folders":

  • Long lists of directory contents will force the user to scroll, or type ahead, or otherwise interact to find a particular file instead of just selecting it because they can take in the page at a glance.
  • A user clicking into folder Foo has to wait for all the items in that directory to be loaded before the page will finish rendering. This can be a noticeable lag and a lot of bytes for a user wanting only one image.
  • Every access of an item in a directory takes some time. On old-fashioned file systems this was often an O(n) operation; newer file systems support O(log n) access. How this affects the optimum operation of your system depends on the performance of the file system you plan to use. Also take note of the usual use case (which I presume is looking at a small number of directories rather than spanning the whole tree, no?).

Optimizing against these competing pressures will depend on knowing what a typical use pattern looks like, which means you may have to guess initially.

But just for convenient display on the screen I'd suggest more than a handful and fewer than a hundred entries per directory. Then you can collect statistics and adjust from there.
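One statistic that is cheap to collect is how many entries each directory actually ends up holding. Here is a minimal C++17 sketch using `std::filesystem`; the function names `entries_per_directory` and `count_into` are made up for illustration, and it assumes the whole tree is readable and small enough to walk in one pass.

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <map>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: records how many direct entries `dir` holds,
// then recurses into its real subdirectories (symlinks are not followed).
static void count_into(const fs::path& dir,
                       std::map<std::string, std::size_t>& counts) {
    std::size_t n = 0;
    for (const auto& entry : fs::directory_iterator(dir)) {
        ++n;  // files and subdirectories both count as entries
        if (entry.is_directory() && !entry.is_symlink()) {
            count_into(entry.path(), counts);
        }
    }
    counts[dir.string()] = n;
}

// Hypothetical entry point: maps every directory under `root`
// (including `root` itself) to its number of direct entries.
std::map<std::string, std::size_t> entries_per_directory(const fs::path& root) {
    assert(fs::is_directory(root));
    std::map<std::string, std::size_t> counts;
    count_into(root, counts);
    return counts;
}
```

Calling `entries_per_directory("tiles")` on a hypothetical `tiles` root and looking at the largest values in the returned map would show which directories have drifted past whatever upper bound you settle on.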

dmckee
A: 

@dmckee No clicks, as the images all load automatically. Think mapping software.

@Brian Agnew It will run and be served on some sort of Linux cloud thing. I'm not an IT guy by any stretch of the imagination, just the programmer. But it will definitely be scaled out to a bunch of machines.

@Onkelborg I concur. My inclination has been to go with more folders and fewer files, as well. I'm thinking the layout would be something like...

set/zoom-level/column/row.jpg

I wanted to use the filename/directory structure to pull files without querying a server. If we're zoomed in by a factor of five and the top-left coordinate is 25,600 x 15,360 in this larger image, then given a 256-pixel square tile, some basic math (25,600 / (256 x 5) = 20 and 15,360 / (256 x 5) = 12) would give me this URL:

2389/5/20/12.jpg
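A minimal sketch of that math in C++. It assumes that at zoom factor z one tile covers 256 × z source pixels per side, which is the interpretation that reproduces the URL above. The name `tile_path` and its parameters are hypothetical, not part of any existing code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: builds "set/zoom-level/column/row.jpg" from the
// top-left pixel coordinate (x, y) of a tile in the full-size image.
// Assumption: at zoom factor `zoom`, a tile spans tile_size * zoom source
// pixels per side, so column and row are plain integer divisions.
std::string tile_path(std::uint32_t set_id, std::uint32_t zoom,
                      std::uint64_t x, std::uint64_t y,
                      std::uint32_t tile_size = 256) {
    assert(zoom > 0 && tile_size > 0);
    const std::uint64_t span = static_cast<std::uint64_t>(tile_size) * zoom;
    const std::uint64_t column = x / span;  // unsigned division rounds down
    const std::uint64_t row = y / span;
    return std::to_string(set_id) + "/" + std::to_string(zoom) + "/" +
           std::to_string(column) + "/" + std::to_string(row) + ".jpg";
}
```

With the numbers from the post, `tile_path(2389, 5, 25600, 15360)` yields `2389/5/20/12.jpg`. Because the path is a pure function of the coordinates, the client can build it without asking the server anything.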

Where "2389" is a tile-set ID. So you can see images would only be stored in directories three levels deep. The directories with images would hold maybe 4 to ~100 images, depending on zoom level. Or maybe a dozen to a few hundred (with slightly fewer folders), if I went this way...

set/zoom-level/row/column.jpg

I came across a similar system that used a quad-tree layout and noticed that they had to break out into new folders at odd, non-systematic spots, which made me think they did it because of performance issues or other limitations.

As I've written this, I think I'm realizing that the first layout is probably the way to go. It's fewer items to iterate through to find the requested file. I'm just thinking of fragmentation, but I guess that will be IT's job. ;)