views: 212
answers: 9

I have about 750,000,000 files I need to store on disk. What's more, I need to be able to access these files randomly--any given file at any time--in the shortest time possible. What do I need to do to make accessing these files fastest?

Think of it like a hash table, only the hash keys are the filenames and the associated values are the files' data.

A coworker said to organize them into directories like this: if I want to store a file named "foobar.txt" and it's stored on the D: drive, put the file in "D:\f\o\o\b\a\r.\t\x\t". He couldn't explain why this was a good idea though. Is there anything to this idea?

Any ideas?

The crux of this is finding a file. What's the fastest way to find a file by name to open?

EDIT:

  • I have no control over the file system upon which this data is stored. It's going to be NTFS or FAT32.
  • Storing the file data in a database is not an option.
  • Files are going to be very small--a maximum of probably 1 KB.
  • The drives are going to be solid state.
  • Data access is virtually random, but I could probably figure out a priority for each file based on how often it is requested. Some files will be accessed much more than others.
  • Items will constantly be added, and sometimes deleted.
  • It would be impractical to consolidate multiple files into single files because there's no logical association between files.
  • I would love to gather some metrics by running tests on this stuff, but that endeavour could become as consuming as the project itself!
EDIT2:

I want to upvote several thorough answers, whether they're spot-on or not, and cannot because of my newbie status. Sorry guys!

    A: 

    Is there any relation between individual files? As far as access times go, what folders you put things in won't affect much; the physical locations on the disk are what matter.

    Amber
    A: 

    This sounds like it's going to be largely a question of filesystem choice. One option to look at might be ZFS; it's designed for high-volume applications.

    You may also want to consider using a relational database for this sort of thing. 750 million rows is sort of a medium-size database, so any robust DBMS (e.g. PostgreSQL) would be able to handle it well. You can store arbitrary blobs in the database too, so whatever you were going to store in the files on disk you can just store in the database itself.
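
    Just to make the blob-in-database idea concrete, here is a minimal sketch using SQLite from Python's standard library (the answer mentions PostgreSQL; SQLite, the table name and the helper functions are my own choices for illustration):

    import sqlite3

    conn = sqlite3.connect("files.db")
    conn.execute("CREATE TABLE IF NOT EXISTS blobs (name TEXT PRIMARY KEY, data BLOB)")

    def put(name, data):
        # Insert or overwrite the file contents keyed by its name.
        conn.execute("INSERT OR REPLACE INTO blobs (name, data) VALUES (?, ?)", (name, data))
        conn.commit()

    def get(name):
        # Look the blob up by name; returns None if it isn't stored.
        row = conn.execute("SELECT data FROM blobs WHERE name = ?", (name,)).fetchone()
        return row[0] if row else None

    put("foobar.txt", b"hello")
    print(get("foobar.txt"))  # b'hello'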

    Update: Your additional information is certainly helpful. Given a choice between FAT32 and NTFS, definitely choose NTFS. Don't store too many files in a single directory; 100,000 might be an upper limit to consider (although you will have to experiment; there's no hard and fast rule). Your friend's suggestion of a new directory for every letter is probably too much; you might consider breaking it up every four letters or so instead. The best value to choose depends on the shape of your dataset.

    The reason breaking up the name is a good idea is that filesystem performance typically decreases as the number of files in a directory increases. This depends heavily on the filesystem in use; FAT32, for example, will be horrible with probably only a few thousand files per directory. At the same time, you don't want to break up the filenames too much, so as to minimise the number of directory lookups the filesystem has to do.
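
    A rough sketch of what breaking the name up into fixed-size segments could look like (Python; the helper name, the four-character segment length and the choice to treat the last chunk as the file itself are assumptions, not something the answer specifies):

    import ntpath

    def segmented_path(root, filename, segment_len=4):
        # Split the whole name into fixed-size chunks, one directory per chunk;
        # the final chunk becomes the file name at the bottom of the tree.
        chunks = [filename[i:i + segment_len]
                  for i in range(0, len(filename), segment_len)]
        return ntpath.join(root, *chunks)

    print(segmented_path(r"D:\storage", "foobar.txt"))
    # D:\storage\foob\ar.t\xt
    print(segmented_path(r"D:\storage", "foobar.txt", segment_len=1))
    # the coworker's one-character-per-level variant (note a "." component
    # would collide with the current-directory name in practice)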

    Greg Hewgill
    The database solution will work well but might not be faster. I would be very wary of guessing without doing some tests first. Finding a file via a DB index means using a search tree. The suggested solution of a directory-based trie implementation also allows O(log n) access via a tree, but breaking it up by letters means you don't have as much control over how the nodes are split. Patterns in the filenames could result in a huge node.
    J. Loomis
    Right, I wouldn't attempt to claim that a database would be faster, but it's another option that should be considered. However, databases are designed to handle string type keys with arbitrary pathological patterns. :)
    Greg Hewgill
    A: 

    Why isn't storing the paths in a database table acceptable?

    Raj
    A: 

    My guess is he's thinking of a Trie data structure created on disk, where each node is a directory.

    Scanningcrew
    +1  A: 

    This highly depends on many factors:

    • What file system are you using?
    • How large is each file?
    • What type of drives are you using?
    • What are the access patterns?

    Accessing files purely at random is really expensive on traditional disks. One significant improvement you can get is to use a solid-state drive.

    If you can reason about the access pattern, you might be able to leverage locality of reference when placing these files.

    Another possible way is to use a database system, and store these files in the database to leverage the system's caching mechanism.

    Update:

    Given your update, is it possible to consolidate some files? 1 KB files are not very efficient to store, because file systems (FAT32, NTFS) allocate space in clusters and each file will consume at least one full cluster even if it is smaller than that. There is also usually a practical limit on the number of files in each folder before performance becomes a concern. You can do a simple benchmark by putting as many as 10k files in a folder to see how much performance degrades.
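
    To put rough numbers on the cluster overhead (a back-of-the-envelope calculation, assuming a 4 KB cluster and an average of 1 KB of data per file):

    files = 750_000_000
    avg_file_size = 1 * 1024      # ~1 KB of actual data per file
    cluster_size = 4 * 1024       # assumed typical NTFS cluster size

    logical = files * avg_file_size   # bytes of real data
    on_disk = files * cluster_size    # bytes actually allocated
    print(logical / 2**40, on_disk / 2**40)   # roughly 0.7 TiB vs 2.8 TiB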

    If you are set on using the trie structure, I would suggest surveying the distribution of file names and then breaking them into different folders based on that distribution.
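
    A quick way to survey that distribution before committing to a layout (a sketch; it assumes you can enumerate the names, and the prefix length is arbitrary):

    from collections import Counter

    def prefix_histogram(filenames, prefix_len=2):
        # Count how many files would land in each bucket if the first
        # `prefix_len` characters chose the directory.
        return Counter(name[:prefix_len].lower() for name in filenames)

    sample = ["foobar.txt", "foobaz.txt", "quux.dat"]   # stand-in for the real name list
    for bucket, count in prefix_histogram(sample).most_common(10):
        print(bucket, count)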

    rxin
    A: 

    I'd check out Hadoop's model.

    P

    Paul
    +1  A: 

    This depends to a large extent on what file system you are going to store the files on. The capabilities of file systems in dealing with large numbers of files vary widely.

    Your coworker is essentially suggesting the use of a Trie data structure. Using such a directory structure would mean that at each directory level there are only a handful of files/directories to choose from; this could help because, as the number of files within a directory increases, so does the time to access one of them (the actual time difference depends on the file system type).

    That said, I personally wouldn't go that many levels deep -- three to four levels ought to be enough to give the performance benefits -- most levels after that will probably have very few entries (assuming your file names don't follow any particular patterns).

    Also, I would store the file itself under its entire name; this will make it easier to traverse the directory structure manually, if required.

    So, I would store foobar.txt as f/o/o/b/foobar.txt
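
    A concrete sketch of that layout (Python; the function name and the four-level cap are illustrative choices, not part of the answer):

    import ntpath

    def trie_path(root, filename, depth=4):
        # One single-character directory per level, capped at `depth` levels,
        # with the file stored under its full name at the leaf.
        levels = list(filename[:depth])
        return ntpath.join(root, *levels, filename)

    print(trie_path("D:\\", "foobar.txt"))
    # D:\f\o\o\b\foobar.txt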

    Siddhartha Reddy
    A: 

    First of all, the file size is very small. Any file system will eat something like at least 4 times more space: a 1 KB file will occupy 4 KB on disk. Especially on SSDs, the 4 KB sector will be the norm.

    So you have to group several files into one physical file. 1024 files in one storage file seems reasonable. To locate the individual files within these storage files you have to use some RDBMS (PostgreSQL was mentioned, and it is good, but SQLite may be better suited to this) or a similar structure to do the mapping.
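
    A minimal sketch of that packing-plus-index idea (Python, with SQLite for the mapping; the container naming, table layout and helper functions are my own assumptions):

    import sqlite3

    conn = sqlite3.connect("index.db")
    conn.execute("CREATE TABLE IF NOT EXISTS entries "
                 "(name TEXT PRIMARY KEY, container TEXT, start INTEGER, size INTEGER)")

    def append(container, name, data):
        # Append the small file's bytes to a large container file and
        # record where they start and how long they are.
        with open(container, "ab") as f:
            offset = f.seek(0, 2)      # seek to end; this is where the data begins
            f.write(data)
        conn.execute("INSERT OR REPLACE INTO entries VALUES (?, ?, ?, ?)",
                     (name, container, offset, len(data)))
        conn.commit()

    def read(name):
        row = conn.execute("SELECT container, start, size FROM entries WHERE name = ?",
                           (name,)).fetchone()
        if row is None:
            return None
        container, start, size = row
        with open(container, "rb") as f:
            f.seek(start)
            return f.read(size)

    append("store_0000.bin", "foobar.txt", b"hello")
    print(read("foobar.txt"))   # b'hello'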

    The directory structure suggested by your friend sounds good, but it does not solve the physical storage problem. You may use a similar directory structure to store the storage files; it is better to name them using a numerical scheme.

    If you can, do not let them be formatted as FAT32; use at least NTFS or some recent file system of Unix flavor. As the total size of the files is not that big, NTFS may be sufficient, but ZFS is the better option...

    Malkocoglu
    +1  A: 

    That file algorithm will work, but it's not optimal. I would think that using 2- or 3-character "segments" would be better for performance - especially when you start considering doing backups.

    For example:
    d:\storage\fo\ob\ar\foobar.txt
    or
    d:\storage\foo\bar\foobar.txt

    There are some benefits to using this sort of algorithm:

    1. No database access is necessary.
    2. Files will be spread out across many directories. If you don't spread them out, you'll hit severe performance problems. (I vaguely recall hearing about someone having issues at ~40,000 files in a single folder, but I'm not confident in that number.)
    3. There's no need to search for a file. You can figure out exactly where a file will be from the file name.
    4. Simplicity. You can very easily port this algorithm to just about any language.

    There are some down-sides to this too:

    1. Many directories may lead to slow backups. Imagine doing recursive diffs on these directories.
    2. Scalability. What happens when you run out of disk space and need to add more storage?
    3. Your file names cannot contain spaces.
    Matthew Cole