I'm trying to figure out the best way to store user-uploaded files in a file system. The files range from personal files to wiki files. The database will point to those files in some way that I have yet to figure out.

Basic Requirements:

  • Fairly decent security, so people can't guess filenames (Picture001.jpg, Picture002.jpg, Music001.mp3 is a big no-no)
  • Easily backed up and mirrorable (I'd prefer not to copy the entire HDD every time I want to back up. I like the idea of backing up just the newest items, but I'm flexible on the options here.)
  • Scalable to millions of files on multiple servers if needed.
+3  A: 

One technique is to store the data in files named after the hash (SHA1) of their contents. Such names are not easily guessable, any backup program should be able to handle them, and the store is easily sharded (by storing hashes starting with 0 on one machine, hashes starting with 1 on the next, etc.).

The database would contain a mapping between the user's assigned name and the SHA1 hash of the contents.
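As a rough illustration of the idea above, here is a minimal Python sketch. The function name, the `storage_root` layout, and the plain dict standing in for the database table are all assumptions made for the example, not part of the original answer.

```python
import hashlib
from pathlib import Path


def store_by_content_hash(data: bytes, storage_root: Path,
                          name_index: dict, user_filename: str) -> str:
    """Write data to a file named after its SHA-1 digest and record the mapping."""
    digest = hashlib.sha1(data).hexdigest()
    # Shard on the first hex character: 16 buckets that could live on 16 machines.
    shard_dir = storage_root / digest[0]
    shard_dir.mkdir(parents=True, exist_ok=True)
    target = shard_dir / digest
    if not target.exists():
        # Identical contents hash to the same name, so they are stored only once.
        target.write_bytes(data)
    # Hypothetical stand-in for the database row: user-visible name -> hash.
    name_index[user_filename] = digest
    return digest
```

Because the name depends only on the contents, uploading the same bytes twice yields the same file, which may or may not be what you want for per-user deletion.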

Greg Hewgill
Hash the contents? Wouldn't want to do that, too resource intensive...
KristoferA - Huagati.com
Why not? It's going to be a lot faster than the rate at which files are uploaded in any case. I have used this technique successfully in a high volume application in the past.
Greg Hewgill
+1  A: 

Use the SHA1 hash of the filename plus a salt (or, if you prefer, of the file contents; that makes detecting duplicate files easier but also puts a lot more load on the server). This may need some tweaking to be unique (e.g. add the uploader's user ID or a timestamp); the salt is there to make the name hard to guess.
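A minimal Python sketch of such a name generator follows. The function name, the parameter names, and the way the user ID and timestamp are joined are assumptions chosen for illustration; the salt is assumed to be a secret kept on the server.

```python
import hashlib
import time


def salted_storage_name(filename: str, user_id: int, salt: bytes,
                        timestamp: float = None) -> str:
    """Return a 40-character hex name derived from filename, uploader and time."""
    if timestamp is None:
        timestamp = time.time()
    # User ID and timestamp make collisions between uploads unlikely;
    # the secret salt keeps outsiders from recomputing the name.
    material = f"{user_id}:{timestamp}:{filename}".encode("utf-8")
    return hashlib.sha1(salt + material).hexdigest()
```

The resulting name would be stored in the database next to the original filename, since it cannot be reversed.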

Folder structure is then by parts of the hash.

For example, if the hash is "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12" then the folders could be:

/2
/2/2f/
/2/2f/2fd/
/2/2f/2fd/2fd4e1c67a2d28fced849ee1bb76e7391b93eb12

This is to prevent large folders: some operating systems have trouble enumerating folders with a million files, hence the few subfolders for parts of the hash. How many levels? That depends on how many files you expect, but 2 or 3 is usually reasonable.
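The folder layout above can be derived mechanically from the hash. This is a small Python sketch; the function name and the `levels` parameter are made up for the example.

```python
from pathlib import PurePosixPath


def sharded_path(digest: str, levels: int = 3) -> PurePosixPath:
    """Build /2/2f/2fd/<digest>-style paths from growing prefixes of the hash."""
    if len(digest) < levels:
        raise ValueError("digest is shorter than the number of levels")
    # Prefixes of length 1, 2, ..., levels become the nested folder names.
    prefixes = [digest[:i] for i in range(1, levels + 1)]
    return PurePosixPath("/", *prefixes, digest)
```

With the hash from the example and the default of three levels, this yields /2/2f/2fd/2fd4e1c67a2d28fced849ee1bb76e7391b93eb12.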

Michael Stum
A: 

Just in terms of one aspect of your question (security): the best way to safely store uploaded files in a filesystem is to ensure the uploaded files are out of the webroot (i.e., you can't access them directly via a URL - you have to go through a script).

This gives you complete control over what people can download (security) and allows for things such as logging. Of course, you have to ensure the script itself is secure, but it means only the people you allow will be able to download certain files.
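To make the "go through a script" idea concrete, here is a hedged Python sketch of the check such a script might run before sending a file. The names `resolve_download`, `DownloadDenied` and `allowed_users` are hypothetical, and a real application would look permissions up in its database rather than take a set.

```python
import logging
from pathlib import Path

log = logging.getLogger("downloads")


class DownloadDenied(Exception):
    """Raised when a download request must be refused."""


def resolve_download(storage_root: Path, stored_name: str,
                     user_id: int, allowed_users: set) -> Path:
    """Return the file to serve, or raise DownloadDenied."""
    if user_id not in allowed_users:
        raise DownloadDenied("user may not download this file")
    root = storage_root.resolve()
    candidate = (root / stored_name).resolve()
    # Refuse names such as "../secret" that would escape the storage root.
    if root not in candidate.parents:
        raise DownloadDenied("path escapes the storage root")
    if not candidate.is_file():
        raise DownloadDenied("no such file")
    log.info("user %s downloads %s", user_id, candidate.name)
    return candidate
```

The storage root itself should sit outside the web server's document root, so this function is the only way to reach the files.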

Phill Sacre
+3  A: 

GUIDs for filenames, with an automatically expanding folder hierarchy that holds no more than a couple of thousand files/folders in each folder. Backing up new files is done by backing up new folders.

You haven't indicated what environment and/or programming language you are using, but here's a C# / .NET / Windows example:

using System;
using System.IO;
using System.Xml.Serialization;

/// <summary>
/// Class for generating storage structure and file names for document storage.
/// GuidEx, ConfigSettings and Document are project types that are not shown here.
/// Copyright (c) 2008, Huagati Systems Co.,Ltd.
/// </summary>

public class DocumentStorage
{
    private static StorageDirectory _StorageDirectory = null;

    public static string GetNewUNCPath()
    {
        string storageDirectory = GetStorageDirectory();
        if (!storageDirectory.EndsWith("\\"))
        {
            storageDirectory += "\\";
        }
        return storageDirectory + GuidEx.NewSeqGuid().ToString() + ".data";
    }

    public static void SaveDocumentInfo(string documentPath, Document documentInfo)
    {
        //the FileStream object doesn't like NTFS streams, so this is disabled for now...
        return;

        //stores a document object in a separate "docinfo" stream attached to the file it belongs to
        //XmlSerializer ser = new XmlSerializer(typeof(Document));
        //string infoStream = documentPath + ":docinfo";
        //FileStream fs = new FileStream(infoStream, FileMode.Create);
        //ser.Serialize(fs, documentInfo);
        //fs.Flush();
        //fs.Close();
    }

    private static string GetStorageDirectory()
    {
        string storageRoot = ConfigSettings.DocumentStorageRoot;
        if (!storageRoot.EndsWith("\\"))
        {
            storageRoot += "\\";
        }

        //get storage directory if not set
        if (_StorageDirectory == null)
        {
            _StorageDirectory = new StorageDirectory();
            lock (_StorageDirectory)
            {
                string path = ConfigSettings.ReadSettingString("CurrentDocumentStoragePath");
                if (path == null)
                {
                    //no storage tree created yet, create first set of subfolders
                    path = CreateStorageDirectory(storageRoot, 1);
                    _StorageDirectory.FullPath = path.Substring(storageRoot.Length);
                    ConfigSettings.WriteSettingString("CurrentDocumentStoragePath", _StorageDirectory.FullPath);
                }
                else
                {
                    _StorageDirectory.FullPath = path;
                }
            }
        }

        int fileCount = (new DirectoryInfo(storageRoot + _StorageDirectory.FullPath)).GetFiles().Length;
        if (fileCount > ConfigSettings.FolderContentLimitFiles)
        {
            //if the directory has exceeded number of files per directory, create a new one...
            lock (_StorageDirectory)
            {
                string path = GetNewStorageFolder(storageRoot + _StorageDirectory.FullPath, ConfigSettings.DocumentStorageDepth);
                _StorageDirectory.FullPath = path.Substring(storageRoot.Length);
                ConfigSettings.WriteSettingString("CurrentDocumentStoragePath", _StorageDirectory.FullPath);
            }
        }

        return storageRoot + _StorageDirectory.FullPath;
    }

    private static string GetNewStorageFolder(string currentPath, int currentDepth)
    {
        string parentFolder = currentPath.Substring(0, currentPath.LastIndexOf("\\"));
        int parentFolderFolderCount = (new DirectoryInfo(parentFolder)).GetDirectories().Length;
        if (parentFolderFolderCount < ConfigSettings.FolderContentLimitFolders)
        {
            return CreateStorageDirectory(parentFolder, currentDepth);
        }
        else
        {
            return GetNewStorageFolder(parentFolder, currentDepth - 1);
        }
    }

    private static string CreateStorageDirectory(string currentDir, int currentDepth)
    {
        string storageDirectory = null;
        string directoryName = GuidEx.NewSeqGuid().ToString();
        if (!currentDir.EndsWith("\\"))
        {
            currentDir += "\\";
        }
        Directory.CreateDirectory(currentDir + directoryName);

        if (currentDepth < ConfigSettings.DocumentStorageDepth)
        {
            storageDirectory = CreateStorageDirectory(currentDir + directoryName, currentDepth + 1);
        }
        else
        {
            storageDirectory = currentDir + directoryName;
        }
        return storageDirectory;
    }

    private class StorageDirectory
    {
        public string DirectoryName { get; set; }
        public StorageDirectory ParentDirectory { get; set; }
        public string FullPath
        {
            get
            {
                if (ParentDirectory != null)
                {
                    return ParentDirectory.FullPath + "\\" + DirectoryName;
                }
                else
                {
                    return DirectoryName;
                }
            }
            set
            {
                if (value.Contains("\\"))
                {
                    DirectoryName = value.Substring(value.LastIndexOf("\\") + 1);
                    ParentDirectory = new StorageDirectory { FullPath = value.Substring(0, value.LastIndexOf("\\")) };
                }
                else
                {
                    DirectoryName = value;
                }
            }
        }
    }
}
KristoferA - Huagati.com
A: 

Expanding on Phill Sacre's answer, another aspect of security is to use a separate domain name for uploaded files (for instance, Wikipedia uses upload.wikimedia.org), and make sure that domain cannot read any of your site's cookies. This prevents people from uploading an HTML file with a script to steal your users' session cookies. Simply setting the Content-Type header isn't enough, because some browsers are known to ignore it and guess based on the file's contents; HTML can also be embedded in other kinds of files, so it's not trivial to check for HTML and disallow it.
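As a complement to the separate domain (not a replacement for it, and not something the answer above prescribes), the download script can also send headers that ask browsers not to render or sniff the file. This Python sketch assumes a hypothetical helper name and returns the headers as a plain dict for whatever framework is in use.

```python
def upload_response_headers(download_name: str) -> dict:
    """Headers that ask the browser to save the upload instead of rendering it."""
    # Strip characters that could break out of the quoted filename or the header.
    safe_name = (download_name.replace("\\", "_").replace('"', "_")
                 .replace("\r", "").replace("\n", ""))
    return {
        # A generic type, so the file is never served as text/html.
        "Content-Type": "application/octet-stream",
        # Ask the browser to download rather than display inline.
        "Content-Disposition": f'attachment; filename="{safe_name}"',
        # Ask browsers that honour it not to guess the type from the contents.
        "X-Content-Type-Options": "nosniff",
    }
```

Older browsers may ignore these hints, which is why the cookie-less upload domain remains the main defence.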

CesarB