I work at a large university, and most of my department's backup needs are met by central network services. However, many of the users have collections of large files, such as medical imaging scans, which exceed the central storage available to them.

I am seeking to provide an improved backup solution for departmental resources and have set up a Linux server where staff can deposit these collections. However, I can foresee the server's storage being swamped by large collections of files that are rarely accessed. I have a system in mind to deal with this but want to make sure I am not reinventing the wheel.

My concept:

  1. Users copy files to the server.
  2. Scheduled jobs keep a complete, up-to-date copy of all files on a separate storage mechanism (a 1TB external drive is presently earmarked for this); a rough sketch of such a job follows this list.
  3. Files that have not been accessed for some time are cleared from the server but remain on the storage drive, keeping plenty of headroom in the live environment.
  4. A simple interface (probably web-based) gives users access to a list of all their files, from which they can request the ones they need; these are copied from the storage drive back to the live server, and an email notification is sent once the files have been copied over.
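
For step 2, something as simple as an rsync job run from cron would probably do. A rough, untested sketch, with placeholder paths standing in for the real mount points:

#!/bin/bash
# Nightly mirror of the live deposit area onto the external drive.
# LIVE_PATH and MIRROR_PATH are placeholders for the real locations.
LIVE_PATH=/srv/deposits
MIRROR_PATH=/mnt/external-1tb/deposits

# -a preserves permissions and timestamps; omitting --delete means files
# later cleared from the live server (step 3) still survive on the mirror
rsync -a "${LIVE_PATH}/" "${MIRROR_PATH}/"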

This concept is based on a PACS (Picture Archiving and Communication System) that I heard about in a previous job but did not directly use. That used a similar process of "near-line" backup to give access to a huge volume of data while allowing transmission to local machines to occur at times that did not clog up other parts of the network. It is a similar principle to that used by many museums and academic libraries, where their total "data holdings" are much greater than what is presented on direct access shelving.

Is there a simple open source system available that fits my requirements? Are there other systems that use a different paradigm but which might still fit my needs?

+1  A: 

Hi basswulf,

S3 is an interesting idea here. Use cron to sync files that have not been accessed for over a month up to Amazon S3, then create a web interface for users to restore the synced files back to the server. Send emails before you move files to S3 and after they are restored.

Limitless storage, only pay for what you use. Not quite an existing open-source project, but not too tough to assemble.

If you need good security, wrap the files in GPG encryption before pushing them to Amazon. GPG is very, very safe.
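
A rough sketch of that cron job, assuming the s3cmd client and a GPG key are already set up; the paths, bucket name and key ID below are placeholders:

#!/bin/bash
# Encrypt and push files that have not been accessed for 30+ days.
SOURCE=/srv/deposits                   # placeholder path
BUCKET=s3://example-archive-bucket     # placeholder bucket
KEY_ID=backup@example.org              # placeholder GPG key ID

find "${SOURCE}" -type f -atime +30 | while IFS= read -r FILE; do
    # encrypt to a temporary .gpg copy, then upload that copy
    gpg --encrypt --recipient "${KEY_ID}" --output "${FILE}.gpg" "${FILE}"
    s3cmd put "${FILE}.gpg" "${BUCKET}${FILE}.gpg"
    rm -f "${FILE}.gpg"
done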

A more expensive alternative is to store all the data locally. If you don't want to buy a large disk cluster or a big NAS, you could use HDFS (the Hadoop Distributed File System) and sync to a cluster that behaves similarly to S3. You can scale HDFS with commodity hardware; especially if you already have a couple of old machines and a fast network lying around, this could be much cheaper than a serious NAS, as well as more scalable in size.
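
Pushing files into HDFS from the same kind of job is straightforward once the cluster is running; the paths and file name here are made up for illustration:

# Copy an archived scan into the cluster (hadoop fs is the standard HDFS shell)
hadoop fs -mkdir /deposits
hadoop fs -put /srv/deposits/some-large-scan.dcm /deposits/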

Good luck! I look forward to seeing more answers on this.

mixonic
My qualm with that is that some of these files contain patient-identifiable information. That's why I'm looking to set something up on the local subnet rather than pushing any data out to the cloud (in fact, encryption of the long-term store is another thing I ought to consider, especially on a removable drive). Thanks.
basswulf
Ah, I've worked in medicine before; you didn't mention patient data. I'd do the same as above, but wrap all the files in GPG before pushing them up. With a strong enough key, they should stay safe. Or HDFS. I'm updating the answer now.
mixonic
A: 

-Please- do not upload patient data to S3 (at least not mine).

A: 

Google 'open source "file lifecycle management"'. I'm sorry, I'm only aware of commercial SAN apps; I don't know if there are F/OSS alternatives.

The way the commercial apps work is the filesystem appears normal -- all files are present. However, if the file has not been accessed in a certain period (for us, this is 90 days), the file is moved to secondary storage. That is, all but the first 4094 bytes are moved. After a file is archived, if you seek (read) past byte 4094 there is a slight delay while the file is pulled back in from secondary storage. I'm guessing files smaller than 4094 bytes are never sent to secondary storage, but I'd never thought about it.

The only problem with this scheme is if you happen to have something that tries to scan all of your files (a web search index, for example). That tends to pull everything back from secondary storage, fills up primary, and the IT folks start giving you the hairy eyeball. (I'm, ahem, speaking from some slight experience.)

You might try asking this over on ServerFault.com.

If you're handy, you might be able to come up with a similar approach using cron and shell scripts. You'd have to replace the 4094-byte stuff with symlinks (and note, the below is not tested).

# This is the server's local storage, available via network
SOURCE_STORAGE_PATH=/opt/network/mounted/path

# This is the remote big backup mount
TARGET_STORAGE_PATH=/mnt/remote/drive

# This is the number of days to start archiving files
DAYS_TO_ARCHIVE=90

# Find old regular files (symlinks and directories are skipped), using a temp file
TEMP_FILE=$(mktemp)
find "${SOURCE_STORAGE_PATH}" -type f -atime +${DAYS_TO_ARCHIVE} > "${TEMP_FILE}"

# Read the list line by line so file names containing spaces survive;
# if the list gets huge, this would be a good point to drop into something like Perl
while IFS= read -r FILE; do
    # split source into path and file name
    BASE_PATH=$(dirname "${FILE}")
    FILE_NAME=$(basename "${FILE}")

    # path to target (mirrors the source directory layout)
    TARGET_PATH="${TARGET_STORAGE_PATH}${BASE_PATH}"
    # make sure target exists (note -p option to mkdir)
    [ -d "${TARGET_PATH}" ] || mkdir -p "${TARGET_PATH}"
    # move source to target
    mv "${FILE}" "${TARGET_PATH}/"
    # replace source with symlink to target
    ln -s "${TARGET_PATH}/${FILE_NAME}" "${FILE}"
done < "${TEMP_FILE}"

rm -f "${TEMP_FILE}"
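
Restoring a requested file is just the reverse (again untested; the example path is made up):

# Bring an archived file back: replace the symlink with the real file
FILE=${SOURCE_STORAGE_PATH}/some/requested/file
ARCHIVED=$(readlink "${FILE}")    # where the real file lives on the backup mount
rm "${FILE}" && mv "${ARCHIVED}" "${FILE}"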
Andrew Barnett
Thanks for that - some interesting ideas. I'm planning to let this question sit over the weekend and come back to it on Monday.
basswulf