I work at a large university, and most of my department's backup needs are met by central network services. However, many of the users have collections of large files, such as medical imaging scans, which exceed the central storage available to them.

I am seeking to provide an improved backup solution for departmental resources and have set up a Linux server where staff can deposit these collections. However, I can foresee the server's storage being swamped by large collections of files that are rarely accessed. I have a system in mind to deal with this but want to make sure I am not reinventing the wheel.

My concept:

  1. Users copy files to the server.
  2. Scheduled jobs keep a complete, up-to-date copy of all files on a separate storage mechanism (a 1TB external drive is presently earmarked for this); a rough sketch of such a job follows this list.
  3. Files that have not been accessed for some time are cleared from the server but remain on the storage drive, keeping plenty of headroom in the live environment.
  4. A simple interface (probably web-based) gives users access to a list of all their files, from which they can request the ones they need; these are copied from the storage drive back to the live server, and an email notification is sent once the files have been copied over.
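
For step 2, something as simple as an rsync job run from cron would probably do. A rough, untested sketch, with placeholder paths standing in for the real mount points:

#!/bin/bash
# Nightly mirror of the live deposit area onto the external drive.
# LIVE_PATH and MIRROR_PATH are placeholders for the real locations.
LIVE_PATH=/srv/deposits
MIRROR_PATH=/mnt/external-1tb/deposits

# -a preserves permissions and timestamps; omitting --delete means files
# later cleared from the live server (step 3) still survive on the mirror
rsync -a "${LIVE_PATH}/" "${MIRROR_PATH}/"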

This concept is based on a PACS (Picture Archiving and Communication System) that I heard about in a previous job but did not directly use. That used a similar process of "near-line" backup to give access to a huge volume of data while allowing transmission to local machines to occur at times that did not clog up other parts of the network. It is a similar principle to that used by many museums and academic libraries, where their total "data holdings" are much greater than what is presented on direct access shelving.

Is there a simple open source system available that fits my requirements? Are there other systems that use a different paradigm but which might still fit my needs?

+1  A: 

Hi basswulf,

S3 is an interesting idea here. Use cron to sync files that have not been accessed for over a month up to Amazon S3, then create a web interface for users to restore the synced files back to the server. Send emails before you move files to S3 and after they are restored.

Limitless storage, only pay for what you use. Not quite an existing open-source project, but not too tough to assemble.

If you need good security, wrap the files in GPG encryption before pushing them to Amazon. GPG is very, very safe.
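
A rough sketch of that cron job, assuming the s3cmd client and a GPG key are already set up; the paths, bucket name and key ID below are placeholders:

#!/bin/bash
# Encrypt and push files that have not been accessed for 30+ days.
SOURCE=/srv/deposits                   # placeholder path
BUCKET=s3://example-archive-bucket     # placeholder bucket
KEY_ID=backup@example.org              # placeholder GPG key ID

find "${SOURCE}" -type f -atime +30 | while IFS= read -r FILE; do
    # encrypt to a temporary .gpg copy, then upload that copy
    gpg --encrypt --recipient "${KEY_ID}" --output "${FILE}.gpg" "${FILE}"
    s3cmd put "${FILE}.gpg" "${BUCKET}${FILE}.gpg"
    rm -f "${FILE}.gpg"
done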

A more expensive alternative is to store all the data locally. If you don't want to buy a large disk cluster or a big NAS, you could use HDFS (the Hadoop Distributed File System) and sync to a cluster that behaves similarly to S3. You can scale HDFS with commodity hardware; especially if you already have a couple of old machines and a fast network lying around, this could be much cheaper than a serious NAS, as well as more scalable in size.
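
Pushing files into HDFS from the same kind of job is straightforward once the cluster is running; the paths and file name here are made up for illustration:

# Copy an archived scan into the cluster (hadoop fs is the standard HDFS shell)
hadoop fs -mkdir /deposits
hadoop fs -put /srv/deposits/some-large-scan.dcm /deposits/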

Good luck! I look forward to seeing more answers on this.

mixonic
My qualm with that is that some of these files contain patient-identifiable information. That's why I'm looking to set something up on the local subnet rather than pushing any data out to the cloud (in fact, encryption of the long-term store is another thing I ought to consider, especially on a removable drive). Thanks.
basswulf
Ah, I've worked in medicine before; you didn't mention patient data. I'd do the same as above, but wrap all the files in GPG before pushing them up. With a strong enough key, they should stay safe. Or HDFS. I'm updating the answer now.
mixonic
A: 

-Please- do not upload patient data to S3 (at least not mine).

A: 

Google 'open source "file lifecycle management"'. I'm sorry, I'm only aware of commercial SAN apps; I don't know if there are F/OSS alternatives.

The way the commercial apps work is the filesystem appears normal -- all files are present. However, if the file has not been accessed in a certain period (for us, this is 90 days), the file is moved to secondary storage. That is, all but the first 4094 bytes are moved. After a file is archived, if you seek (read) past byte 4094 there is a slight delay while the file is pulled back in from secondary storage. I'm guessing files smaller than 4094 bytes are never sent to secondary storage, but I'd never thought about it.

The only problem with this scheme is if you happen to have something that tries to scan all of your files (a web search index, for example). That tends to pull everything back from secondary storage, fills up primary, and the IT folks start giving you the hairy eyeball. (I'm, ahem, speaking from some slight experience.)

You might try asking this over on ServerFault.com.

If you're handy, you might be able to come up with a similar approach using cron and shell scripts. You'd have to replace the 4094-byte stuff with symlinks (and note, the below is not tested).

# This is the server's local storage, available via network
SOURCE_STORAGE_PATH=/opt/network/mounted/path

# This is the remote big backup mount
TARGET_STORAGE_PATH=/mnt/remote/drive

# This is the number of days to start archiving files
DAYS_TO_ARCHIVE=90

# Find old regular files (symlinks and directories are skipped), using a temp file
TEMP_FILE=$(mktemp)
find "${SOURCE_STORAGE_PATH}" -type f -atime +${DAYS_TO_ARCHIVE} > "${TEMP_FILE}"

# Read the list line by line so file names containing spaces survive;
# if the list gets huge, this would be a good point to drop into something like Perl
while IFS= read -r FILE; do
    # split source into path and file name
    BASE_PATH=$(dirname "${FILE}")
    FILE_NAME=$(basename "${FILE}")

    # path to target (mirrors the source directory layout)
    TARGET_PATH="${TARGET_STORAGE_PATH}${BASE_PATH}"
    # make sure target exists (note -p option to mkdir)
    [ -d "${TARGET_PATH}" ] || mkdir -p "${TARGET_PATH}"
    # move source to target
    mv "${FILE}" "${TARGET_PATH}/"
    # replace source with symlink to target
    ln -s "${TARGET_PATH}/${FILE_NAME}" "${FILE}"
done < "${TEMP_FILE}"

rm -f "${TEMP_FILE}"
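
Restoring a requested file is just the reverse (again untested; the example path is made up):

# Bring an archived file back: replace the symlink with the real file
FILE=${SOURCE_STORAGE_PATH}/some/requested/file
ARCHIVED=$(readlink "${FILE}")    # where the real file lives on the backup mount
rm "${FILE}" && mv "${ARCHIVED}" "${FILE}"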
Andrew Barnett
Thanks for that - some interesting ideas. I'm planning to let this question sit over the weekend and come back to it on Monday.
basswulf