Why don't you just have an automated process of some kind (using cron, say) perform the syncing for you?
You can have a cron job monitoring a "Drop box" directory (or directories), and then it can run a script to perform the replication for you.
Or you can have the users submit the file with some metadata in order to better route the file once it's uploaded.
Simply put: never let the users "choose" where it goes; rather, have them tell you "what it's for", and then have your scripts "know" where things go and how to get them there.
It's a fairly straightforward web app to do, even with just some Perl CGI or whatever, and the back-end plumbing is straightforward as well.
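For example, here's a rough sketch of that back-end plumbing, assuming a made-up drop box at /var/dropbox, destinations under /srv/outgoing, and a one-line ".meta" sidecar file naming the route for each upload:
#!/bin/sh
# crontab entry (hypothetical path): check the drop box every five minutes
# */5 * * * * /usr/local/bin/route-uploads.sh
cd /var/dropbox || exit 1
for meta in *.meta
do
    [ -e "$meta" ] || continue            # no sidecar files, nothing to do
    file=${meta%.meta}                    # the upload this sidecar describes
    dest=`cat "$meta"`                    # e.g. "invoices" or "reports"
    mkdir -p "/srv/outgoing/$dest"
    mv "$file" "/srv/outgoing/$dest/" && rm "$meta"
done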
Answering comment...
If you have a web app performing the upload to CGI, then you typically don't even get "control" of the request until after the file has been fully uploaded. (It depends somewhat on what server-side tech you use.) In any case, with a web app it's easy to "know" when the file is fully uploaded. Then your sync process can rely solely on the metadata to actually do the work on the file, and you don't create the metadata until after you have moved the file into the appropriate staging area, etc.
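As a sketch of that ordering (the paths, the helper name, and the ".meta" convention are all made up), the upload handler could hand the finished file to something like:
#!/bin/sh
# Hypothetical helper the upload handler calls once the file is fully received.
#   $1 = path of the completed upload, $2 = what the user said it's "for"
upload=$1
purpose=$2
staging=/var/staging                      # made-up staging area
name=`basename "$upload"`
mv "$upload" "$staging/" || exit 1
# Write the metadata only *after* the file is safely in the staging area,
# so the sync process never sees a half-moved file.
echo "$purpose" > "$staging/$name.meta"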
If you are simply using FTP or scp to copy files up into staging directories, then the solution is to have two processes. The first monitors the incoming directory; the second actually copies files.
The first process can simply look like this:
cd /your/upload/dir
touch /tmp/lastfiles                                     # first run: start with an empty "last" list
ls -l > /tmp/newfiles                                    # snapshot of the upload directory right now
comm -12 /tmp/lastfiles /tmp/newfiles > /tmp/samefiles   # lines identical to last run = files that have stopped changing
filelist=`awk '{print $9}' /tmp/samefiles`               # pull just the filenames out of the ls -l lines
[ -n "$filelist" ] && mv $filelist /your/copy/dir        # move the settled files into the staging directory
mv /tmp/newfiles /tmp/lastfiles                          # this run's listing becomes next run's "last" list
This works like this:
- Grabs a list of the current files in the incoming upload directory.
- Uses comm(1) to get the files that have not changed since the last time the process was run.
- Uses awk(1) to get the unchanged file names.
- Uses mv(1) to move the files to your "staging" directory.
- Finally, it takes the current list of files and makes it the last list for the next run.
The magic here is comm(1). 'comm -12 filea fileb' gives you the lines that are the same between the two files. While a new file is still coming in, its size will change as it is uploaded, so when you run 'ls -l' the next minute, its line won't match the line from the previous run -- the size (minimally) will be different. So comm will only find files whose dates, filenames, and sizes have not changed. Once you have that list, the rest is pretty straightforward.
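If it helps, here's a toy run with made-up listings showing why a file that's still growing drops out of the result:
$ cat /tmp/lastfiles
-rw-r--r-- 1 web web 1048576 Jan 10 12:00 done.zip
-rw-r--r-- 1 web web 2097152 Jan 10 12:01 growing.zip
$ cat /tmp/newfiles
-rw-r--r-- 1 web web 1048576 Jan 10 12:00 done.zip
-rw-r--r-- 1 web web 4194304 Jan 10 12:02 growing.zip
$ comm -12 /tmp/lastfiles /tmp/newfiles
-rw-r--r-- 1 web web 1048576 Jan 10 12:00 done.zip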
The only assumption this process makes is that your filenames don't have spaces in them (so awk can trivially pull the file name out of each 'ls -l' line). If you allow spaces, you'll need a slightly more clever mechanism to convert an 'ls -l' line into the file name.
Also, the 'mv $filelist /your/copy/dir' assumes no spaces in the file names, so it too would need to be modified (you could roll it into the awk script, having it make a system() call, perhaps; a sketch of that follows).
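For instance (purely a sketch), you could let awk rebuild the name from field 9 onward and issue the mv itself via system(); note it still chokes on names containing quotes and collapses runs of blanks:
awk '{
    name = $9                                      # ls -l puts the name in field 9...
    for (i = 10; i <= NF; i++) name = name " " $i  # ...and beyond, if it contains spaces
    if (name != "")
        system("mv \"" name "\" /your/copy/dir/")
}' /tmp/samefiles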
The second process is also simple:
cd /your/copy/dir
for i in *
do
    sync $i                             # your own sync script that Does The Right Thing
    mv $i /your/file/youve/copied/dir   # once synced, park the original in the "done" directory
done
Again, the "no spaces in filenames assumption" here. This process relies on a sync shell script that you've written that Does The Right Thing. That's left as an exercise for the reader.
Once a file is synced, it gets moved to another directory; any files that show up there have been "synced" properly. You could also simply delete the file, but I tend not to do that. Instead, I'd point a "delete files older than a week" job at that directory. That way, if you encounter a problem, you still have the original files someplace you can recover from.
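That cleanup can just be another cron job using find(1), pointed at the same (hypothetical) directory:
# Daily cron job: throw away synced originals after a week
find /your/file/youve/copied/dir -type f -mtime +7 -exec rm -f {} \;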
This stuff is pretty simple, but it's also robust.
As long as the first process runs "slower" than the uploads (i.e., the gap between runs is long enough that a file still being uploaded will have changed size from one run to the next), the schedule can be every minute, every hour, every day, whatever. At a minimum, it's safely restartable and self-recovering.
The dark side of the second process is if your sync process takes longer than your cron schedule. If you run it every minute and it takes more than one minute to run, you'll end up with two processes copying the same files.
If your sync process is "safe", you'll just end up copying the files twice... a waste, but usually harmless.
You can mitigate that by using a locking technique like the one sketched below to ensure that no more than one instance of your copy script runs at a time.
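One such technique is to take a lock around the whole run, for example with flock(1) (the wrapped script path here is made up; an atomic mkdir lock works too if you don't have flock):
#!/bin/sh
# Cron runs this wrapper instead of the copy script itself.
# flock -n exits immediately if another run still holds the lock.
flock -n /var/lock/copyfiles.lock /usr/local/bin/copyfiles.sh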
That's the meat of it. You can also use a combination of the two (a web app to upload with the metadata, plus the syncing process running automatically via cron).
You can also have a simple web page that lists all of the files in /your/copy/dir so folks can see whether their files have been synced yet. If a file is still in that directory, it hasn't finished syncing.
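That page can be as dumb as a shell CGI that lists the directory (assuming your web server runs CGI scripts; the script name is made up):
#!/bin/sh
# pending.cgi -- plain-text page of files still waiting to be synced
echo "Content-Type: text/plain"
echo ""
echo "Files still waiting to be synced:"
echo ""
ls -l /your/copy/dir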