I'm working on a project where a user can submit a link to a sound file hosted on another site through a form. I'd like to download that file to my server and make it available for streaming. I might have to upload it to Amazon S3. I'm doing this in Django but I'm new to Python. Can anyone point me in the right direction for how to do this?

A: 

Here's how I would do it:

  1. Create a model, such as SoundUpload:

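    from django.contrib.auth.models import User
    from django.db import models


    class SoundUpload(models.Model):
        STATUS_CHOICES = (
            (0, 'Unprocessed'),
            (1, 'Ready'),
            (2, 'Bad File'),
        )
        uploaded_by = models.ForeignKey(User)
        # verify_exists=False skips the slow "does this URL respond?" check.
        # The option was removed from later Django releases; omit it there.
        original_url = models.URLField(verify_exists=False)
        download_url = models.URLField(null=True, blank=True)
        status = models.IntegerField(choices=STATUS_CHOICES, default=0)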
    
  2. Next, create a view with a ModelForm and save the submitted info to the database.

  3. Hook up a post_save signal on the SoundUpload model that kicks off a django-celery Task. This keeps the UI responsive while the data is processed in the background.

    from django.db.models.signals import post_save


    def process_new_sound_upload(sender, **kwargs):
        # Import inside the handler to prevent circular dependency issues.
        from your_project.tasks import ProcessSoundUploadTask
        # Only queue the task when the row is first created, not on every save.
        if kwargs.get('created', False):
            instance = kwargs.get('instance')
            ProcessSoundUploadTask.delay(instance.id)


    post_save.connect(process_new_sound_upload, sender=SoundUpload)
    
  4. In the ProcessSoundUploadTask task you'll want to:

    • Look up the model object based on the passed-in id.
    • Using pycurl, download the file to a temporary folder (w/very limited permissions).
    • Use ffmpeg (or similar) to ensure it's a real sound file. Do any other virus-style checks here (depends on how much you trust your users). If it turns out to be a bad file, set the SoundUpload.status field to 2 (Bad File), save it, and return to stop processing the task. Perhaps send out an email here.

    • Use boto to upload the file to s3. See this example.

    • Update the SoundUpload.download_url to be the s3 url, set the status to 1 (Ready), and save the object.
    • Do any other post-processing (sending notification emails, etc.)
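
To make the flow in step 4 concrete, here is a rough sketch of just the control flow, not a finished task. It uses only the standard library: `urllib` stands in for pycurl, and the validation and S3 steps are passed in as plain callables rather than real ffmpeg or boto calls. The names `fetch_url`, `process_sound_upload`, `is_valid_audio`, `upload_to_s3` and `MAX_BYTES` are made up for illustration, and treating a failed download as a Bad File is my own simplification.

```python
import os
import tempfile
import urllib.request
from urllib.parse import urlparse

# Mirror SoundUpload.STATUS_CHOICES from step 1.
UNPROCESSED, READY, BAD_FILE = 0, 1, 2

MAX_BYTES = 50 * 1024 * 1024  # hypothetical size cap; tune it for your files


def fetch_url(url, dest_path, max_bytes=MAX_BYTES):
    """Stream url to dest_path, refusing non-HTTP schemes and oversized bodies."""
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError("only http(s) URLs are accepted")
    total = 0
    with urllib.request.urlopen(url, timeout=30) as resp, open(dest_path, "wb") as out:
        while True:
            chunk = resp.read(64 * 1024)
            if not chunk:
                break
            total += len(chunk)
            if total > max_bytes:
                raise ValueError("download exceeds the size cap")
            out.write(chunk)


def process_sound_upload(upload, fetch, is_valid_audio, upload_to_s3):
    """Run step 4 on an object shaped like SoundUpload and return its final status.

    fetch(url, path), is_valid_audio(path) and upload_to_s3(path) are
    hypothetical callables standing in for pycurl, ffmpeg and boto.
    """
    # TemporaryDirectory creates an owner-only directory and deletes it on exit.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "upload.bin")
        try:
            fetch(upload.original_url, path)
            ok = is_valid_audio(path)
        except Exception:
            # Assumption: any download or validation error counts as a bad file.
            ok = False
        if ok:
            # Errors raised here propagate, so the queue can retry the task.
            upload.download_url = upload_to_s3(path)
            upload.status = READY
        else:
            upload.status = BAD_FILE
    upload.save()
    return upload.status
```

Inside the real Celery task you would load the object by id, then call something like `process_sound_upload(upload, fetch_url, your_ffmpeg_check, your_boto_upload)`. Passing the steps in as callables also makes the status handling easy to test without touching the network.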

The key to this approach is using django-celery. Once the task is kicked off through the post_save signal, the UI can return immediately, which makes for a very "snappy" experience. The task gets put onto an AMQP message queue that can be processed by multiple workers (dedicated EC2 instances, etc.), so you'll be able to scale without too much trouble. This may seem like a bit of overkill, but it's really not as much work as it seems.

sdolan
Thank you for a very detailed answer. I will try this approach.
knuckfubuck
@knuckfubuck: You're welcome. I'm happy to help with any issues you may run into as you develop this; just make sure you mark your comments w/my name so I'll get notified.
sdolan
@sdolan: I tried adding a ChoiceField for status like in your example but I'm getting an error that there is no ChoiceField in models. I did some searching but can't figure out how to fix this. Can you help?
knuckfubuck
@knuckfubuck: Sorry, it's IntegerField (ChoiceField is a forms field, not a model field). I've updated my answer.
sdolan
@sdolan: OK, I thought it might need to be a CharField or IntegerField. Thanks. Now when I'm hooking up the post_save, which I'm putting under the SoundUpload model, it is telling me SoundUpload is not defined. I tried using 'self' as well but get the same error.
knuckfubuck
@sdolan: Never mind that last one; I had the code nested incorrectly.
knuckfubuck
@sdolan: Got Celery w/ RabbitMQ working and I'm trying to write a task now but I'm having problems importing the models from my app into the tasks.py file. Is there something special I need to do for that file?
knuckfubuck
@knuckfubuck: Perhaps it's a circular dependency problem? Try burying your `from your_project.tasks import ProcessSoundUploadTask` inside the `process_new_sound_upload` function so it gets resolved at runtime. Later you'll want to place that code in its own `signals.py` file.
sdolan
@sdolan: That did it. Thanks for your quick replies.
knuckfubuck
@sdolan: Before I start a new question for this, maybe you can help. Instead of pycurl I want to use http://pyload.org/ to download files to my server, but how to implement it is over my head. Any ideas?
knuckfubuck
@knuckfubuck: Why do you want to use pyload over pycurl? It doesn't look like the right tool for this sort of thing.
sdolan
@sdolan: The links that will be submitted to me will be mostly One-Click Sharing sites and I will need a way to get past the captchas, which pyload does.
knuckfubuck
@knuckfubuck: I'd be curious to see how reliable it is in breaking captchas. I'd definitely start a new question for this. Though I would highly recommend getting everything working end-to-end with simple downloads before you add in the extra complexity of dealing with captchas and third party sites.
sdolan