I'm creating a Python script which accepts a path to a remote file and a number of threads, n. The file's size will be divided by the number of threads, and when each thread completes I want it to append the fetched data to a local file.

How do I manage it so that the threads append to the local file in the order they were generated, so the bytes don't get scrambled?

Also, what if I want to download several files simultaneously?

A: 

You need to fetch completely separate parts of the file in each thread. Calculate each chunk's start and end positions based on the file size and the number of threads; obviously, the chunks must not overlap.

For example, if the target file were 3000 bytes long and you wanted to fetch it using three threads:

  • Thread 1: fetches bytes 1 to 1000
  • Thread 2: fetches bytes 1001 to 2000
  • Thread 3: fetches bytes 2001 to 3000

You would pre-allocate an empty file of the original size, and have each thread write its chunk back to its respective position within the file.
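
A minimal sketch of that scheme, assuming Python 3, a hypothetical URL and file name, and a server that honours Range requests (note the offsets here are zero-based, as HTTP Range requires):

import threading
import urllib.request

url = "http://example.com/file.bin"   # hypothetical remote file
out_path = "file.bin"
num_threads = 3

# Find the total size with a HEAD request (assumes the server
# reports Content-Length).
head = urllib.request.Request(url, method="HEAD")
total = int(urllib.request.urlopen(head).headers["Content-Length"])

# Pre-allocate the output file to its final size.
with open(out_path, "wb") as f:
    f.truncate(total)

def fetch_chunk(start, end):
    # Ask the server for just this byte range (inclusive on both ends).
    req = urllib.request.Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})
    data = urllib.request.urlopen(req).read()
    # Each thread writes through its own handle at its own offset,
    # so the chunks can never interleave.
    with open(out_path, "r+b") as f:
        f.seek(start)
        f.write(data)

chunk = total // num_threads
threads = []
for i in range(num_threads):
    start = i * chunk
    end = total - 1 if i == num_threads - 1 else start + chunk - 1
    threads.append(threading.Thread(target=fetch_chunk, args=(start, end)))
for t in threads:
    t.start()
for t in threads:
    t.join()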

Mitch Wheat
A: 

You can use a thread-safe counter, like this:

import threading

class Counter:
    counter = 0
    lock = threading.Lock()

    @classmethod
    def inc(cls):
        with cls.lock:  # += on a class attribute is not atomic, so lock it
            cls.counter += 1
            return cls.counter

Calling Counter.inc() returns a freshly incremented number across threads, which you can use to keep track of the current block of bytes.
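
For instance, inside a worker thread (CHUNK_SIZE and the fetch step are hypothetical placeholders):

CHUNK_SIZE = 1000  # hypothetical fixed block size

def worker():
    # Each call hands out the next block number exactly once,
    # no matter how many threads call inc() concurrently.
    block = Counter.inc()
    offset = (block - 1) * CHUNK_SIZE
    # ... fetch bytes [offset, offset + CHUNK_SIZE) here ...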

That being said, there's no need to split a file download across several threads, because the downstream bandwidth is far slower than writing to disk, so one thread will always finish writing its block before the next one has finished downloading.

The best and least resource-hungry way is simply to link the download's file descriptor directly to a file object on disk.
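
A minimal sketch of that approach, assuming Python 3 and a hypothetical URL and file name:

import shutil
import urllib.request

# Stream the response straight to disk in a single thread; copyfileobj
# copies in fixed-size chunks, so the whole file never sits in memory.
with urllib.request.urlopen("http://example.com/file.bin") as response:
    with open("file.bin", "wb") as out:
        shutil.copyfileobj(response, out)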

Tor Valamo
The reason for threads is to multiplex the download. Multiple concurrent TCP sessions will deliver much higher throughput than a single session, even on a low-bandwidth DSL account (especially on that!). You have to synchronise because you don't control the order in which the concurrent fetches will complete.
Marcelo Cantos
... 'aight. ;-)
Tor Valamo
+6  A: 

You could coordinate the work with locks &c, but I recommend instead using Queue -- usually the best way to coordinate multi-threading (and multi-processing) in Python.

I would have the main thread spawn as many worker threads as you think appropriate (you may want to calibrate the trade-off between performance and load on the remote server by experimenting); every worker thread waits on the same global Queue.Queue instance, call it workQ, for "work requests" (wr = workQ.get() will do it properly -- each work request is obtained by a single worker thread, no fuss, no muss).

A "work request" can in this case simply be a triple (tuple with three items): identification of the remote file (URL or whatever), offset from which it is requested to get data from it, number of bytes to get from it (note that this works just as well for one or multiple files ot fetch).

The main thread pushes all work requests onto workQ (just workQ.put((url, offset, numbytes)) for each request) and waits for results to arrive on another Queue instance, call it resultQ (each result will also be a triple: the identifier of the file, the starting offset, and the string of bytes that resulted from fetching that file at that offset).

As each worker thread completes the request it's working on, it puts the result onto resultQ and goes back to fetch another work request (or to wait for one). Meanwhile the main thread (or a separate dedicated "writing thread" if needed -- i.e. if the main thread has other work to do, for example on the GUI) gets results from resultQ and performs the needed open, seek, and write operations to place the data at the right spot.

There are several ways to terminate the operation: for example, a special work request may ask the thread receiving it to terminate -- after all the actual work requests, the main thread puts as many of those on workQ as there are worker threads, then joins all the worker threads when all data have been received and written (many alternatives exist, such as joining the queue directly, or making the worker threads daemonic so they just go away when the main thread terminates, and so forth).
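
A minimal sketch of this whole design, assuming Python 3 (where Queue.Queue is spelled queue.Queue), a hypothetical URL, file name, and sizes, and a server that honours Range requests:

import queue
import threading
import urllib.request

NUM_WORKERS = 3
SENTINEL = None                      # the special "terminate" work request

workQ = queue.Queue()
resultQ = queue.Queue()

def worker():
    while True:
        wr = workQ.get()
        if wr is SENTINEL:
            break
        url, offset, numbytes = wr
        # Fetch just this slice of the remote file.
        req = urllib.request.Request(
            url, headers={"Range": "bytes=%d-%d" % (offset, offset + numbytes - 1)})
        data = urllib.request.urlopen(req).read()
        resultQ.put((url, offset, data))

# Hypothetical work list: one 3000-byte file fetched as three 1000-byte pieces.
url = "http://example.com/file.bin"
requests = [(url, off, 1000) for off in (0, 1000, 2000)]

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for wr in requests:
    workQ.put(wr)

# The main thread doubles as the "writing thread": it collects every
# result and seeks/writes it to the right spot in the local file.
open("file.bin", "wb").close()       # create (or empty) the local file
for _ in range(len(requests)):
    _, offset, data = resultQ.get()
    with open("file.bin", "r+b") as f:
        f.seek(offset)
        f.write(data)

# Termination: one sentinel per worker, then join them all.
for _ in threads:
    workQ.put(SENTINEL)
for t in threads:
    t.join()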

Alex Martelli
really nice, thanks!
Marconi
A: 

for "download several files simultaneously", I recommond this article: Practical threaded programming with Python . It provides a simultaneously download related example by combining threads with Queues, I thought it's worth a reading.

sunqiang