I'm writing code that will run on Linux, OS X, and Windows. It downloads a list of approximately 55,000 files from the server, then steps through the list of files, checking if the files are present locally. (With SHA hash verification and a few other goodies.) If the files aren't present locally or the hash doesn't match, it downloads them.

The server-side is plain-vanilla Apache 2 on Ubuntu over port 80.

The client side works perfectly on Mac and Linux, but gives me this error on Windows (XP and Vista) after downloading a number of files:

urllib2.URLError: <urlopen error <10048, 'Address already in use'>>

This link: http://bytes.com/topic/python/answers/530949-client-side-tcp-socket-receiving-address-already-use-upon-connect points me to TCP port exhaustion, but "netstat -n" never showed me more than six connections in "TIME_WAIT" status, even just before it errored out.

The code (called once for each of the 55,000 files it downloads) is this:

request = urllib2.Request(file_remote_path)
opener = urllib2.build_opener()
datastream = opener.open(request)
outfileobj = open(temp_file_path, 'wb')
try:
    while True:
        chunk = datastream.read(CHUNK_SIZE)
        if chunk == '':
            break
        else:
            outfileobj.write(chunk)
finally:
    outfileobj = outfileobj.close()
    datastream.close()

UPDATE: I find by grepping the log that it enters the download routine exactly 3998 times. I've run this multiple times and it fails at 3998 each time. Given that the linked article states that the available ports are 5000-1025 = 3975 (and some are probably expiring and being reused), it's starting to look a lot more like the linked article describes the real issue. However, I'm still not sure how to fix this. Making registry edits is not an option.
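A quick way to double-check the TIME_WAIT count, rather than eyeballing the terminal, is to tally the netstat lines programmatically. This is just a diagnostic sketch; it assumes `netstat -n` is on the PATH and prints one connection per line in English:

import subprocess

# Count sockets currently sitting in TIME_WAIT by parsing `netstat -n` output.
def count_time_wait():
    output = subprocess.Popen(['netstat', '-n'],
                              stdout=subprocess.PIPE).communicate()[0]
    return sum(1 for line in output.splitlines() if 'TIME_WAIT' in line)

print count_time_wait()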

+1  A: 

Thinking outside the box, the problem you seem to be trying to solve has already been solved by a program called rsync. You might look for a Windows implementation and see if it meets your needs.

Jim Garrison
For a variety of reasons, writing our own version of this in Python is the best option for us. The "killer feature" of rsync is partial downloading -- sending only the portions of files that are different. We're not duplicating that functionality; this simply walks a list and downloads changed files in their entirety. We're reinventing a pretty trivial wheel.
Schof
For what it's worth, `rsync -W` copies whole files
gnibbler
You can't really know whether a file has changed on a remote system without either downloading the whole thing and generating a checksum, or running code on the remote system to generate a checksum and then downloading if the result has changed. I'd agree that rsync or a similar tool (DeltaCopy on Windows, some library that implements rsync, a pure Python implementation, whatever) would be your best bet.
Lee B
As I said (not very clearly) in the question, I'm generating a file list (including hashes) on the server. Basically, I've re-implemented "rsync -W" in 500 lines of Python. For reasons that aren't really relevant to this question, doing that seemed like the best option in this particular case. Of course, this problem may make that a bad idea in retrospect if none of the above suggestions work. (Hey, it works perfectly on OS X/Unix/Linux.)
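The local check itself is nothing fancy; roughly this sketch (the identifiers here are made up for illustration, and the exact SHA variant doesn't matter for this question):

import hashlib
import os

# Download only if the file is missing or its SHA-1 doesn't match the server manifest.
def needs_download(local_path, expected_sha1):
    if not os.path.exists(local_path):
        return True
    sha = hashlib.sha1()
    with open(local_path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), ''):
            sha.update(chunk)
    return sha.hexdigest() != expected_sha1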
Schof
+3  A: 

If it really is a resource problem (the OS not freeing socket resources quickly enough),

try this:

request = urllib2.Request(file_remote_path)
opener = urllib2.build_opener()

datastream = None
retries = 3  # give the OS a few chances to free a socket
while retries:
    try:
        datastream = opener.open(request)
        break                   # success, stop retrying
    except urllib2.URLError, ue:
        if '10048' in str(ue.reason):
            retries -= 1
            if not retries:
                raise urllib2.URLError("Address already in use / retries exhausted")
        else:
            raise               # a different error, don't mask it

outfileobj = open(temp_file_path, 'wb')
try:
    while True:
        chunk = datastream.read(CHUNK_SIZE)
        if chunk == '':
            break
        else:
            outfileobj.write(chunk)
finally:
    outfileobj = outfileobj.close()
    datastream.close()

If you want, you can insert a sleep before retrying, or make it OS-dependent, as in the sketch below.
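For example (just a sketch; error 10048 is Windows-specific, so the pause only really matters there):

import sys
import time

# Error 10048 is the WinSock "address already in use" code, so only pause on Windows.
if sys.platform.startswith('win'):
    time.sleep(0.5)  # give the OS a moment to recycle a socket before retrying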

On my Win XP machine the problem doesn't show up (I reached 5000 downloads).

I watch my processes and network with Process Hacker.

Blauohr
Thanks for the link to Process Hacker.
Natascha
+1  A: 

You should seriously consider copying and modifying this pyCurl example for efficient downloading of a large collection of files.
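If you go that route, the key point is to reuse one curl handle (or a small pool of them) so connections stay open instead of a new socket being created for every file. A minimal single-handle sketch, assuming pycurl is installed and `file_list` holds your (url, path) pairs, would look roughly like this:

import pycurl

# One reusable handle, so libcurl can keep the HTTP connection alive across downloads.
curl = pycurl.Curl()
for file_remote_path, temp_file_path in file_list:
    outfileobj = open(temp_file_path, 'wb')
    curl.setopt(pycurl.URL, file_remote_path)
    curl.setopt(pycurl.WRITEFUNCTION, outfileobj.write)
    curl.perform()
    outfileobj.close()
curl.close()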

Jonathan Feinberg
+1  A: 

Instead of opening a new TCP connection for each request you should really use persistent HTTP connections - have a look at urlgrabber (or alternatively, just at keepalive.py for how to add keep-alive connection support to urllib2).
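If I remember the keepalive.py interface correctly, wiring it into urllib2 is roughly this (a sketch; check the module's docstring for the exact names):

import urllib2
from keepalive import HTTPHandler  # keepalive.py from urlgrabber

# Install an opener whose handler reuses HTTP connections instead of
# opening (and later TIME_WAIT-ing) a new socket for every request.
keepalive_handler = HTTPHandler()
opener = urllib2.build_opener(keepalive_handler)
urllib2.install_opener(opener)

# After this, plain urllib2.urlopen() calls share persistent connections.
datastream = urllib2.urlopen(file_remote_path)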

cmeerw
+1  A: 

All indications point to a lack of available sockets. Are you sure that only 6 are in TIME_WAIT status? If you're running so many download operations, it's very likely that netstat overruns your terminal buffer. I find that netstat overruns my terminal during normal usage periods.

The solution is either to modify your code to reuse sockets or to introduce a timeout. It also wouldn't hurt to keep track of how many open sockets you have, to optimize the waiting. The default TIME_WAIT delay on Windows XP is 120 seconds, so you want to sleep for at least that long if you run out of sockets. Unfortunately, it doesn't look like there's an easy way to check from Python when a socket has closed and left the TIME_WAIT status.

Given the asynchronous nature of the requests and timeouts, the best way to do this might be in a thread. Make each thread sleep for 2 minutes before it finishes. You can either use a Semaphore or limit the number of active threads to ensure that you don't run out of sockets.

Here's how I'd handle it. You might want to add an exception clause to the inner try block of the fetch section, to warn you about failed fetches.

import time
import threading
import Queue
import urllib2

# assumes url_queue is a Queue.Queue populated with (url_to_fetch, temp_file) tuples


class URLFetcher(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        try:
            # a non-blocking get raises Queue.Empty once the queue is drained
            file_remote_path, temp_file_path = self.queue.get(block=False)
        except Queue.Empty:
            return
        request = urllib2.Request(file_remote_path)
        opener = urllib2.build_opener()
        datastream = opener.open(request)
        outfileobj = open(temp_file_path, 'wb')
        try:
            while True:
                chunk = datastream.read(CHUNK_SIZE)
                if chunk == '':
                    break
                else:
                    outfileobj.write(chunk)
        finally:
            outfileobj.close()
            datastream.close()
            time.sleep(120)  # wait out the TIME_WAIT period before this thread exits
            self.queue.task_done()

elsewhere:


while not url_queue.empty():             # keep dispatching until every URL has been handed out
    if threading.activeCount() < 3975:   # hard limit of available ephemeral ports
        t = URLFetcher(url_queue)
        t.start()
    else:
        time.sleep(2)

url_queue.join()

Sorry, my python is a little rusty, so I wouldn't be surprised if I missed something.

EmFi