The scenario: We have a Python script that checks thousands of proxies simultaneously. The program uses threads, one per proxy, to speed up the process. When it reaches the 1007th thread, the script crashes because of the thread limit.
My solution is: A global variable that is incremented when a thread spawns and decremented when a thread finishes. The function that spawns the threads monitors the variable so that the limit is not reached.
What would your solution be, friends?

Thanks for the answers.

+3  A: 

Does Python have any sort of asynchronous IO functionality? That would be the preferred answer IMO - spawning an extra thread for each outbound connection isn't as neat as having a single thread which is effectively event-driven.
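For what it's worth, Python 3 ships the asyncio module in its standard library, which covers this. Below is a minimal sketch of the single-threaded, event-driven approach. The names check_proxy and check_all and the max_in_flight cap are made up for illustration, and a "working" proxy is assumed to mean only that a TCP connection opens before the timeout.

```python
import asyncio

async def check_proxy(host, port, limit, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within timeout."""
    async with limit:  # cap the number of connections in flight
        try:
            reader, writer = await asyncio.wait_for(
                asyncio.open_connection(host, port), timeout)
        except (OSError, asyncio.TimeoutError):
            return False
        writer.close()
        try:
            await writer.wait_closed()
        except OSError:
            pass
        return True

async def check_all(proxies, max_in_flight=500):
    """Check a list of (host, port) tuples concurrently on one thread."""
    limit = asyncio.Semaphore(max_in_flight)
    results = await asyncio.gather(
        *(check_proxy(host, port, limit) for host, port in proxies))
    return dict(zip(proxies, results))

# Example call (addresses are placeholders):
# asyncio.run(check_all([("192.0.2.1", 8080), ("192.0.2.2", 3128)]))
```

No thread is created per proxy here, so the thread limit never comes into play; the semaphore only bounds how many sockets are open at once.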

Jon Skeet
Well, you have put me on the right track, thanks.
+1  A: 

Make sure your threads get destroyed properly after they've been used, or use a thread pool, although from what I've seen thread pools are not that effective in Python.

See here:

http://code.activestate.com/recipes/203871/

Tony
+1  A: 

Using the select module or a similar library would most probably be a more efficient solution, but that would require bigger architectural changes.

If you just want to limit the number of threads, a global counter should be fine, as long as you access it in a thread safe way.
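As a rough sketch of what "thread safe" could look like here: the counter is only touched while holding a threading.Condition, and the spawning function waits on that condition instead of polling. MAX_THREADS, active, and spawn are illustrative names, not anything from the original script.

```python
import threading

MAX_THREADS = 100                      # hypothetical cap, well below the OS limit
active = 0                             # the global counter from the question
active_cond = threading.Condition()    # guards `active`; lets the spawner wait

def spawn(target, *args):
    """Start target(*args) in a new thread, waiting while the cap is reached."""
    global active
    with active_cond:
        while active >= MAX_THREADS:
            active_cond.wait()         # releases the lock while waiting
        active += 1

    def runner():
        global active
        try:
            target(*args)
        finally:
            # Decrement even if target() raised, then wake the spawner.
            with active_cond:
                active -= 1
                active_cond.notify()

    thread = threading.Thread(target=runner)
    thread.start()
    return thread
```

Without the lock, two threads finishing at the same moment could both read the same value of the counter and one decrement would be lost.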

sth
+5  A: 

You want to do non-blocking I/O with the select module.

There are a couple of different specific techniques. select.select should work on every major platform. There are other variants that are more efficient (such as select.poll, select.epoll on Linux, or select.kqueue on BSD and macOS), which could matter if you are checking tens of thousands of connections simultaneously, but you will then need to write the code for your specific platform.
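Here is a minimal sketch of that idea using the standard selectors module, which wraps select and picks the most efficient variant available on the platform. check_many is a made-up name; the sketch assumes a POSIX-like platform, numeric IPv4 addresses (so no blocking DNS lookups), and that a successful TCP connect is all you want to test. The number of simultaneous sockets is still bounded by the process's file descriptor limit.

```python
import errno
import selectors
import socket
import time

def check_many(addresses, timeout=5.0):
    """Attempt a TCP connect to every (ip, port) at once; return {address: bool}."""
    sel = selectors.DefaultSelector()
    results = {}
    for addr in addresses:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)
        err = sock.connect_ex(addr)      # returns at once instead of blocking
        if err in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
            sel.register(sock, selectors.EVENT_WRITE, addr)
        else:
            results[addr] = False        # failed immediately
            sock.close()
    deadline = time.monotonic() + timeout
    while sel.get_map():
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        # A socket becomes writable once its connect has succeeded or failed.
        for key, _events in sel.select(remaining):
            sock = key.fileobj
            err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
            results[key.data] = (err == 0)
            sel.unregister(sock)
            sock.close()
    for key in list(sel.get_map().values()):   # still pending: timed out
        results[key.data] = False
        sel.unregister(key.fileobj)
        key.fileobj.close()
    sel.close()
    return results
```

All connection attempts are started up front and a single thread then waits for whichever one finishes next.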

R Samuel Klatchko
This could be very good; I have seen the technique in "Asynchronous Socket Programming". It could be much better than the solution with a fixed number of threads.
+2  A: 

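Use separate processes, with pipes to transfer data. Using threads in Python is pretty lame. From what I've heard, they don't actually run in parallel, even if you have a multi-core processor: CPython's global interpreter lock lets only one thread execute Python bytecode at a time, although a thread blocked on I/O releases it. The standard CPython 3 builds still have this lock.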

Oren
Actually using threads in most languages results in locking into a single processor or having so many memory barriers that the multiple processors run in lock step. It's not really Python specific.
D.Shawley
+1: I would recommend multiple processes over multiple threads though.
D.Shawley
+4  A: 

I've run into this situation before. Just make a pool of Tasks, and spawn a fixed number of threads that each run an endless loop which grabs a Task from the pool, runs it, and repeats. Essentially you're implementing your own thread abstraction and using the OS threads to implement it.

This does have drawbacks, the major one being that if your Tasks block for long periods of time they can prevent the execution of other Tasks. But it does let you create an unbounded number of Tasks, limited only by memory.
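A minimal sketch of this pattern with the standard queue module, whose Queue does its own locking. run_tasks and num_workers are illustrative names, and each Task is assumed to be a zero-argument callable.

```python
import queue
import threading

def run_tasks(tasks, num_workers=50):
    """Run each zero-argument callable in tasks on a fixed set of threads."""
    pending = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            task = pending.get()      # blocks until a task is available
            if task is None:          # sentinel: no more work for this worker
                break
            try:
                outcome = task()
            except Exception as exc:  # keep the worker alive if a task fails
                outcome = exc
            with results_lock:
                results.append(outcome)

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for task in tasks:
        pending.put(task)
    for _ in workers:
        pending.put(None)             # one sentinel per worker
    for w in workers:
        w.join()
    return results
```

Results come back in completion order, not submission order, so a Task that needs to be identified should return its own identifier along with its outcome.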

Keith Randall
This sounds quite good. I do not have to worry about the number of threads. But the task is to check proxies, which have variable waiting times. I should gather some statistics to find the optimal fixed number of threads, to be sure I do not block the other tasks.
I guess you have 2 options. Option 1 is to spawn 1006 OS threads. If you really want to know the minimum number of threads you need, just keep adding threads until you get to 100% CPU utilization. That will mean that you've overlapped all of your I/O wait.
Keith Randall
+1  A: 

Consider reducing the default thread stack size. At least on Linux, the default puts severe restrictions on the number of threads you can create. Linux reserves a chunk of the process's virtual address space for each thread's stack (often 8 to 10 MB). At 10 MB each, 300 threads x 10 MB = 3 GB of virtual address space dedicated to stacks, and on a 32-bit system you have roughly a 3 GB limit. You can probably get away with much less per thread.
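In Python the knob for this is threading.stack_size. A small sketch follows; start_small_stack_threads is a made-up helper, and 256 KiB is only an assumed value that should be checked against how deep your own call stacks go.

```python
import threading

def start_small_stack_threads(target, count, stack_bytes=256 * 1024):
    """Start `count` threads running target() with a reduced stack size."""
    # stack_size() affects threads created after the call and returns the
    # previous setting; it can raise ValueError or RuntimeError if the
    # platform rejects the requested size.
    previous = threading.stack_size(stack_bytes)
    try:
        threads = [threading.Thread(target=target) for _ in range(count)]
        for t in threads:
            t.start()
    finally:
        threading.stack_size(previous)   # restore the earlier setting
    return threads
```

A stack that is too small shows up as a crash in deeply nested or recursive code, so it is worth testing with realistic work before lowering it further.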

pau.estalella
+2  A: 

My solution is: A global variable that is incremented when a thread spawns and decremented when a thread finishes. The function that spawns the threads monitors the variable so that the limit is not reached.

The standard way is to have each thread get next tasks in a loop instead of dying after processing just one. This way you don't have to keep track of the number of threads, since you just fire a fixed number of them. As a bonus, you save on thread creation/destruction.
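In current Python the standard library already packages this as concurrent.futures.ThreadPoolExecutor. A minimal sketch, where check_with_pool is a made-up name and check stands for whatever function tests a single proxy:

```python
from concurrent.futures import ThreadPoolExecutor

def check_with_pool(proxies, check, num_threads=100):
    """Apply check(proxy) to every proxy using a fixed number of reused threads."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map yields results in the same order as the input list.
        return dict(zip(proxies, pool.map(check, proxies)))
```

The executor never starts more than num_threads threads, and each one keeps pulling the next proxy until the list is exhausted. An exception raised by check is re-raised when its result is read, so check should catch its own network errors and return a status instead.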

Rafał Dowgird
+1  A: 

A counting semaphore should do the trick.

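import socket
from threading import Semaphore, Thread

maxthreads = 1000
# Counting semaphore: at most maxthreads handler threads exist at once.
threads_sem = Semaphore(maxthreads)

class MyThread(Thread):
    def __init__(self, conn, addr):
        Thread.__init__(self)
        self.conn = conn
        self.addr = addr
    def run(self):
        global running
        try:
            # recv() returns bytes in Python 3, so compare against bytes.
            read = self.conn.recv(4096)
            if read == b'go away\n':
                running = False
        finally:
            # Close the connection and free the slot even if recv() raised.
            self.conn.close()
            threads_sem.release()

sock = socket.socket()
sock.bind(('0.0.0.0', 2323))
sock.listen(1)
running = True
while running:
    conn, addr = sock.accept()
    # Blocks here while maxthreads handlers are still running.
    threads_sem.acquire()
    MyThread(conn, addr).start()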
ephemient
+1  A: 

Twisted is a perfect fit for this problem. See http://twistedmatrix.com/documents/current/core/howto/clients.html for a tutorial on writing a client.

If you don't mind using alternate Python implementations, Stackless has light-weight (non-native) threads. The only company I know of doing much with it, though, is CCP; they use it for tasklets in their game on both the client and server. You still need to do async I/O with Stackless because if a thread blocks, the process blocks.

Ken Fox
Thanks a lot. I will try twisted.
+1  A: 

As mentioned in another answer, why do you spawn off a new thread for every single operation? This is a classic producer-consumer problem, isn't it? Depending a bit on how you look at it, the proxy checkers might be consumers or producers.

Anyway, the solution is to make a "queue" of "tasks" to process, and have each thread loop, checking whether there are any more tasks in the queue; if there aren't, it waits a predefined interval and checks again.

You should protect your queue with some locking mechanism, e.g. semaphores, to prevent race conditions.

It's really not that difficult, but it takes a bit of thinking to get it right. Good luck!

Erik A. Brandstadmoen