views:

467

answers:

6

I'm new to threading and was wondering if it's bad to spawn a lot of threads for various tasks (in a server environment). Do threads take up a lot more memory/cpu compared to more linear programming?

+3  A: 

It depends on the platform.

Windows threads have to commit around 1MB of memory when created. It's better to have some kind of threadpool than spawning threads like a madman, to make sure you never allocate more than a fixed amount of threads. Also, when you work in Python, you're subject to the Global Interpreter Lock which hinders code that rely on lots of concurrent threads.

On Unix, you may consider using different processes instead of threads, or look at other asynchronous way of doing things (the Twisted server framework has interesting ways of handling asynchronous network tasks, and if you feel really adventurous you can take a look at stackless Python, a continuation framework which don't really use kernel threads at all).

Yann Schwartz
+11  A: 

You have to consider multiple things if you want to use multiple threads:

  1. You can only run #processors threads simultaneously. (Obvious)
  2. In Python each thread is a 'kernel thread' which normally takes a non-trivial amount of resources (8 mb stack by default on linux)
  3. Python has a global interpreter lock, which means only one python instructions can be processed at once independently of the number of processors. This lock is however released if one of your threads waits on IO.

The conclusion I take from that:

  1. If you are doing IO (Turbogears, Twisted) or are using properly coded extension modules (numpy) use threads.
  2. If you want to execute python code concurrently use processes (easiest with multiprocess module)
ebo
You're much clearer than I was :-)
Yann Schwartz
It's important to note that if you're not using multiple cores, using multiprocess may make things *slower* in terms of process startup costs and context switching.
Jason Baker
+1: If your algorithm isn't designed for concurrency, threads will do no good. Often, multiprocessing (via a simple Linux pipeline) will get you all the concurrency you can leverage. Many small processes linked by pipes is remarkably efficient and easy to build.
S.Lott
+1  A: 

Threads do have some CPU and memory overhead but unless you spawn hundreds or thousands of them, it usually isn't all that significant. The more important issue is that threads make correct programming a lot more difficult if you share any writable datastructures between concurrent threads. See the paper The Problem with Threads for an explanation why they aren't a good abstraction for concurrent programming.

Ants Aasma
+2  A: 

You might consider using microthreads if you need concurrency. There's a good article on the subject here. The advantage is that you're not creating "real" threads that eat up resources and cause context switching. Of course the downside is that you're not taking advantage of multicore technology.

This is the approach Kamaelia and stackless take.

If you're doing I/O, you might consider using asynchronous I/O as well. This can be a real pain to program, but it prevents you from having threads fight over CPU time. Unfortunately, I don't know of any platform independent way to do this in Python.

Jason Baker
A: 

Excellent answers all around! I just wanted to add that, if you decide to go with a thread pool (generally advisable if you decide that threads are suitable for your app), you would be well advised to reuse (and possibly adapt) an existing general-purpose implementation, such as Christopher Arndt's, rather than roll your own from scratch (which is always an instructive undertaking, to be sure, but less productive in terms of time it takes to get you properly working and debugged code;-).

Alex Martelli
+3  A: 

Since you're new to threading, there's something else worth bearing in mind - which I prefer to view as parallel scoping of values.

With traditional linear/sequential programming for any given object you only have one thread accessing and changing a piece of data. This is generally made safe due to having lexical scope. Specifically one function can operate on a variable's value without affect a global value. If you don't have lexical scope, or poor lexical scoping, then changing the value of a variable named "foo" in one function affects another called "foo". It's less commonly a problem these days, but still common enough to be alluding to.

With threading, you have the same issue, in more subtle ways. Whilst you still have lexical scoping helping you - in that a local value "X" inside one function is independent of another local value called "X" in another, the fact that data structures are mutable is a major source of bugs in threading.

Specifically, if a function is passed a mutable value, then in a threaded environment unless you have taken care, that function cannot guarantee that the value is not being changed by anything else.

This shared state is the source of probably 90-99% of bugs in threaded systems, and can be very hard to debug. As a result, if you're going to write a threaded system you should try to bear in mind the distance that your shared values will travel - ie the scope of parallel access.

In order to limit bugs you have a handful of options which are known to work:

  1. Use no shared state - pass shared data using thread safe queues
  2. Place locks around all shared data, and ensure this is used religiously throughout your code. This can be far harder sometimes than people think. The problem comes from when you "forget" to lock an object - something which is remarkably easy for people to do.
  3. Have a single object - an owner - in charge of shared state. Allow client threads to ask it for copies of values in the shared state, which are accompanied by a token. When they want to update the shared state, they pass back messages to the single object, along with the token they had. The owner can then determine whether an update clash has occured.

1 is most equivalent to unix pipelines. 3 is logically equivalent to version control, and is normally referred to as software transactional memory.

1 & 3 are modes of concurrency supported in Kamaelia which aims to eliminate bugs caused by class 2. (disclosure, I run the Kamaelia project) 2 isn't supported because it relies on "always getting everything right".

No matter which approach you use to solve your problem though, bearing in mind this issue, and the ways of dealing with it, and planning upfront how you intend to deal with it will save you no end of headaches later on.

Michael Sparks
Great. I'd add that threads are not very useful in python, and should be avoided for the extra complexibility and no benefit. They should be used only when the API you want to call is blocking and there's no non-blocking version of it (very rarely).
nosklo