I've read a lot recently about how writing multi-threaded apps is a huge pain in the neck, and have learned enough about the topic to understand, at least at some level, why it is so.

I've read that using functional programming techniques can help alleviate some of this pain, but I've never seen a simple example of functional code that is concurrent. So, what are some alternatives to using threads? Or at least, what are some ways to abstract them away so you needn't think about things like locking and whether a particular library's objects are thread-safe?

I know Google's MapReduce is supposed to help with the problem, but I haven't seen a succinct explanation of it.

Although I'm giving a specific example below, I'm more curious about general techniques than about solving this specific problem (though using the example to help illustrate other techniques would be helpful).

I came to the question when I wrote a simple web crawler as a learning exercise. It works pretty well, but it is slow. Most of the bottleneck comes from downloading pages. It is currently single-threaded, so it downloads only one page at a time. If the pages could be downloaded concurrently, it would speed things up dramatically, even if the crawler ran on a single-processor machine. I looked into using threads to solve the issue, but they scare me. Any suggestions on how to add concurrency to this type of problem without unleashing a terrible threading nightmare?

+1  A: 

Concurrency is quite a complicated subject in computer science, one that demands a good understanding of hardware architecture as well as operating system behavior.

Multi-threading is implemented differently depending on your hardware and your host OS, and the pitfalls are numerous. Note that in order to achieve "true" concurrency, threads are the only way to go: they are the only way for you as a programmer to share resources between different parts of your software while allowing them to run in parallel. Bear in mind what "parallel" means here: a standard CPU (dual/multi-core aside) can only do one thing at a time, so concepts like context switching come into play, and they have their own set of rules and limitations.

I think you should seek more general background on the subject, as you say, before you go about implementing concurrency in your program.

I guess the best place to start is the Wikipedia article on concurrency, and to go on from there.

Yuval A
Multithreading isn't the *only* way to achieve concurrency; you can also use multiprocessing. Threads are, after all, lightweight processes.
Jason Day
Multi-processing is obviously a technically available option, but forgive me if I cross it off as an option that should be used only in rare cases.
Yuval A
So you consider the *nix world to consist primarily of "rare cases"?
Dave Sherohman
+21  A: 

Functional programming helps with concurrency, but not because it avoids using threads.

Instead, functional programming preaches immutability and the absence of side effects.

This means that an operation can be scaled out to N threads or processes without having to worry about corrupting shared state.
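
As a minimal sketch of that idea in Python (the names `PAGES`, `word_count`, and `total_words` are made up for illustration): a function that only reads its argument and returns a new value can be spread over any number of worker threads, and the results are combined afterwards.

```python
from concurrent.futures import ThreadPoolExecutor

PAGES = ("a b c", "hello world", "", "one two three four")  # immutable sample data

def word_count(page_text):
    """Pure function: reads only its argument and returns a new value."""
    return len(page_text.split())

def total_words(pages, max_workers=4):
    # Each call is independent, so the calls can run on any number of threads
    # without locks; results are combined only after every call has returned.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(word_count, pages))
```

The answer is the same whether `max_workers` is 1 or 4, because no call can observe or change anything another call depends on.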

FlySwat
A: 

One simple way to avoid threading in your simple scenario is to download from different processes. The main process invokes other processes, with parameters, that download the files to a local directory, and then the main process can do the real work.

I don't think there is any simple solution to these problems. It's not a threading problem. It's the concurrency that breaks the human mind.

Igal Serban
+1  A: 

What typically makes multi-threaded programming such a nightmare is when threads share resources and/or need to communicate with each other. In the case of downloading web pages, your threads would be working independently, so you may not have much trouble.

One thing you may want to consider is spawning multiple processes rather than multiple threads. In the case you mention--downloading web pages concurrently--you could split the workload up into multiple chunks and hand each chunk off to a separate instance of a tool (like cURL) to do the work.
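
A rough sketch of the chunk-per-process idea in Python, under a loud assumption: a child Python one-liner that echoes its arguments stands in for the real download tool (such as cURL), so the example runs without a network. The helper names `split_into_chunks` and `run_chunks` are hypothetical.

```python
import subprocess
import sys

def split_into_chunks(items, n):
    """Deal items round-robin into n chunks."""
    return [items[i::n] for i in range(n)]

def run_chunks(urls, n_workers=2):
    # Stand-in worker: a child Python process that just echoes its arguments.
    # In a real crawler this command would be a downloader such as cURL.
    worker = [sys.executable, "-c", "import sys; print(' '.join(sys.argv[1:]))"]
    procs = [
        subprocess.Popen(worker + chunk, stdout=subprocess.PIPE, text=True)
        for chunk in split_into_chunks(urls, n_workers)
        if chunk
    ]
    # All children are already running side by side; now collect their output.
    return [p.communicate()[0].split() for p in procs]
```

Because each child has its own memory, there is nothing to lock; the parent only starts the processes and reads their results.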

Parappa
+9  A: 

Actually, threads are pretty easy to handle until you need to synchronize them. Usually, you use a thread pool: add tasks to it and wait until they are finished.
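
For illustration, here is what "add tasks and wait" can look like with Python's standard `concurrent.futures` module. `fake_download` is a hypothetical stand-in that sleeps instead of doing real network I/O, and `download_all` is an invented helper name.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def fake_download(url):
    """Hypothetical stand-in for a real download: sleeps instead of doing I/O."""
    time.sleep(0.1)
    return f"<html>{url}</html>"

def download_all(urls, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fake_download, u) for u in urls]  # add tasks
        wait(futures)                                            # block until all finish
        return [f.result() for f in futures]                     # same order as urls
```

The tasks never touch each other's data, so no explicit synchronization appears anywhere in the sketch.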

It is when threads need to communicate and access shared data structures that multithreading becomes really complicated. As soon as you have two locks, you can get deadlocks, and this is where multithreading gets really hard. Sometimes your locking code is wrong by just a few instructions. In that case, the bugs may show up only in production, on multi-core machines (if you developed on a single core; this happened to me), or be triggered by some other hardware or software. Unit testing doesn't help much here: testing finds bugs, but you can never be as sure as with "normal" apps.

bh213
+1  A: 

If your goal is to achieve concurrency, it will be hard to get away from using multiple threads or processes. The trick is not to avoid it but rather to manage it in a way that is reliable and not error-prone. Deadlocks and race conditions in particular are two aspects of concurrent programming that are easy to get wrong. One general approach to managing this is to use a producer/consumer queue: producer threads write work items to the queue and worker threads pull items from it. As long as you properly synchronize access to the queue, you're set.
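
A small sketch of the producer/consumer queue in Python, assuming the standard `queue.Queue` (which does its own locking) and a hypothetical `process` function in place of real work such as downloading a URL:

```python
import queue
import threading

def process(item):
    """Hypothetical work function; a crawler would download a URL here."""
    return item * item

def run_workers(items, n_workers=3):
    tasks = queue.Queue()      # queue.Queue synchronizes access internally
    results = queue.Queue()

    def worker():
        while True:
            item = tasks.get()
            if item is None:   # sentinel: no more work for this worker
                break
            results.put(process(item))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in items:         # producer side: enqueue the work items
        tasks.put(item)
    for _ in threads:          # one sentinel per worker so each one exits
        tasks.put(None)
    for t in threads:
        t.join()
    # Workers finish in no particular order, so sort for a stable result.
    return sorted(results.get() for _ in range(len(items)))
```

The only shared objects are the two queues, so the worker code itself contains no locks. Note the sketch uses `None` as the stop signal, so work items must not be `None`.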

Also, depending on your problem, you may also be able to create a domain specific language which does away with concurrency issues, at least from the perspective of the person using your language... of course the engine which processes the language still needs to handle concurrency, but if this will be leveraged across many users it could be of value.

DSO
A: 

You might watch the MSDN video on the F# language: PDC 2008: An introduction to F#

This includes the two things you are looking for. (Functional + Asynchronous)

TomWij
+7  A: 

I'll add an example of how functional code can be used to safely make code concurrent.

Here is some code you might want to run in parallel, so you don't have to wait for one file to finish before starting to download the next:

void DownloadHTMLFiles(List<string> urls)
{
    foreach(string url in urls)
    {
        DownloadOneFile(url);  // download the HTML and save it to a file with a name based on the url - perhaps used for caching.
    }
}

If you have a number of files the user might spend a minute or more waiting for them all. We can re-write this code functionally like this, and it basically does the exact same thing:

urls.ForEach(DownloadOneFile);

Note that this still runs sequentially. However, not only is it shorter, we've gained an important advantage here. Since each call to the DownloadOneFile function is completely isolated from the others (for our purposes, available bandwidth isn't an issue), you could very easily swap out the ForEach function for another very similar function: one that kicks off each call to DownloadOneFile on a separate thread from a thread pool.

It turns out .NET has just such a function available in the Parallel Extensions. So, by using functional programming you can change one line of code and suddenly have something run in parallel that used to run sequentially. That's pretty powerful.
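
The same swap can be sketched in Python rather than C#. This is an analogue, not the Parallel Extensions API: `for_each` and `parallel_for_each` are invented helpers, and `download_one_file` is a stand-in that records a fake file name instead of touching the network.

```python
from concurrent.futures import ThreadPoolExecutor

def for_each(func, items):
    """Sequential: call func on each item in turn."""
    for item in items:
        func(item)

def parallel_for_each(func, items, max_workers=8):
    """Same contract, but each call runs on a thread-pool thread."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator so an exception raised in func surfaces here.
        list(pool.map(func, items))

saved = {}

def download_one_file(url):
    # Hypothetical stand-in for a real download. Each call writes only its own
    # key, mirroring "one output file per URL" in the original example.
    saved[url] = url.replace("/", "_") + ".html"
```

Switching from `for_each(download_one_file, urls)` to `parallel_for_each(download_one_file, urls)` is the one-line change; nothing inside `download_one_file` has to know about threads.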

Joel Coehoorn
A: 

For python, this looks like an interesting approach: http://members.verizon.net/olsongt/stackless/why_stackless.html#introduction

Tristan Havelick
It's not. It doesn't help you avoid roping yourself.
Igal Serban
From my quick look at Stackless, it seems to be a cooperative (that is, non-preemptive) soft threading solution, which, in my experience, just increases the difficulty of writing parallel code.
Software Monkey
A: 

Use Twisted: "Twisted is an event-driven networking engine written in Python" (http://twistedmatrix.com/trac/). With it, I could make 100 asynchronous HTTP requests at a time without using threads.

yogman
Don't lose the problem in the example.
Joel Coehoorn
+4  A: 

There are a couple of brief mentions of asynchronous models, but no one has really explained them, so I thought I'd chime in. The most common alternative to multi-threading that I've seen is an asynchronous architecture. All that really means is that instead of executing code sequentially in a single thread, you use a polling method to initiate some functions and then come back and check periodically until there's data available.

This really only works in models like your aforementioned crawler, where the real bottleneck is I/O rather than CPU. In broad strokes, the asynchronous approach would initiate the downloads on several sockets, and a polling loop would periodically check whether they have finished downloading; once one has, we can move on to the next step. This allows you to run several downloads that are waiting on the network by context switching within the same thread, as it were.

The multi-threaded model would work much the same, except that it uses separate threads rather than a polling loop checking multiple sockets in the same thread. In an I/O-bound application, asynchronous polling works almost as well as threading for many use cases, since the real problem is simply waiting for the I/O to complete and not so much waiting for the CPU to process the data.
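
To make the single-threaded model concrete, here is a hedged sketch using Python's standard `asyncio` event loop, which plays the role of the polling loop described above. `fake_fetch` is hypothetical: it awaits a timer where a real crawler would await a socket.

```python
import asyncio

async def fake_fetch(url, delay):
    """Hypothetical stand-in for a download: awaits a timer instead of a socket."""
    await asyncio.sleep(delay)   # yields to the event loop while "waiting on I/O"
    return f"{url}: done"

async def crawl(urls):
    # Start every fetch, then let the single-threaded event loop interleave them.
    return await asyncio.gather(*(fake_fetch(u, 0.1) for u in urls))

def run(urls):
    return asyncio.run(crawl(urls))
```

All the waits overlap, so three fetches take roughly as long as one, yet only one thread ever runs and no locks are involved.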

Another real-world example is a system that needed to execute a number of other executables and wait for their results. This can be done in threads, but it's also considerably simpler and almost as effective to fire off several external applications as Process objects, then check back periodically until they have all finished executing. This puts the CPU-intensive parts (the running code in the external executables) in their own processes, but the data processing is all handled asynchronously.

The Python FTP server library I work on, pyftpdlib, uses the Python asyncore library to serve FTP clients with only a single thread, using asynchronous socket communication for file transfers and command/response.

For further reading, see the Python Twisted library's page on Asynchronous Programming. While somewhat specific to Twisted, it also introduces async programming from a beginner's perspective.

Jay
A: 

Your specific example is seldom solved with multi-threading. As many have said, this class of problems is I/O-bound, meaning the processor has very little work to do and spends most of its time waiting for data to arrive over the wire so it can process it. Similarly, it has to wait for disk buffers to flush before it can put more of the recently downloaded data on disk.

The route to performance is the select() facility, or an equivalent system call. The basic process is to open a number of sockets (for the web crawler downloads) and file handles (for storing them to disk). Next, you set all of the sockets and file handles to non-blocking mode, meaning that instead of making your program wait until data is available to read after issuing a request, a read returns right away with a special code (usually EAGAIN) to indicate that no data is ready. If you looped through all of the sockets in this way you would be polling, which works, but still wastes CPU resources because your reads and writes will almost always return EAGAIN.

To get around this, all of the sockets and file handles are collected into an 'fd_set', which is passed to the select system call. Your program then blocks, waiting on ANY of them, and is awakened when there's data to process on any of the streams.
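
A minimal sketch of that loop using Python's `select` module, which wraps the same system call. The helper name `read_all_ready` is made up, and for brevity it reads a single chunk from each socket; a real crawler would keep reading each socket until end of stream.

```python
import select
import socket

def read_all_ready(sockets, timeout=1.0):
    """Block until a socket is readable, then read one chunk from each ready one."""
    pending = set(sockets)
    received = {}
    while pending:
        # select() sleeps until ANY socket in the list has data,
        # or until the timeout expires.
        readable, _, _ = select.select(list(pending), [], [], timeout)
        if not readable:
            break                      # timed out: nothing more is coming
        for sock in readable:
            received[sock] = sock.recv(4096)
            pending.discard(sock)
    return received
```

For instance, passing it the receiving ends of two `socket.socketpair()` connections returns whatever bytes were sent on each, and the process sleeps inside `select()` instead of spinning on reads that would return EAGAIN.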


The other common case, compute-bound work, is without a doubt best addressed with some sort of true parallelism (as opposed to the asynchronous concurrency presented above) to access the resources of multiple CPUs. If your CPU-bound task is running on a single-threaded architecture, definitely avoid any concurrency, as the overhead will actually slow your task down.

TokenMacGuy
+1  A: 

There are some good libraries out there.

java.util.concurrent.ExecutorCompletionService will take tasks which return values (Callables), run them in background threads, then bung the resulting Futures in a Queue for you to process further as they complete. Of course, this requires Java 5 or later, so it isn't available everywhere.

In other words, all your code is single-threaded - but where you can identify stuff safe to run in parallel, you can farm it out to a suitable library.

Point is, if you can make the tasks independent, then thread safety isn't impossible to achieve with a little thought - though it is strongly recommended you leave the complicated bit (like implementing the ExecutorCompletionService) to an expert...
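
The same "process results as they complete" pattern exists outside Java. As an analogue only (this is not the ExecutorCompletionService API), Python's standard `concurrent.futures.as_completed` can be sketched like this, with `slow_square` as a hypothetical task:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow_square(n):
    """Hypothetical task: larger inputs take longer to finish."""
    time.sleep(n * 0.05)
    return n * n

def squares_in_completion_order(numbers):
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(slow_square, n) for n in numbers]
        # as_completed yields each future as soon as its task finishes,
        # so the calling thread handles results one at a time.
        return [f.result() for f in as_completed(futures)]
```

The calling code stays single-threaded: it submits independent tasks and then consumes finished results in a plain loop, leaving the thread management to the library.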

Bill Michell