
As someone in the world of HPC who came from the world of enterprise web development, I'm always curious to see how developers back in the "real world" are taking advantage of parallel computing. This is much more relevant now that all chips are going multicore, and it'll be even more relevant when there are thousands of cores on a chip instead of just a few.

My questions are:

  1. How does this affect your software roadmap?
  2. I'm particularly interested in real stories about how multicore is affecting different software domains, so specify what kind of development you do in your answer (e.g. server-side, client-side apps, scientific computing, etc.).
  3. What are you doing with your existing code to take advantage of multicore machines, and what challenges have you faced? Are you using OpenMP, Erlang, Haskell, CUDA, TBB, UPC or something else?
  4. What do you plan to do as concurrency levels continue to increase, and how will you deal with hundreds or thousands of cores?
  5. If your domain doesn't easily benefit from parallel computation, then explaining why is interesting, too.

Finally, I've framed this as a multicore question, but feel free to talk about other types of parallel computing. If you're porting part of your app to use MapReduce, or if MPI on large clusters is the paradigm for you, then definitely mention that, too.

Update: If you do answer #5, mention whether you think things will change once there are more cores (100, 1000, etc.) than you can feed with available memory bandwidth (seeing as how bandwidth per core keeps shrinking). Can you still use the remaining cores for your application?

+6  A: 
Dmitri Nesteruk
I really like #4 here.
tgamblin
About "decent support in .NET": check out PLINQ, available today as a CTP and coming in the .net 4.0 release.
Mauricio Scheffer
A: 

Learning Haskell.

+6  A: 

Hi,

I work in medical imaging and image processing.

We're handling multiple cores in much the same way we handled single cores: we already have multiple threads in the applications we write in order to keep the UI responsive.

However, because we can now, we're taking a serious look at implementing most of our image processing operations in either CUDA or OpenMP. The Intel compiler provides a lot of good sample code for OpenMP, is a much more mature product than CUDA, and has a much larger installed base, so we're probably going to go with that.

What we tend to do for expensive (i.e., more than a second) operations is to fork that operation off into another process, if we can. That way, the main UI remains responsive. If we can't, or it's just far too inconvenient or slow to move that much memory around, the operation still runs in a thread, and that operation can itself spawn multiple threads.

The key for us is to make sure that we don't hit concurrency bottlenecks. We develop in .NET, which means that UI updates have to be marshalled back to the main thread via an Invoke call.
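
For anyone who hasn't wrestled with that before, here is a minimal sketch (the form and names are hypothetical, not mmr's code) of doing expensive work on a worker thread and marshalling the result back to the UI thread with Invoke:

    using System;
    using System.Threading;
    using System.Windows.Forms;

    public class ProcessingForm : Form
    {
        private readonly Label statusLabel = new Label { Dock = DockStyle.Top, Text = "Working..." };

        public ProcessingForm()
        {
            Controls.Add(statusLabel);
        }

        protected override void OnLoad(EventArgs e)
        {
            base.OnLoad(e);
            // Do the expensive work on a background thread so the UI stays responsive.
            new Thread(RunExpensiveOperation) { IsBackground = true }.Start();
        }

        private void RunExpensiveOperation()
        {
            Thread.Sleep(2000); // stand-in for a long image-processing step

            // Controls may only be touched from the thread that created them,
            // so marshal the update back to the UI thread with Invoke.
            statusLabel.Invoke((MethodInvoker)(() => statusLabel.Text = "Done"));
        }

        [STAThread]
        static void Main()
        {
            Application.Run(new ProcessingForm());
        }
    }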

Maybe I'm lazy, but really, I don't want to have to spend too much time figuring a lot of this stuff out when it comes to parallelizing things like matrix inversions and the like. A lot of really smart people have spent a lot of time making that stuff fast like nitrous, and I just want to take what they've done and call it. Something like CUDA has an interesting interface for image processing (of course, that's what it's designed for), but it's still too immature for that kind of plug-and-play programming. If I or another developer gets a lot of spare time, we might give it a try. So instead, we'll just go with OpenMP to make our processing faster (and that's definitely on the development roadmap for the next few months).

mmr
Thanks for the nice answer. Have you taken a look at the latest Portland Group Compilers? It's just a preview right now, but they've got preliminary support for automatic acceleration using CUDA: http://www.pgroup.com/resources/accel.htm
tgamblin
That looks very interesting. I'm on Windows, but if the compiler can be ported, then I'd definitely be down.
mmr
I believe they do come for Windows -- PGI is included in this: http://www.microsoft.com/hpc/en/us/developer-resources.aspx, though it only mentions Fortran. But PGI's website mentions 8.0 coming for Windows here: http://www.pgroup.com/support/install.htm#win_info. I have not tried this, though.
tgamblin
+4  A: 

I'm in image processing. We're taking advantage of multicore where possible by processing images in slices doled out to different threads.
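
To make that concrete, a minimal sketch (hypothetical names, with a trivial invert filter standing in for real processing): split the rows into one slice per core and give each slice its own thread.

    using System;
    using System.Threading;

    class SliceProcessor
    {
        // Process a grayscale image stored as one byte per pixel, row-major.
        static void InvertInParallel(byte[] pixels, int width, int height)
        {
            int sliceCount = Environment.ProcessorCount;
            int rowsPerSlice = (height + sliceCount - 1) / sliceCount;
            var threads = new Thread[sliceCount];

            for (int s = 0; s < sliceCount; s++)
            {
                int firstRow = s * rowsPerSlice;
                int lastRow = Math.Min(firstRow + rowsPerSlice, height);
                threads[s] = new Thread(() =>
                {
                    // Each thread touches a disjoint range of rows, so no locking is needed.
                    for (int y = firstRow; y < lastRow; y++)
                        for (int x = 0; x < width; x++)
                            pixels[y * width + x] = (byte)(255 - pixels[y * width + x]);
                });
                threads[s].Start();
            }
            foreach (var t in threads) t.Join();
        }

        static void Main()
        {
            var image = new byte[640 * 480];
            InvertInParallel(image, 640, 480);
            Console.WriteLine("done");
        }
    }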

plinth
hey! i've got a similar problem right now, mind taking a look? :) http://stackoverflow.com/questions/973608/fast-interleaving-of-data
moogs
I did this too for a similar application, splitting the image into a number of chunks equal to the number of cores available. On a dual-core machine I gained a 15% performance boost by splitting the image in half and using a thread for each half to do the work.
Andrei Vajna II
@Andrei - There is an example application in the book "C# 2008 and 2005 Threaded Programming" that does exactly the same thing. It may be a good reference to compare against your solution.
Dave M
+2  A: 

My graduate work is in developing concepts for doing bare-metal multicore work & teaching same in embedded systems.

I'm also working a bit with F# to bring my facility with high-level, multiprocess-able languages up to speed.

Paul Nathan
+6  A: 

I'm developing ASP.NET web applications. There is little opportunity to use multicore directly in my code; however, IIS already scales well across multiple cores/CPUs by spawning multiple worker threads/processes when under load.

Vilx-
+16  A: 

For web applications it's very, very easy: ignore it. Unless you've got some code that really begs to be done in parallel you can simply write old-style single-threaded code and be happy.

You usually have a lot more requests to handle at any given moment than you have cores. And since each one is handled in its own thread (or even process, depending on your technology), this is already working in parallel.

The only place you need to be careful is when accessing some kind of global state that requires synchronization. Keep that to a minimum to avoid introducing artificial bottlenecks to an otherwise (almost) perfectly scalable world.
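
A tiny illustration of that point (my example, not the answerer's code): if several request threads do share one piece of global state, keep the critical section as small as possible, or skip the lock entirely where an atomic operation will do.

    using System.Threading;

    static class RequestStats
    {
        static long hits;                            // shared, mutable state
        static readonly object gate = new object();

        // Fine: the critical section is a single increment.
        public static void RecordHitLocked() { lock (gate) { hits++; } }

        // Better: no lock at all for a simple counter.
        public static void RecordHit() { Interlocked.Increment(ref hits); }

        public static long Hits { get { return Interlocked.Read(ref hits); } }
    }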

So for me multi-core basically boils down to these items:

  • My servers have fewer "CPUs" while each one sports more cores (not much of a difference to me)
  • The same number of CPUs can sustain a larger number of concurrent users
  • When there seems to be a performance bottleneck that isn't the result of the CPU being 100% loaded, that's an indication that I'm doing some bad synchronization somewhere.
Joachim Sauer
Good answer. How about the long-term scalability question? Do you anticipate having to change any of this if you start getting more cores on a chip than you can feed? With 1000 cores, you might not have the memory bandwidth for all those requests. Can you still use the rest of the cores?
tgamblin
In the area I mostly work in (web applications that are mostly database bound, with the occasional bit of logic) I don't expect that I'll need to change this in the foreseeable future (but such predictions have been known to be wrong), since their main bottleneck is usually the DB and nothing else.
Joachim Sauer
That being said, there are parts (batch processing, the rare CPU bound part) where writing good multi-threaded code can definitely help and here I face pretty much the same problems/solutions as everyone else.
Joachim Sauer
It's important to note that Apache doesn't even use threading, internally. It simply spawns new processes to handle the additional requests.
Nolte Burke
Nolte: whether you're using a thread per request or a process doesn't really matter in this context. The idea is the same.
Joachim Sauer
Actually, the bit about Apache not using threads is outdated at this point.
BobbyShaftoe
+1  A: 

Our domain logic is based heavily on a workflow engine and each workflow instance runs off the ThreadPool.

That's good enough for us.
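
For readers who haven't used the pattern, a minimal sketch (WorkflowInstance is a made-up stand-in, not NathanE's engine) of running each instance off the ThreadPool:

    using System;
    using System.Threading;

    class WorkflowInstance
    {
        public void Run(object state)
        {
            Console.WriteLine("workflow running on pool thread " + Thread.CurrentThread.ManagedThreadId);
        }
    }

    class Program
    {
        static void Main()
        {
            for (int i = 0; i < 10; i++)
                ThreadPool.QueueUserWorkItem(new WorkflowInstance().Run);
            Thread.Sleep(1000); // crude wait so the pool threads get to finish before exit
        }
    }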

NathanE
+1  A: 

I can now separate my main operating system from my development / "install whatever I like" OS using virtualization setups with Virtual PC or VMware.

Dual core means that one core runs my host OS and the other runs my development OS with a decent level of performance.

Richard Ev
+24  A: 

My research work includes work on compilers and on spam filtering. I also do a lot of 'personal productivity' Unix stuff. Plus I write and use software to administer classes that I teach, which includes grading, testing student code, tracking grades, and myriad other trivia.

  1. Multicore affects me not at all except as a research problem for compilers to support other applications. But those problems lie primarily in the run-time system, not the compiler.
  2. At great trouble and expense, Dave Wortman showed around 1990 that you could parallelize a compiler to keep four processors busy. Nobody I know has ever repeated the experiment. Most compilers are fast enough to run single-threaded. And it's much easier to run your sequential compiler on several different source files in parallel than it is to make your compiler itself parallel. For spam filtering, learning is an inherently sequential process. And even an older machine can learn hundreds of messages a second, so even a large corpus can be learned in under a minute. Again, training is fast enough.
  3. The only significant way I have of exploiting parallel machines is using parallel make. It is a great boon, and big builds are easy to parallelize. Make does almost all the work automatically. The only other thing I can remember is using parallelism to time long-running student code by farming it out to a bunch of lab machines, which I could do in good conscience because I was only clobbering a single core per machine, so using only 1/4 of CPU resources. Oh, and I wrote a Lua script that will use all 4 cores when ripping MP3 files with lame. That script was a lot of work to get right.
  4. I will ignore tens, hundreds, and thousands of cores. The first time I was told "parallel machines are coming; you must get ready" was 1984. It was true then and is true today that parallel programming is a domain for highly skilled specialists. The only thing that has changed is that today manufacturers are forcing us to pay for parallel hardware whether we want it or not. But just because the hardware is paid for doesn't mean it's free to use. The programming models are awful, and making the thread/mutex model work, let alone perform well, is an expensive job even if the hardware is free. I expect most programmers to ignore parallelism and quietly get on about their business. When a skilled specialist comes along with a parallel make or a great computer game, I will quietly applaud and make use of their efforts. If I want performance for my own apps I will concentrate on reducing memory allocations and ignore parallelism.
  5. Parallelism is really hard. Most domains are hard to parallelize. A widely reusable exception like parallel make is cause for much rejoicing.

Summary (which I heard from a keynote speaker who works for a leading CPU manufacturer): the industry backed into multicore because they couldn't keep making machines run faster and hotter and they didn't know what to do with the extra transistors. Now they're desperate to find a way to make multicore profitable because if they don't have profits, they can't build the next generation of fab lines. The gravy train is over, and we might actually have to start paying attention to software costs.

Many people who are serious about parallelism are ignoring these toy 4-core or even 32-core machines in favor of GPUs with 128 processors or more. My guess is that the real action is going to be there.

Norman Ramsey
I don't think that *purposely* ignoring parallelism is a good approach, especially when it's pretty clear that the trend is more and more cores. Also, programming models are getting easier, for example with PLINQ and Intel's Parallel Studio.
Mauricio Scheffer
Over the years I have saved hundreds if not thousands of hours by ignoring parallelism. Parallelism exists to serve me; not the other way around. Last month when I had to test 30 long-running student programs I happily used 30 cores spread over 15 machines, but that was a rare event.
Norman Ramsey
+5  A: 

So far, nothing more than more efficient compilation with make:

    gmake -j

The -j option allows tasks that don't depend on one another to run in parallel.

Nathan Fellman
+1  A: 

Learning a functional programming language might let you use multiple cores... but it's costly.

I think it's not really hard to use extra cores. There are some trivial cases, such as web apps, that don't need any extra care because the web server already does its work running queries in parallel. The questions are about long-running algorithms (long is whatever you call long). These need to be split over smaller domains that don't depend on each other, or the dependencies need to be synchronized. A lot of algorithms can do this, but sometimes horribly different implementations are needed (costs again).

So, no silver bullet as long as you are using imperative programming languages, sorry. Either you need skilled programmers (costly) or you need to turn to another programming language (costly). Or you may simply be lucky (web).

Szundi
A: 

I work in C# with .NET threads. You can combine object-oriented encapsulation with thread management.

I've read some posts from Peter talking about a new book from Packt Publishing, and I've found the following article on the Packt Publishing web page:

http://www.packtpub.com/article/simplifying-parallelism-complexity-c-sharp

I've read Concurrent Programming on Windows, Joe Duffy's book. Now, I am waiting for "C# 2008 and 2005 Threaded Programming", Hillar's book - http://www.amazon.com/2008-2005-Threaded-Programming-Beginners/dp/1847197108/ref=pd_rhf_p_t_2

I agree with Szundi "No silver bullet"!

A: 

Dear Saua,

You say "For web applications it's very, very easy: ignore it. Unless you've got some code that really begs to be done in parallel you can simply write old-style single-threaded code and be happy."

I am working with web applications and I do need to take full advantage of parallelism. I understand your point. However, we must prepare for the multicore revolution. Ignoring it is the same as ignoring the GUI revolution in the '90s.

We aren't still developing for DOS, are we? We must tackle multicore or in a few years we'll be dead.

+2  A: 

We create the VivaMP code analyzer for detecting errors in parallel OpenMP programs.

VivaMP is a lint-like static C/C++ code analyzer meant to indicate errors in parallel programs based on OpenMP technology. The VivaMP static analyzer adds much to the abilities of existing compilers and diagnoses parallel code that contains errors or is a likely source of such errors. The analyzer is integrated into the Visual Studio 2005/2008 development environment.

VivaMP – a tool for OpenMP

32 OpenMP Traps For C++ Developers

+2  A: 

I said some of this in answer to a different question (hope this is OK!): there is a concept/methodology called Flow-Based Programming (FBP) that has been around for over 30 years, and is being used to handle most of the batch processing at a major Canadian bank. It has thread-based implementations in Java and C#, although earlier implementations were fiber-based (C++ and mainframe Assembler). Most approaches to the problem of taking advantage of multicore involve trying to take a conventional single-threaded program and figure out which parts can run in parallel. FBP takes a different approach: the application is designed from the start in terms of multiple "black-box" components running asynchronously (think of a manufacturing assembly line). Since the interface between components is data streams, FBP is essentially language-independent, and therefore supports mixed-language applications, and domain-specific languages. Applications written this way have been found to be much more maintainable than conventional, single-threaded applications, and often take less elapsed time, even on single-core machines.
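
This isn't FBP itself, but you can get a rough feel for the component-and-stream idea with .NET 4 blocking queues (a hypothetical sketch, not Paul's framework): each "black box" reads from an input stream, writes to an output stream, and knows nothing about its neighbors.

    using System;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    class FlowSketch
    {
        static void Main()
        {
            var lines = new BlockingCollection<string>(16);   // bounded streams between components
            var upper = new BlockingCollection<string>(16);

            // Component 1: produce records onto its output stream.
            var read = Task.Factory.StartNew(() =>
            {
                for (int i = 0; i < 100; i++) lines.Add("record " + i);
                lines.CompleteAdding();
            });

            // Component 2: transform records; knows nothing about components 1 or 3.
            var transform = Task.Factory.StartNew(() =>
            {
                foreach (var line in lines.GetConsumingEnumerable()) upper.Add(line.ToUpper());
                upper.CompleteAdding();
            });

            // Component 3: consume the final stream.
            var write = Task.Factory.StartNew(() =>
            {
                foreach (var line in upper.GetConsumingEnumerable()) Console.WriteLine(line);
            });

            Task.WaitAll(read, transform, write);
        }
    }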

Paul Morrison
+2  A: 

I believe that "Cycles are an engineers' best friend".

My company provides a commercial tool for analyzing and transforming very large software systems in many computer languages. "Large" means 10-30 million lines of code. The tool is the DMS Software Reengineering Toolkit (DMS for short).

Analyses (and even transformations) on such huge systems take a long time: our points-to analyzer for C code takes 90 CPU hours on an x86-64 with 16 GB RAM. Engineers want answers faster than that.

Consequently, we implemented DMS in PARLANSE, a parallel programming language of our own design, intended to harness small-scale multicore shared-memory systems. See http://www.semdesigns.com/products/parlanse/index.html The key ideas behind PARLANSE are: a) let the programmer expose parallelism, b) let the compiler choose which parts it can realize, c) keep the context switching to an absolute minimum. Static partial orders over computations are an easy way to help achieve all 3: easy to say, relatively easy to measure costs, easy for the compiler to schedule computations. (Writing parallel quicksort with this is trivial.)
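
PARLANSE isn't something most readers can try, but the "programmer exposes the parallelism, the runtime schedules it" flavor of a parallel quicksort looks roughly like this with .NET 4 tasks (my sketch, not PARLANSE or DMS code):

    using System;
    using System.Threading.Tasks;

    static class ParallelSortDemo
    {
        // Sort a[lo..hi] in place; recurse on the two halves in parallel when they are big enough.
        static void Quicksort(int[] a, int lo, int hi)
        {
            if (hi <= lo) return;
            int p = Partition(a, lo, hi);
            if (hi - lo > 4096)
                Parallel.Invoke(() => Quicksort(a, lo, p - 1),
                                () => Quicksort(a, p + 1, hi));
            else
            {
                Quicksort(a, lo, p - 1);   // small ranges: stay sequential to limit overhead
                Quicksort(a, p + 1, hi);
            }
        }

        static int Partition(int[] a, int lo, int hi)
        {
            int pivot = a[hi], i = lo;
            for (int j = lo; j < hi; j++)
                if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
            int tmp = a[i]; a[i] = a[hi]; a[hi] = tmp;
            return i;
        }

        static void Main()
        {
            var data = new int[100000];
            var rng = new Random(42);
            for (int i = 0; i < data.Length; i++) data[i] = rng.Next();
            Quicksort(data, 0, data.Length - 1);

            bool sorted = true;
            for (int i = 1; i < data.Length; i++) sorted &= data[i - 1] <= data[i];
            Console.WriteLine("sorted: " + sorted);
        }
    }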

Unfortunately, we did this in 1996 :-( The last few years have finally been a vindication; I can now get 8 core machines at Fry's for under $1K and 24 core machines for about the same price as a small car (and likely to drop rapidly).

The good news is that DMS is now fairly mature, and there are a number of key internal mechanisms in DMS which take advantage of this, notably an entire class of analyzers called "attribute grammars", which we write using a domain-specific language which is NOT PARLANSE. DMS compiles these attribute grammars into PARLANSE and then they are executed in parallel. Our C++ front end uses attribute grammars, and is about 100K SLOC; it is compiled into 800K SLOC of parallel PARLANSE code that actually works reliably.

Now, we are pretty busy making DMS useful, and don't always have enough time to harness the parallelism well. Thus the 90 hour points-to analysis. We are working on parallelizing that, and have reasonable hope of 10-20x speedup.

We believe that in the long run, harnessing SMP well will make workstations far more friendly to engineers asking hard questions. As well they should.

Ira Baxter
A: 

I think this trend will first persuade some developers, and then most of them will see that parallelization is a really complex task. I expect some design patterns to emerge to take care of this complexity. Not low-level ones but architectural patterns that will make it hard to do something wrong.

For example, I expect messaging patterns to gain popularity, because messaging is inherently asynchronous and you don't have to think about deadlocks, mutexes, or whatever.

Nicolas Dorier
+1  A: 

I'm using and programming on a Mac. Grand Central Dispatch for the win. The Ars Technica review of Snow Leopard has a lot of interesting things to say about multicore programming and where people (or at least Apple) are going with it.

Shea Daniels
+1  A: 

I've decided to take advantage of multiple cores in an implementation of the DEFLATE algorithm. Mark Adler did something similar in C code with PIGZ (parallel gzip). I've delivered the philosophical equivalent, but in a managed code library, in DotNetZip v1.9. This is not a port of PIGZ, but a similar idea, implemented independently.

The idea behind DEFLATE is to scan a block of data, look for repeated sequences, build a "dictionary" that maps a short "code" to each of those repeated sequences, then emit a byte stream where each instance of one of the repeated sequences is replaced by a "code" from the dictionary.

Because building the dictionary is CPU intensive, DEFLATE is a perfect candidate for parallelization. I've taken a Map+Reduce type approach, where I divide the incoming uncompressed byte stream into a set of smaller blocks (map), say 64k each, and then compress those independently. Then I concatenate the resulting blocks together (reduce). Each 64k block is compressed independently, on its own thread, without regard for the other blocks.
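
To make the map+reduce shape concrete, here is a stripped-down sketch (not the DotNetZip implementation; it compresses each 64K slice with DeflateStream and keeps the results as separate blocks, since stitching them into one valid DEFLATE stream needs extra flush handling):

    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Threading.Tasks;

    static class ChunkedDeflate
    {
        const int SliceSize = 64 * 1024;

        // "Map": compress each 64K slice independently; "reduce": collect the results in order.
        public static byte[][] CompressSlices(byte[] input)
        {
            int sliceCount = (input.Length + SliceSize - 1) / SliceSize;
            var compressed = new byte[sliceCount][];

            Parallel.For(0, sliceCount, i =>
            {
                int offset = i * SliceSize;
                int count = Math.Min(SliceSize, input.Length - offset);
                using (var ms = new MemoryStream())
                {
                    using (var deflate = new DeflateStream(ms, CompressionMode.Compress, true))
                        deflate.Write(input, offset, count);
                    compressed[i] = ms.ToArray();   // each thread writes only its own slot
                }
            });
            return compressed;
        }

        static void Main()
        {
            var data = new byte[1024 * 1024];
            new Random(1).NextBytes(data);
            var slices = CompressSlices(data);
            Console.WriteLine(slices.Length + " compressed slices");
        }
    }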

On a dual-core machine, this approach compresses in about 54% of the time of the traditional serial approach. On server-class machines, with more cores available, it can potentially deliver even better results; having no server machine, I haven't tested it personally, but people tell me it's fast.


There's runtime (CPU) overhead associated with the management of multiple threads, runtime memory overhead associated with the buffers for each thread, and data overhead associated with concatenating the blocks. So this approach pays off only for larger byte streams. In my tests, above 512k, it can pay off. Below that, it is better to use a serial approach.


DotNetZip is delivered as a library. My goal was to make all of this transparent. So the library automatically uses the extra threads when the buffer is above 512k. There's nothing the application has to do in order to use threads. It just works, and when threads are used, it's magically faster. I think this is a reasonable approach to take for most libraries being consumed by applications.


It would be nice for the computer to be smart about automatically and dynamically exploiting resources on parallelizable algorithms, but the reality today is that app designers have to explicitly code the parallelization in.


Cheeso
A: 
  1. How does this affect your software roadmap?
    It doesn't. Our business-related apps (like almost all others) run perfectly well on a single core. As long as adding more cores doesn't significantly reduce the performance of single-threaded apps, we're happy.

  2. ...real stories...
    Like everyone else, parallel builds are the main benefit we get. The Visual Studio 2008 C# compiler doesn't seem to use more than one core though, which really sucks

  3. What are you doing with your existing code to take advantage of multicore machines
    We may look into using the .NET parallel extensions if we ever have a long-running algorithm that can be parallelized, but the odds of this actually occurring are slim. The most likely answer is that some of the developers will play around with it for interest's sake, but not much else

  4. how will you deal with hundreds or thousands of cores?
    Head -> Sand.

  5. If your domain doesn't easily benefit from parallel computation, then explaining why is interesting, too.
    The client app mostly pushes data around; the server app mostly relies on SQL Server to do the heavy lifting.

Orion Edwards
+2  A: 

We're having a lot of success with task parallelism in .NET 4 using F#. Our customers are crying out for multicore support because they don't want their n-1 cores idle!

Jon Harrop
A: 

Weighing in on our projects:

Form Filling

We have a VB.NET forms automation app that asks the user questions from a wizard-style (not form-picture or PDF style) GUI. When the user answers a question that adds a form, a blank bitmap object of the form is created in the background from a library of TIFFs and kept in memory. When it's time to apply electronic signatures, the forms - sometimes 30+ pages worth - are made by doing a fast clone of the blank bitmaps, filling in the fields, and displaying them. This process uses a couple of threads per CPU and the result is fantastic! Multithreading (along with several other significant caching techniques) allows a case that took 3 minutes to build unoptimized to be built in about 20 seconds, after less than 2 weeks of effort. I'd never sped up anything (mature) that well before.

When just the data is uploaded, the whole process is repeated on our imaging servers in batch mode, and using only about 70% of the total processing power of an 8-core server (circa 2007, not sure what it is), we can write out about 10 pages per second to TIFFs. That adds up fast!

This actually works using several processes, each with several threads. Locks needed to keep .NET's GDI+ from freaking out caused a limit per process.

Full-text smart search

One thread does the I/O, then feeds the data to several queues: decompress, decrypt, parse XML, execute the 'smart search' criteria (code that examines large cases saved as XML), and save hits. Each queue has 2 threads. Speed more than tripled (now it is bound by I/O).

There are other examples I've tried; these are the most dramatic.

Development tools

I am surprised how easy it is to use SyncLock in VB coupled with a queue (just make a generic list of objects and SyncLock it whenever you access it in any way (add/remove/status check), making sure the worker thread completely pulls off its reference and unlocks before processing) and just start new threads to handle a slice of the work. There is little need for thread pools, BackgroundWorker, etc.; those features of .NET save some time, but really only a little over the basic .NET functionality.
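
In C# terms (lock is the equivalent of VB's SyncLock), the pattern described above looks roughly like this hypothetical sketch: a locked list as the queue, and one worker thread per core pulling slices of work off it.

    using System;
    using System.Collections.Generic;
    using System.Threading;

    class WorkQueue
    {
        readonly List<string> items = new List<string>();
        readonly object gate = new object();

        public void Add(string item) { lock (gate) items.Add(item); }

        // Pull one item out while holding the lock, then release it before doing the work.
        public string Take()
        {
            lock (gate)
            {
                if (items.Count == 0) return null;
                string item = items[0];
                items.RemoveAt(0);
                return item;
            }
        }

        static void Main()
        {
            var queue = new WorkQueue();
            for (int i = 0; i < 100; i++) queue.Add("job " + i);

            var workers = new Thread[Environment.ProcessorCount];
            for (int w = 0; w < workers.Length; w++)
            {
                workers[w] = new Thread(() =>
                {
                    string job;
                    while ((job = queue.Take()) != null)
                        Console.WriteLine(Thread.CurrentThread.ManagedThreadId + " handled " + job);
                });
                workers[w].Start();
            }
            foreach (var t in workers) t.Join();
        }
    }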

FastAl