parallel-processing

how to efficiently apply a medium-weight function in parallel

I'm looking to map a modestly expensive function onto a large lazy seq in parallel. pmap is great, but I'm losing too much to context switching. I think I need to increase the size of the chunk of work that's passed to each thread. I wrote a function to break the seq into chunks, pmap the function onto each chunk, and recombine them...
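
The chunking approach described in this excerpt can be sketched roughly as follows. This is a Python illustration rather than Clojure, and it is not the asker's code: the names `chunked` and `chunked_parallel_map`, the default chunk size, and the worker count are all hypothetical. A thread pool keeps the sketch short; in CPython, CPU-bound work would likely need a process pool instead, and `executor.map` consumes its input eagerly, so the laziness of the original seq is not preserved here.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def chunked_parallel_map(func, iterable, chunk_size=1000, max_workers=4):
    """Apply `func` to every item, handing each worker a whole chunk at a time.

    Hypothetical helper: larger chunks mean fewer task hand-offs per item,
    at the cost of coarser load balancing.
    """
    def apply_to_chunk(chunk):
        return [func(item) for item in chunk]

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = []
        # executor.map returns chunk results in input order, so extending
        # the list chunk by chunk recombines them in the original order.
        for mapped in executor.map(apply_to_chunk, chunked(iterable, chunk_size)):
            results.extend(mapped)
        return results
```

The chunk size is the tuning knob: it trades scheduling overhead against how evenly the work is spread across workers.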

Is there a simple process-based parallel map for python?

I'm looking for a simple process-based parallel map for Python, that is, a function parmap(function,[data]) that would run function on each element of [data] on a different process (well, on a different core, but AFAIK, the only way to run stuff on different cores in Python is to start multiple interpreters), and return a list of resu...

How to parallelize execution on remote systems

What's a good method for assigning work to a set of remote machines? Consider an example where the task is very CPU and RAM intensive, but doesn't actually process a large dataset. The language of choice would be Java. I was thinking Hadoop would be a good option, but the dataset passed between remote machines is fairly small, and Had...

Optimal number of threads per core

Let's say I have a 4-core CPU, and I want to run some process in the minimum amount of time. The process is ideally parallelizable, so I can run chunks of it on an infinite number of threads and each thread takes the same amount of time. Since I have 4 cores, I don't expect any speedup by running more threads than cores, since the a sin...

minimum work size of a goroutine

Does anyone know approximately what the minimum work size is for a goroutine to be beneficial (assuming that there are free cores for the work to be offloaded to)? ...

How to use database server for distributed job scheduling?

I have around 100 computers and a few workers on each of them. They already connect to a central database to query for job parameters. Now I have to do job scheduling for them. One job for one worker takes a few minutes, doesn't require a network connection (except for dealing jobs and reporting) and can be done at any time in any order. Cons...

Difference between multithreaded and parallel programming

Is there a difference between multithreaded and parallel programming? If related (which I guess they are) is parallel programming the successor of multithreaded programming in terms of terminology? The reason I am asking is because I want to buy a book on one (maybe both) of the above. ...

Interlocked and Memory Barriers

I have a question about the following code sample (*m_value* isn't volatile, and every thread runs on a separate processor) void Foo() // executed by thread #1, BEFORE Bar() is executed { Interlocked.Exchange(ref m_value, 1); } bool Bar() // executed by thread #2, AFTER Foo() is executed { return m_value == 1; } Does using Inte...

Master-Slave Pattern for Distributed Environment

Hi, currently we have a batch-driven process at work which runs every 15 mins, and every time it runs it repeats this cycle several times: it calls a sproc and gets some data back from the DB, processes the data, and saves the result back to the DB. It can't load all the data in one go because the data are segregated by a number of fields and each...

C# OutOfMemoryException on MemoryStream writing

I have a little sample application I was working on trying to get some of the new .Net 4.0 Parallel Extensions going (they are very nice). I'm running into a (probably really stupid) problem with an OutOfMemoryException. My main app that I'm looking to plug this sample into reads some data and lots of files, does some processing on them,...

How to: Parallel Reduction of many unequally sized arrays in CUDA?

Hi there, I am wondering if anyone could suggest the best approach to computing the mean / standard deviation of a large number of relatively small but differently sized arrays in CUDA? The parallel reduction example in the SDK works on a single very large array and it seems the size is conveniently a multiple of the number of threads p...

C# Parallel Extensions Task.Factory.StartNew invokes method on wrong object

OK, playing around with the .NET 4.0 Parallel Extensions in System.Threading.Tasks. I'm finding what seems like weird behavior, but I assume I'm just doing something wrong. I have an interface and a couple of implementing classes; they're simple for this. interface IParallelPipe { void Process(ref BlockingCollection<Stream> stream, long...

concurrent write to same memory address

If two threads try to write to the same address at the same time, is the value after the concurrent write guaranteed to be one of the values that the threads tried to write? Or is it possible to get a combination of the bits? Also, is it possible for another thread to read the memory address while the bits are in an unstable state? I ...

Expert system for writing programs?

I am brainstorming an idea of developing high-level software to manipulate matrix algebra equations, tensor manipulations to be exact, to produce optimized C++ code using several criteria such as sizes of dimensions, available memory on the system, etc. Something which is similar in spirit to tensor contraction engine, TCE, but specif...

which sorting method is most suitable for parallel processing?

I am now looking at my old school assignment and want to find the solution to a question. Here is the question: Which sorting method is most suitable for parallel processing? Bubble sort, quick sort, merge sort, or selection sort? I guess quick sort (or merge sort?) is the answer. Am I correct? ...

Wait for all threads in an Executor to finish?

I'm implementing a parallel quicksort as programming practice, and after I finished, I read the Java tutorial page on Executors, which sound like they could make my code even faster. Unfortunately, I was relying on join() calls to make sure that the program doesn't continue until everything is sorted. Right now I'm using: public static void...

CUDA - Better Occupancy vs Less Global Memory Access?

Hey, my CUDA code must work with (reduce to mean/std, calculate histogram) 4 arrays, each 2048 floats long and already stored in the device memory from previous kernels. It is generally advised to launch at least as many blocks as I have multiprocessors. In this case however, I can load each of these arrays into the shared memory of a ...

Concurrency: how does shared memory vs message passing handle large data structures?

In looking at Go and Erlang's approach to concurrency, I noticed that they both rely on message passing. This approach obviously alleviates the need for complex locks because there is no shared state. However, consider the case of many clients wanting parallel read-only access to a single large data structure in memory -- like a suffix...

Intel has just unveiled a new 48 core CPU. What will this move to many cores imply for us programmers?

Intel has just unveiled a new 48-core CPU. More than just the number of cores, this new architecture seems to introduce a lot of interesting features, such as this one: Things get interesting here - Intel is saying that they have removed hardware cache coherency which effectively means each "tile" will be completely separate in what...

How to parallelize Sudoku solver using Grand Central Dispatch?

As a programming exercise, I just finished writing a Sudoku solver that uses the backtracking algorithm (see Wikipedia for a simple example written in C). To take this a step further, I would like to use Snow Leopard's GCD to parallelize this so that it runs on all of my machine's cores. Can someone give me pointers on how I should go a...