ansaurus

Question

Recommended language for multithreaded data work

Answer 1

+5 A:

It is possible to do this in Python using the multiprocessing module. It spawns multiple processes instead of threads; each process has its own interpreter and its own GIL, so CPU-bound work can actually run in parallel.

That is not to say that Python is the 'best' language for this job; that's a subjective point which can be argued over. But it is certainly capable of it.
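As a rough illustration of the multiprocessing approach, here is a minimal sketch that spreads a CPU-bound function over a pool of worker processes. The function `cpu_bound`, its inputs, and the pool size of 4 are made up for the example; they are not from the question.

```python
from multiprocessing import Pool


def cpu_bound(n):
    """Hypothetical CPU-heavy task: sum of the squares of 0..n-1."""
    return sum(i * i for i in range(n))


if __name__ == '__main__':
    # Each worker is a separate process with its own interpreter and GIL,
    # so the calls below can run on different cores at the same time.
    with Pool(processes=4) as pool:
        results = pool.map(cpu_bound, [10, 100, 1000])
    print(results)
```

The `if __name__ == '__main__'` guard matters here: on platforms that start workers by re-importing the main module, unguarded top-level code would run again in every worker.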

EDIT: Yes, there are several ways to share data between processes. Pipes are the simplest; they are sort-of file-like handles which one process can write to and then another can read from. Straight from the docs:

from multiprocessing import Process, Pipe

def f(conn):
    conn.send([42, None, 'hello'])
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    print(parent_conn.recv())   # prints "[42, None, 'hello']"
    p.join()

You could for instance have one process performing the first step and sending the results down a pipe to another process for the second step.
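The two-step arrangement described above could look roughly like the sketch below. The step functions, the `None` sentinel, and the helper names (`producer`, `consumer`, `run_pipeline`) are assumptions made for illustration; the real steps would replace `first_step` and `second_step`.

```python
from multiprocessing import Process, Pipe

SENTINEL = None  # hypothetical end-of-stream marker; assumes no real result is None


def first_step(x):
    """Placeholder for the first computation (hypothetical)."""
    return x * x


def second_step(y):
    """Placeholder for the second computation (hypothetical)."""
    return y + 1


def producer(conn, items):
    # Run the first step on each item and send every result down the pipe.
    for item in items:
        conn.send(first_step(item))
    conn.send(SENTINEL)  # tell the consumer there is nothing more to read
    conn.close()


def consumer(conn, result_conn):
    # Read first-step results until the sentinel arrives, applying the second step.
    results = []
    while True:
        value = conn.recv()
        if value is SENTINEL:
            break
        results.append(second_step(value))
    result_conn.send(results)  # hand the finished list back to the parent
    result_conn.close()
    conn.close()


def run_pipeline(items):
    # duplex=False gives a one-way pipe: (receive end, send end).
    stage_recv, stage_send = Pipe(duplex=False)
    result_recv, result_send = Pipe(duplex=False)
    p1 = Process(target=producer, args=(stage_send, items))
    p2 = Process(target=consumer, args=(stage_recv, result_send))
    p1.start()
    p2.start()
    results = result_recv.recv()  # read before joining so the consumer never blocks on send
    p1.join()
    p2.join()
    return results


if __name__ == '__main__':
    print(run_pipeline([1, 2, 3]))
```

Because the consumer starts reading as soon as the first result arrives, the second step overlaps with the first instead of waiting for it to finish. Note that this passes copies of the data between processes; it does not give them shared mutable state.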

katrielalex 2010-08-17 22:27:36

But does this allow me to share state data in the way that I described? I was under the impression that this was the drawback to the multiprocessing module/GIL.

thebackhand 2010-08-17 22:29:58

@thebackhand: yes. See edit.

katrielalex 2010-08-17 22:40:37

Answer 2

A:

If I remember correctly (but I might be wrong here) one of the main purposes of Ada95 was parallel processing. Funny language, that was.

Jokes aside, I'm not sure how good Java would be performance-wise (though since you are using Python now, it shouldn't be that bad), but I'd suggest it because the basics of multithreading are quite simple there (writing a well-designed, complex multithreaded application is still rather hard). I've heard the concurrency library (java.util.concurrent) is also quite nice, though I haven't tried it myself yet.

Zenzen 2010-08-17 22:31:56

Answer 3

+1 A:

On the Python side, your best bet is probably to separate the two steps into two different processes. There are a couple of modules that help you achieve that. You would couple the two processes through pipes. In order to pass arbitrary data through a pipe, you need to serialize it on one end and deserialize it on the other. The pickle module would be a good candidate for this.
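A minimal sketch of the serialization step, assuming the pipe is a multiprocessing connection. The helper names `send_record` and `recv_record` and the sample record are hypothetical. For brevity both ends of the pipe are used in one process here; in practice each end would live in a different process.

```python
import pickle
from multiprocessing import Pipe


def send_record(conn, record):
    # Serialize the object to bytes explicitly, then ship the raw bytes.
    conn.send_bytes(pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL))


def recv_record(conn):
    # Read one message and rebuild the original object from its bytes.
    return pickle.loads(conn.recv_bytes())


if __name__ == '__main__':
    recv_end, send_end = Pipe(duplex=False)  # one-way pipe: (receive end, send end)
    record = {'step': 1, 'values': [1.5, 2.5], 'label': 'hello'}  # made-up sample data
    send_record(send_end, record)
    print(recv_record(recv_end))
```

With multiprocessing connections the explicit calls are optional, since `conn.send(obj)` and `conn.recv()` pickle and unpickle for you; the explicit form is mainly useful with other kinds of pipes. Either way, only unpickle data from a source you trust.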

If you want to jump ship, languages like Erlang, Haskell, Scala or Clojure probably have the concurrency features you are looking for, but I don't know how well they would integrate with R or some other statistical package that suits you.

ThomasH 2010-08-17 22:35:24

Answer 4

+4 A:

It is pretty easy to do multiprocessing with R (or at least no harder than in Python); check out the multicore package and the others listed here.

mbq 2010-08-17 22:40:24

Answer 5

+2 A:

I find that using R with the foreach package is a really easy way to use multithreading in your code. Use the doMC or the doMPI package as the parallel back-end if you have a Unix-like system or Windows, respectively. The vignette should get you going fairly quickly. This method is best suited to parallelizing for loops, and I find that using 7 of the 8 cores on my machine usually gives nearly a sixfold speed increase. I'm not sure that you can start a second process based on the results of the first, but it is worth a quick look.

Good luck. Sorry that I am a new user and can only post one link, or I would have linked all of the other pages as well.

Seth 2010-08-18 00:07:55

Answer 6

A:

My experiences with Java multi-threading have been very positive, although it does take a lot of getting used to. At the end of the day, the biggest problem wasn't the syntax or the Java features but the different mindset you need to develop multi-threaded code.

If you're using Eclipse, there's also a profiling suite that's very helpful for debugging and optimisation. It causes a rather big performance hit though :)

Michael Clerx 2010-08-18 01:01:17
