views: 171
answers: 4
In the code below, I am considering using multi-threading or multi-processing for fetching from the URL. I think pools would be ideal. Can anyone suggest a solution?

Idea: a pool of threads/processes that collects the data. My preference is processes over threads, but I'm not sure.

import urllib

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN','GGP', 'JPM', 'AIG', 'AMZN')
#symbols = ('GGP')

def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    data_fp = fetch_quote(symbols)
#    print data_fp
if __name__ =='__main__':
    main()
A: 

Actually it's possible to do it with neither. You can get it done in one thread using asynchronous calls, for example twisted.web.client.getPage from Twisted Web.
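A rough, untested sketch of that approach (it assumes an older Twisted release where `twisted.web.client.getPage` is still available, and reuses the URL and symbols from the question):

# Rough sketch (untested): all requests issued from a single thread using
# Twisted's asynchronous getPage; assumes twisted.web.client.getPage exists.
from twisted.internet import defer, reactor
from twisted.web.client import getPage

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN')

def print_results(results):
    # results is a list of (success, data) pairs, one per Deferred
    for success, data in results:
        if success:
            print data
    reactor.stop()

def fetch_all():
    deferreds = [getPage(URL % sym) for sym in symbols]
    defer.DeferredList(deferreds).addCallback(print_results)

if __name__ == '__main__':
    reactor.callWhenRunning(fetch_all)
    reactor.run()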

vartec
@vartec no need to go for any extra 3rd-party packages. Python 2.6+ has pretty good built-in packages for this kind of purpose.
MovieYoda
Uh oh, someone mentioned Twisted; that means all the other answers are going to get downvoted. http://stackoverflow.com/questions/3490173/how-can-i-speed-up-fetching-pages-with-urllib2-in-python/3490191#3490191
Nick T
@movieyoda: well, for obvious reasons (GAE, Jython) I like to stay compatible with 2.5. Anyway, maybe I'm missing out on something: what support for asynchronous web calls was introduced in Python 2.6?
vartec
@Nick: unfortunately, because of the GIL, Python sucks at threading (I know, function calls are made with the GIL released), so you gain nothing from using threads instead of deferred async calls. On the other hand, event-driven programming rules even in cases where you actually could use threads (see: nginx, lighttpd), and obviously in the case of Python (Twisted, Tornado).
vartec
@vartec if I am not wrong, the `multiprocessing` module was made available natively in Python from 2.6 onwards. I think it was called `pyprocessing` before that, a separate 3rd-party module.
MovieYoda
@movieyoda: true, but I wouldn't call `multiprocessing` a package for the same purpose as async calls.
vartec
A: 

As you may know, multi-threading in Python is not truly parallel, due to the GIL; essentially only a single thread runs at any given time. So if you want multiple URLs to be fetched at the same time in your program, multi-threading might not be the way to go. Also, after the crawl do you store the data in a single file or in some persistent DB? That decision could affect your performance.

Multiple processes are more efficient in that respect, but they have the time and memory overhead of spawning extra processes. I have explored both of these options in Python recently. Here's the URL (with code):

http://stackoverflow.com/questions/3586723/python-multiprocessing-module

MovieYoda
IO code releases the GIL while it blocks. For IO-bound tasks `threading` works well.
Andrey Vlasovskikh
Can you give an example of the above with threading?
All I wanted to say was that while considering multi-threading in Python one needs to keep the GIL in mind. After getting the URL data, one may want to parse it (create a DOM -> CPU intensive) or dump it directly into a file (an IO operation). In the latter case the effect of the GIL is downplayed, but in the former the GIL plays a prominent part in the efficiency of the program. Don't know why people take it so personally that they have to downvote the post...
MovieYoda
I am new to Python. I looked at your code; there are no imports. Thanks for the offer of help.
Oh, in that case it should be `import multiprocessing`.
MovieYoda
+1  A: 

So here's a very simple example. It iterates over `symbols`, passing one symbol at a time to `fetch_quote`.

import urllib
import multiprocessing

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN','GGP', 'JPM', 'AIG', 'AMZN')
#symbols = ('GGP')

def fetch_quote(symbol):
    url = URL % '+'.join(symbol)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data


def main():

    PROCESSES = 4
    print 'Creating pool with %d processes\n' % PROCESSES
    pool = multiprocessing.Pool(PROCESSES)
    print 'pool = %s' % pool
    print

    results = [pool.apply_async(fetch_quote, sym) for sym in symbols]

    print 'Ordered results using pool.apply_async():'
    for r in results:
        print '\t', r.get()

    pool.close()
    pool.join()

if __name__ =='__main__':
    main()
mluebke
There might be some issues if the retrieved pages are quite large. `multiprocessing` uses inter-process communication mechanisms for exchanging information among processes.
Andrey Vlasovskikh
True, the above was for simple illustrative purposes only. YMMV, but I wanted to show how simple it was to take his code and make it multiprocess.
mluebke
I got this error:

Creating pool with 4 processes
pool = <multiprocessing.pool.Pool object at 0x031956D0>
Ordered results using pool.apply_async():
Traceback (most recent call last):
  File "C:\py\Raw\Yh_Mp.py", line 36, in <module>
    main()
  File "C:\py\Raw\Yh_Mp.py", line 30, in main
    print '\t', r.get()
  File "C:\Python26\lib\multiprocessing\pool.py", line 422, in get
    raise self._value
TypeError: fetch_quote() takes exactly 1 argument (3 given)
This still does not work.
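The traceback above comes from `apply_async` treating the string `'GGP'` as a sequence of arguments, so `fetch_quote` receives three characters instead of one symbol; `'+'.join(symbol)` inside the function has the same problem. A minimal, untested fix is to format the URL with the symbol directly and wrap each symbol in a one-element tuple:

def fetch_quote(symbol):
    # one symbol per call; '+'.join(symbol) would turn 'GGP' into 'G+G+P'
    url = URL % symbol
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

# wrap each symbol in a tuple so apply_async passes it as a single argument
results = [pool.apply_async(fetch_quote, (sym,)) for sym in symbols]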
+1  A: 

You have a process that requests several pieces of information at once. Let's try to fetch this information one by one. Your code would be:

def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    for symbol in symbols:
        data_fp = fetch_quote((symbol,))
        print data_fp

if __name__ == "__main__":
    main()

So main() calls every URL one by one to get the data. Let's multiprocess it with a pool:

import urllib
from multiprocessing import Pool

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&amp;f=sl1t1v&amp;e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN','GGP', 'JPM', 'AIG', 'AMZN')

def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    for symbol in symbols:
        data_fp = fetch_quote((symbol,))
        print data_fp

if __name__ =='__main__':
    pool = Pool(processes=5)
    for symbol in symbols:
        result = pool.apply_async(fetch_quote, [(symbol,)])
        print result.get(timeout=1)

In this version, each symbol's URL is requested by a worker process from the pool.

Note: in Python, since the GIL is present, multithreading should mostly be considered the wrong solution.

For documentation see: Multiprocessing in python

ohe
The GIL is not an issue here because this task is definitely IO-bound.
Andrey Vlasovskikh
Thanks, trying to understand the code, not just copy it.
I get a lot of errors when running longer lists of symbols.
This method is much slower than no multi-processing. If I use a list of 150 stocks, there are errors and it is very slow (copy the list above so the stocks equal 150): very slow. Would threading be better?
MovieYoda
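One likely reason for the slowdown: in the `__main__` block above, each `apply_async` is followed immediately by `result.get(timeout=1)`, so the main process waits for each fetch to finish before submitting the next one, which makes the work effectively serial on top of the process start-up cost; the one-second timeout can also raise `TimeoutError` on slow fetches. A rough, untested sketch of the usual pattern, submitting everything first and collecting afterwards:

if __name__ == '__main__':
    pool = Pool(processes=5)
    # submit all requests first so the workers actually run in parallel...
    results = [pool.apply_async(fetch_quote, [(symbol,)]) for symbol in symbols]
    pool.close()
    # ...then collect the results, without a short timeout
    for result in results:
        print result.get()
    pool.join()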
@movie, thanks. Single thread is 2 sec, multiprocess is 18 sec. For comparison, how would I multithread this... or will the same problem arise?
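For comparison, a rough, untested sketch of the same pool pattern with threads instead of processes. `multiprocessing.dummy` ships alongside `multiprocessing` and exposes the same `Pool` API backed by threads; since this work is IO-bound and the GIL is released while the sockets block, threads usually work fine here:

import urllib
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN', 'GGP', 'JPM', 'AIG', 'AMZN')

def fetch_quote(symbol):
    # one request per symbol; the GIL is released while the socket blocks
    fp = urllib.urlopen(URL % symbol)
    try:
        return fp.read()
    finally:
        fp.close()

if __name__ == '__main__':
    pool = ThreadPool(4)                  # 4 worker threads
    results = pool.map(fetch_quote, symbols)
    pool.close()
    pool.join()
    for data in results:
        print data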