views: 171
answers: 4
In the code below, I am considering using multi-threading or multi-processing for fetching from the URL. I think pools would be ideal. Can anyone suggest a solution?

Idea: a pool of threads/processes that collects the data. My preference is processes over threads, but I'm not sure.

import urllib

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN','GGP', 'JPM', 'AIG', 'AMZN')
#symbols = ('GGP')

def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    data_fp = fetch_quote(symbols)
#    print data_fp
if __name__ =='__main__':
    main()
A: 

Actually it's possible to do it with neither. You can get it done in one thread using asynchronous calls, for example twisted.web.client.getPage from Twisted Web.
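A rough, untested sketch of that approach (it assumes an older Twisted release where `twisted.web.client.getPage` is still available, and reuses the URL and symbols from the question):

# Rough sketch (untested): all requests issued from a single thread using
# Twisted's asynchronous getPage; assumes twisted.web.client.getPage exists.
from twisted.internet import defer, reactor
from twisted.web.client import getPage

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN')

def print_results(results):
    # results is a list of (success, data) pairs, one per Deferred
    for success, data in results:
        if success:
            print data
    reactor.stop()

def fetch_all():
    deferreds = [getPage(URL % sym) for sym in symbols]
    defer.DeferredList(deferreds).addCallback(print_results)

if __name__ == '__main__':
    reactor.callWhenRunning(fetch_all)
    reactor.run()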

vartec
@vartec no need to go for any extra 3rd-party packages. Python 2.6+ has pretty good built-in packages for this kind of purpose.
MovieYoda
Uh oh, someone mentioned Twisted; that means all the other answers are going to get downvoted. http://stackoverflow.com/questions/3490173/how-can-i-speed-up-fetching-pages-with-urllib2-in-python/3490191#3490191
Nick T
@movieyoda: well, for obvious reasons (GAE, Jython) I like to stay compatible with 2.5. Anyway, maybe I'm missing out on something: what support for asynchronous web calls was introduced in Python 2.6?
vartec
@Nick: unfortunately, because of the GIL, Python sucks at threading (I know, function calls are made with the GIL released), so you gain nothing from using threads instead of deferred async calls. On the other hand, event-driven programming rules even in cases where you actually could use threads (see: nginx, lighttpd), and obviously in the case of Python (Twisted, Tornado).
vartec
@vartec if I am not wrong, the `multiprocessing` module was made available natively in Python from 2.6 onwards. I think it was called `pyprocessing` before that, a separate 3rd-party module.
MovieYoda
@movieyoda: true, but I wouldn't call `multiprocessing` a package for the same purpose as async calls.
vartec
A: 

As you may know, multi-threading in Python is not truly parallel, due to the GIL; essentially only a single thread runs at any given time. So if you want multiple URLs to be fetched at the same time in your program, multi-threading might not be the way to go. Also, after the crawl do you store the data in a single file or in some persistent DB? That decision could affect your performance.

Multiple processes are more efficient in that respect, but they have the time and memory overhead of spawning extra processes. I have explored both of these options in Python recently. Here's the URL (with code):

http://stackoverflow.com/questions/3586723/python-multiprocessing-module

MovieYoda
IO code releases the GIL while it blocks. For IO-bound tasks `threading` works well.
Andrey Vlasovskikh
Can you give an example of the above with threading?
All I wanted to say was that while considering multi-threading in Python one needs to keep the GIL in mind. After getting the URL data, one may want to parse it (create a DOM -> CPU intensive) or dump it directly into a file (an IO operation). In the latter case the effect of the GIL is downplayed, but in the former the GIL plays a prominent part in the efficiency of the program. Don't know why people take it so personally that they have to downvote the post...
MovieYoda
I am new to Python. I looked at your code; there are no imports. Thanks for the offer of help.
Oh, in that case it should be `import multiprocessing`.
MovieYoda
+1  A: 

So here's a very simple example. It iterates over `symbols`, passing one symbol at a time to `fetch_quote`.

import urllib
import multiprocessing

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN','GGP', 'JPM', 'AIG', 'AMZN')
#symbols = ('GGP')

def fetch_quote(symbol):
    url = URL % '+'.join(symbol)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data


def main():

    PROCESSES = 4
    print 'Creating pool with %d processes\n' % PROCESSES
    pool = multiprocessing.Pool(PROCESSES)
    print 'pool = %s' % pool
    print

    results = [pool.apply_async(fetch_quote, sym) for sym in symbols]

    print 'Ordered results using pool.apply_async():'
    for r in results:
        print '\t', r.get()

    pool.close()
    pool.join()

if __name__ =='__main__':
    main()
mluebke
There might be some issues if the retrieved pages are quite large. `multiprocessing` uses inter-process communication mechanisms for exchanging information among processes.
Andrey Vlasovskikh
True, the above was for simple illustrative purposes only. YMMV, but I wanted to show how simple it was to take his code and make it multiprocess.
mluebke
I got this error:

Creating pool with 4 processes
pool = <multiprocessing.pool.Pool object at 0x031956D0>
Ordered results using pool.apply_async():
Traceback (most recent call last):
  File "C:\py\Raw\Yh_Mp.py", line 36, in <module>
    main()
  File "C:\py\Raw\Yh_Mp.py", line 30, in main
    print '\t', r.get()
  File "C:\Python26\lib\multiprocessing\pool.py", line 422, in get
    raise self._value
TypeError: fetch_quote() takes exactly 1 argument (3 given)
This still does not work.
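The traceback above comes from `apply_async` treating the string `'GGP'` as a sequence of arguments, so `fetch_quote` receives three characters instead of one symbol; `'+'.join(symbol)` inside the function has the same problem. A minimal, untested fix is to format the URL with the symbol directly and wrap each symbol in a one-element tuple:

def fetch_quote(symbol):
    # one symbol per call; '+'.join(symbol) would turn 'GGP' into 'G+G+P'
    url = URL % symbol
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

# wrap each symbol in a tuple so apply_async passes it as a single argument
results = [pool.apply_async(fetch_quote, (sym,)) for sym in symbols]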
+1  A: 

You have a process that requests several pieces of information at once. Let's try to fetch this information one by one. Your code would be:

def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    for symbol in symbols:
        data_fp = fetch_quote((symbol,))
        print data_fp

if __name__ == "__main__":
    main()

So main() calls every URL one by one to get the data. Let's multiprocess it with a pool:

import urllib
from multiprocessing import Pool

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&amp;f=sl1t1v&amp;e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN','GGP', 'JPM', 'AIG', 'AMZN')

def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    for symbol in symbols:
        data_fp = fetch_quote((symbol,))
        print data_fp

if __name__ =='__main__':
    pool = Pool(processes=5)
    for symbol in symbols:
        result = pool.apply_async(fetch_quote, [(symbol,)])
        print result.get(timeout=1)

In this version, each symbol's URL is requested by a worker process from the pool.

Note: in Python, since the GIL is present, multithreading should mostly be considered the wrong solution.

For documentation see: Multiprocessing in python

ohe
The GIL is not an issue here because this task is definitely IO-bound.
Andrey Vlasovskikh
Thanks, trying to understand the code, not just copy it.
I get a lot of errors when running longer lists of symbols.
This method is much slower than no multi-processing. If I use a list of 150 stocks, there are errors and it is very slow (copy the list above so the stocks equal 150): very slow. Would threading be better?
MovieYoda
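One likely reason for the slowdown: in the `__main__` block above, each `apply_async` is followed immediately by `result.get(timeout=1)`, so the main process waits for each fetch to finish before submitting the next one, which makes the work effectively serial on top of the process start-up cost; the one-second timeout can also raise `TimeoutError` on slow fetches. A rough, untested sketch of the usual pattern, submitting everything first and collecting afterwards:

if __name__ == '__main__':
    pool = Pool(processes=5)
    # submit all requests first so the workers actually run in parallel...
    results = [pool.apply_async(fetch_quote, [(symbol,)]) for symbol in symbols]
    pool.close()
    # ...then collect the results, without a short timeout
    for result in results:
        print result.get()
    pool.join()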
@movie, thanks. Single thread is 2 sec, multiprocess is 18 sec. For comparison, how would I multithread this... or will the same problem arise?
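For comparison, a rough, untested sketch of the same pool pattern with threads instead of processes. `multiprocessing.dummy` ships alongside `multiprocessing` and exposes the same `Pool` API backed by threads; since this work is IO-bound and the GIL is released while the sockets block, threads usually work fine here:

import urllib
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN', 'GGP', 'JPM', 'AIG', 'AMZN')

def fetch_quote(symbol):
    # one request per symbol; the GIL is released while the socket blocks
    fp = urllib.urlopen(URL % symbol)
    try:
        return fp.read()
    finally:
        fp.close()

if __name__ == '__main__':
    pool = ThreadPool(4)                  # 4 worker threads
    results = pool.map(fetch_quote, symbols)
    pool.close()
    pool.join()
    for data in results:
        print data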