Hey guys. As the title suggests, I'm working on a site written in Python that makes several calls to the urllib2 module to read websites, which I then parse with BeautifulSoup.

As I have to read 5-10 sites, the page takes a while to load.

I'm just wondering if there's a way to read the sites all at once? Or any tricks to make it faster, like should I close the urllib2.urlopen after each read, or keep it open?

Added: also, if I were to just switch over to PHP, would that be faster for fetching and parsing HTML and XML files from other sites? I just want it to load faster than the ~20 seconds it currently takes.

A: 

1) Are you opening the same site many times, or many different sites? If many different sites, I think urllib2 is good. If you're hitting the same site over and over again, I have had some personal luck with urllib3 (http://code.google.com/p/urllib3/) — see the sketch at the end of this answer.

2) BeautifulSoup is easy to use, but is pretty slow. If you do have to use it, make sure to decompose your tags to get rid of memory leaks, or it will likely lead to memory issues (it did for me).

What do your memory and CPU usage look like? If you are maxing out your CPU, make sure you are using real heavyweight threads so you can run on more than one core.
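For the urllib3 suggestion in point 1, a minimal sketch of the connection re-use it gives you is below. Note this assumes the API of current urllib3 releases (the code.google.com version linked above exposed a slightly different interface), and the URLs are placeholders.

import urllib3

# One PoolManager keeps TCP connections open and re-uses them per host,
# which is where the win over urllib2 comes from for repeated requests.
pool = urllib3.PoolManager()

for url in ['http://www.example.com/a', 'http://www.example.com/b']:
    response = pool.request('GET', url)  # same connection re-used per host
    print(len(response.data))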

bwawok
I'm accessing XML pages for Amazon, eBay, and Half. So while similar, the products and prices change
Jack z
Okay, so then urllib2 is fine. You need to thread out your program to use heavyweight threads, and parse as efficiently as possible.
bwawok
Okay, thanks much!
Jack z
A: 

Edit: Please take a look at Wai's post for a better version of this code. Note that there is nothing wrong with this code and it will work properly, despite the comments below.

The speed of reading web pages is probably bounded by your Internet connection, not Python.

You could use threads to load them all at once.

import thread, time, urllib

websites = {}

def read_url(url):
    # Each thread stores its page's contents in the shared dict, keyed by URL.
    websites[url] = urllib.urlopen(url).read()

# Start one thread per URL (urls_to_load is assumed to be a list of URL strings).
for url in urls_to_load:
    thread.start_new(read_url, (url,))

# Poll until every URL has been fetched.
while len(websites) < len(urls_to_load):
    time.sleep(0.1)

# Now websites will contain the contents of all the web pages in urls_to_load
Dumb Guy
The bottleneck is probably not even the internet connection but the remote server. However, BeautifulSoup is slow in any case, so it will add an extra delay.
WoLpH
Oh okay, that makes sense. And I appreciate the example code thanks!
Jack z
-1 for threads *and* suggesting the `thread` module _and_ not doing any locking or *even* using the `Queue` module. You're just going to add way more complexity and locking overhead for no gain if you use threads. Even if this wasn't true, your code demonstrates that you don't really know how to use threads.
Aaron Gallagher
The global interpreter lock should keep the dictionary assignment from happening simultaneously in two different threads. I should have mentioned it, though.
Dumb Guy
So threading isn't the way to go?
Jack z
@Dumb Guy, no, it doesn't. The GIL isn't a replacement for proper locking, and also isn't present in all python implementations. Either way, mutating global state is a horrible, *horrible* way to communicate between threads. This is what the `Queue` module is for.
Aaron Gallagher
@Dumb Guy, I'm sorry that you're being butthurt over your broken code, but it really doesn't "work properly" even if you put that disclaimer at the top of your answer.
Aaron Gallagher
A: 

How about using pycurl?

You can install it with apt-get:

$ sudo apt-get install python-pycurl
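If you go the pycurl route, a fetch roughly equivalent to urllib2.urlopen(url).read() looks something like this sketch (not from the original answer; the cStringIO buffer is just one common way to collect the response body):

import pycurl
from cStringIO import StringIO

def fetch(url):
    buf = StringIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)  # collect the response body
    c.perform()
    c.close()
    return buf.getvalue()

print(len(fetch('http://www.example.com/')))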
OTZ
Pycurl is not faster than urllib2 in my experience
bwawok
A: 

As a general rule, a given construct in any language is not slow until it is measured.

In Python, not only do timings often run counter to intuition, but the tools for measuring execution time are exceptionally good.
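For example, a minimal sketch using the standard library's timing tools (the expression and URL here are placeholders, not part of the original answer):

import time, timeit, urllib2

# Micro-benchmark a small expression with timeit.
print(timeit.timeit("'-'.join(str(n) for n in range(100))", number=10000))

# Or wrap the real work in a wall-clock timer to see where the ~20 seconds go.
start = time.time()
urllib2.urlopen('http://www.example.com/').read()
print('fetch took %.2f seconds' % (time.time() - start))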

msw
A: 

Scrapy might be useful for you. If you don't need all of its functionality, you might just use twisted's twisted.web.client.getPage instead. Asynchronous IO in one thread is going to be way more performant and easy to debug than anything that uses multiple threads and blocking IO.
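A rough sketch of what fetching several pages concurrently with getPage looks like (URLs are placeholders; this assumes an older Twisted release where twisted.web.client.getPage is still available):

from twisted.internet import reactor, defer
from twisted.web.client import getPage

urls_to_load = ['http://www.example.com/', 'http://www.example.org/']

def print_results(results):
    # DeferredList fires with a list of (success, page-or-failure) pairs.
    for success, page in results:
        if success:
            print('fetched %d bytes' % len(page))
    reactor.stop()

deferreds = [getPage(url) for url in urls_to_load]
defer.DeferredList(deferreds).addCallback(print_results)
reactor.run()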

Aaron Gallagher
Okay, I've heard about that being faster. Thanks!
Jack z
@msw, is my answer cut off in your browser? The full sentence is "Asynchronous IO in one thread is going to be way more performant and easy to debug than anything that uses multiple threads and blocking IO."
Aaron Gallagher
@Aaron Gallagher: I should have been more clear; sorry. The OP hasn't even made a case for needing asynchronous IO, and your philosophy of "get it right, first" noted above is a good stance. But I fear it isn't impressing the OP, oh well ;)
msw
A: 

Why has Dumb Guy's answer got -1? He is using old modules, all right, but he is the first person to propose the right approach (threads) and he provided a working example.

I'm rewriting his code using modern Python modules like threading and Queue.

import threading, urllib2
import Queue

urls_to_load = [
'http://stackoverflow.com/',
'http://slashdot.org/',
'http://www.archive.org/',
'http://www.yahoo.co.jp/',
]

def read_url(url, queue):
    data = urllib2.urlopen(url).read()
    print('Fetched %s from %s' % (len(data), url))
    queue.put(data)

def fetch_parallel():
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, result)) for url in urls_to_load]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def fetch_sequential():
    result = Queue.Queue()
    for url in urls_to_load:
        read_url(url,result)
    return result

Best time for fetch_sequential() is 2s. Best time for fetch_parallel() is 0.9s.

Also, it is incorrect to say threads are useless in Python because of the GIL. This is one of those cases where threads are useful in Python, because they are blocked on I/O. As you can see from my results, the parallel case is about 2 times faster.
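A minimal sketch of how that comparison could be timed, assuming the two functions above (the exact harness isn't shown in the answer):

import time

start = time.time()
fetch_sequential()
print('sequential: %.1f s' % (time.time() - start))

start = time.time()
fetch_parallel()
print('parallel:   %.1f s' % (time.time() - start))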

Wai Yip Tung
Yes, *if* this was the only way to fetch URLs, this would be closer to the correct way to use threads. However, async IO is *still* going to be faster, more maintainable, allow for deterministic debugging, and so on. Even without the GIL, it would be a superior solution.
Aaron Gallagher
Oops, it looks like Dumb Guy has retracted his answer. Hey, I'd say you were on the right track!
Wai Yip Tung
Aaron, can you provide a working example to show that async IO code is more maintainable?
Wai Yip Tung
Thanks, Wai! I'll give this code a try! :)
Jack z
@Wai Yip Tung, less code is going to be more maintainable than more code, especially if it's immediately obvious what that code does. Threads require more code to do less in order to work around the problems with shared-state concurrency (i.e. you need locks). You could use worker processes instead of worker threads in order to eliminate the shared-state part, but still, you could just use twisted and be done with it.
Aaron Gallagher
Oh, I didn't realize Twisted could fetch pages, thanks for mentioning that!
Jack z
I've restored my post for reference here.
Dumb Guy