I have a task to download GBs of data from a website. The data is in the form of .gz files, each about 45 MB in size.
The easy way to get the files is to use "wget -r -np -A files url". This downloads the data recursively and mirrors the website. The download rate is very high: 4 MB/sec.
But, just to play around, I was also using Python to build my own URL parser and downloader.
Downloading via Python's urlretrieve is damn slow — about 500 KB/sec, compared with wget's 4 MB/sec, so roughly 8 times slower. I use HTMLParser to parse the href tags.
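For reference, here is a minimal sketch of the approach (the actual script differs in details, and the URL below is a placeholder, but it follows this pattern of HTMLParser plus urlretrieve):

    from html.parser import HTMLParser
    from urllib.request import urlopen, urlretrieve
    from urllib.parse import urljoin

    class LinkParser(HTMLParser):
        """Collect href values of .gz links from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value and value.endswith(".gz"):
                        self.links.append(value)

    base_url = "http://example.com/data/"   # placeholder for the real index page
    html = urlopen(base_url).read().decode("utf-8", errors="replace")

    parser = LinkParser()
    parser.feed(html)

    for href in parser.links:
        file_url = urljoin(base_url, href)
        filename = href.rsplit("/", 1)[-1]
        print("downloading", file_url)
        urlretrieve(file_url, filename)      # this is the step that crawls at ~500 KB/sec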
I am not sure why this is happening. Are there any settings I can tweak for this?
Thanks