How do I seek to a particular position in a remote file so I can download only that part?

Let's say the bytes of a remote file are: 1234567890

I want to seek to 4 and download 3 bytes from there, so I would have: 456

Also, how do I check whether a remote file exists? I tried os.path.isfile(), but it returns False when I pass it a remote file URL.

+6  A: 

If you are downloading the remote file through HTTP, you need to set the Range header.

This example shows how it can be done. It looks like this:

myUrlclass.addheader("Range","bytes=%s-" % (existSize))
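For the exact case in the question (3 bytes starting at the fourth byte), the header needs both a start and an end. Here is a minimal sketch, assuming Python 3's urllib.request rather than urllib2, and assuming "seek to 4" means the fourth byte, i.e. zero-based offset 3. The names range_header and fetch_range are made up for illustration.

```python
import urllib.request


def range_header(offset, length):
    """Build a Range header value for `length` bytes starting at the
    zero-based `offset`. Both ends of an HTTP byte range are inclusive,
    so the last byte is offset + length - 1."""
    if offset < 0 or length < 1:
        raise ValueError("offset must be >= 0 and length must be >= 1")
    return "bytes=%d-%d" % (offset, offset + length - 1)


def fetch_range(url, offset, length, timeout=10):
    """Request only one slice of `url` and return (status, data).

    A status of 206 means the server honoured the range. A status of 200
    means it ignored the header, and `data` is then the whole body.
    """
    req = urllib.request.Request(
        url, headers={"Range": range_header(offset, length)}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read()
```

For a file containing 1234567890, range_header(3, 3) gives bytes=3-5, which should come back as 456 if the server supports range requests.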

EDIT: I just found a better implementation. This class is very simple to use, as can be seen in the docstring.

import urllib
import urllib2


class RangeError(IOError):
    """Raised when the server cannot satisfy the requested byte range."""


class HTTPRangeHandler(urllib2.BaseHandler):
    """Handler that enables HTTP Range headers.

    This was extremely simple. The Range header is an HTTP feature to
    begin with, so all this class does is tell urllib2 that the
    "206 Partial Content" response from the HTTP server is what we
    expected.

    Example:
        import urllib2
        import byterange

        range_handler = byterange.HTTPRangeHandler()
        opener = urllib2.build_opener(range_handler)

        # install it
        urllib2.install_opener(opener)

        # create Request and set Range header
        req = urllib2.Request('http://www.python.org/')
        req.add_header('Range', 'bytes=30-50')
        f = urllib2.urlopen(req)
    """

    def http_error_206(self, req, fp, code, msg, hdrs):
        # 206 Partial Content response: hand it back as a normal result
        r = urllib.addinfourl(fp, hdrs, req.get_full_url())
        r.code = code
        r.msg = msg
        return r

    def http_error_416(self, req, fp, code, msg, hdrs):
        # 416 Requested Range Not Satisfiable
        raise RangeError('Requested Range Not Satisfiable')
jbochi
+1 for the update w/ better implementation.
Kevin Little
just what I needed. thanks.
Marconi
A: 

I think the key to your question is that you said "remote file url". This implies that you are using an HTTP URL to download a file with an HTTP "get" operation.

So I just did a Google search for "HTTP get" and I found this for you:

http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35

It looks like you can specify a byte range in an HTTP GET request by sending a Range header.

So, you need to use an HTTP library that lets you specify the byte range. And as I was typing this, jbochi posted a link to an example.

steveha
+1  A: 

AFAIK, this is not possible using fseek() or similar. You need to use the HTTP Range header to achieve this. This header may or may not be supported by the server, so your mileage may vary.

import urllib2

# Ask for the first ten bytes only (both ends of the range are inclusive)
myHeaders = {'Range': 'bytes=0-9'}

req = urllib2.Request('http://www.promotionalpromos.com/mirrors/gnu/gnu/bash/bash-1.14.3-1.14.4.diff.gz', headers=myHeaders)

partialFile = urllib2.urlopen(req)

s2 = partialFile.read()
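Since the server may ignore the Range header, it helps to check the reply before trusting the data. This is a rough sketch, assuming you already have the status code and the value of the Content-Range response header (or None if it is missing). The names parse_content_range and range_was_honoured are made up for illustration.

```python
import re

# Matches e.g. "bytes 3-5/10" or "bytes 3-5/*" (total length unknown).
# The "bytes */10" form sent with a 416 reply is deliberately rejected.
_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def parse_content_range(value):
    """Return (first, last, total) from a Content-Range header value.

    `first` and `last` are inclusive zero-based byte positions.
    `total` is None when the server reports the length as "*".
    """
    match = _CONTENT_RANGE.match(value.strip())
    if match is None:
        raise ValueError("unrecognised Content-Range: %r" % (value,))
    first, last = int(match.group(1)), int(match.group(2))
    total = None if match.group(3) == "*" else int(match.group(3))
    return first, last, total


def range_was_honoured(status, content_range):
    """True only when the server answered 206 and said which bytes it sent.

    A plain 200 means the header was ignored and the whole file is coming.
    """
    return status == 206 and content_range is not None
```

With a 206 reply, comparing the parsed first and last positions against what you asked for tells you whether you got the slice you wanted.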

EDIT: This is of course assuming that by remote file you mean a file stored on an HTTP server...

If the file you want is on an FTP server, FTP only allows you to specify a start offset, not a range. If this is what you want, then the following code should do it (not tested!)

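import ftplib

fileToRetrieve = 'somefile.zip'
fromByte = 15
ftp = ftplib.FTP('ftp.someplace.net')
ftp.login()  # anonymous login; pass a user name and password if the server needs them
outFile = open('partialFile', 'wb')
# rest= sends a REST command, so the transfer starts at byte offset fromByte
ftp.retrbinary('RETR ' + fileToRetrieve, outFile.write, rest=str(fromByte))
outFile.close()
ftp.quit()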
Chinmay Kanchi
You should also handle 206 response codes, because they are the expected result when you are using the HTTP Range header.
jbochi
Fair enough. Your answer does that though :)
Chinmay Kanchi