In this post, I will describe a method I found for downloading large files from a server that imposes a file-size limit. So, intuitively, what do you think I am going to do? Download the file in parts, of course. But how?
At first, I started testing how the proxy server actually detects such requests. I found that it merely checks the length field of the incoming packet and throws an error if it exceeds the maximum allowed value.
Then I went through the HTTP request protocol. At some point I learned that there is a special header field called Range, with which we can request a specific span of bytes in the file, given as zero-based start and end offsets.
For example, suppose you want to download a file of, say, 50 bytes (tiny by today's standards), and you only want the 34th through 43rd bytes. Then the HTTP request looks as follows:
GET /file_name.extension HTTP/1.1
....
....
Range: bytes=33-42
...
...

The range starts from 33 because the offsets are zero-based: the first byte of the file is byte 0, and so on. Now I was able to figure out how to download a file in parts. But another question remained: what is the size of the entire file? How do we figure that out?
This problem was solved when I looked at the server's response. For instance, the response to the above request looks as follows:
HTTP/1.1 206 Partial Content
....
....
Content-Range: bytes 33-42/50
Content-Length: 10
...
...

You should be able to figure it out from the above response: the total length of the file is sent in the Content-Range field, after the "/". So I first request only 1 byte of data, which gives me the length of the file, and then proceed to download it in parts.
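To make that concrete, here is a minimal sketch of the one-byte probe using urllib2; the URL is just a placeholder, and the full script below does the same thing before splitting up the download:

import urllib2

# Placeholder URL; any server that honours Range requests will do.
url = 'http://example.com/some_large_file.iso'
req = urllib2.Request(url)
req.headers['Range'] = 'bytes=0-0'   # ask for a single byte
resp = urllib2.urlopen(req)
# The total size sits after the '/' in the Content-Range header,
# e.g. "bytes 0-0/1048576".
total_size = int(resp.info()['Content-Range'].split('/')[1])
print 'File-size:', total_size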
With the idea ready, it was just a matter of coding it up. I used Python's urllib2, since I was too lazy to write it in C.
As an extension, I used threads, which sped things up to a large extent; it felt as if I were downloading from my Local Area Network rather than the Internet.
You may ask what's so special about this, since many download accelerators use the same trick. But wait a minute: none of them employ the exact method I have described above. Indeed, well-known programs such as axel, wget and others failed to download the file once this constraint was in place. So I guess my idea is a bit better under these circumstances :).
Here is the code for my idea in Python:
import urllib2, sys, thread, time, tempfile, os

# Each thread appends [thread_id, temp_file] here when its chunk is done.
data = []

def partial_download(url, st, en, idv):
    # Download bytes st..en (inclusive, zero-based) of url into a temporary file.
    global data
    req = urllib2.Request(url)
    req.headers['Range'] = 'bytes=' + str(st) + '-' + str(en)
    f = urllib2.urlopen(req)
    fd = tempfile.NamedTemporaryFile(delete=False)
    while 1:
        stt = f.read(8192)
        if not stt:
            break
        fd.write(stt)
    fd.close()
    data.append([idv, fd])
    print 'Thread:', str(idv), 'finished getting', str(en - st + 1), 'bytes to', fd.name

if len(sys.argv) < 3:
    print 'Format: [url] [parallel_download_count]'
    sys.exit()

parallel_download_count = int(sys.argv[2])

# All requests go through the proxy that imposes the size limit.
proxy = urllib2.ProxyHandler({'http': 'http://172.30.0.19:3128'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

link = sys.argv[1]
file_name = link.split('/')[-1]
print file_name

# First we need to know the total size: request a single byte and read
# the figure after the '/' in the Content-Range header of the 206 reply.
req = urllib2.Request(link)
req.headers['Range'] = 'bytes=0-0'
f = urllib2.urlopen(req)
meta = f.info()
content_length = int(meta['Content-Range'].split('/')[1])
print 'File-size:', content_length

# Split the file into equal chunks, one thread per chunk.
chunk_size = content_length / parallel_download_count
curr_count = 0
idc = 0
while curr_count + chunk_size <= content_length:
    thread.start_new_thread(partial_download, (link, curr_count, curr_count + chunk_size - 1, idc))
    idc += 1
    curr_count += chunk_size
if curr_count < content_length:
    # Whatever is left over after the equal-sized chunks.
    thread.start_new_thread(partial_download, (link, curr_count, content_length - 1, idc))
    idc += 1

# Wait until every thread has reported back.
while len(data) < idc:
    time.sleep(1)

print 'Merging into single file...'
data.sort()  # order the chunks by thread id
fd = open(file_name, 'wb')
for chunk in data:
    tmp_fd = open(chunk[1].name, 'rb')
    tmps = tmp_fd.read()
    fd.write(tmps)
    print 'Wrote', len(tmps), 'bytes!'
    tmp_fd.close()
    os.unlink(chunk[1].name)
fd.close()
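Assuming the script is saved as, say, downloader.py (the file name and URL here are just placeholders), it can be run like this:

python downloader.py http://example.com/some_large_file.iso 8

The second argument is the number of parallel chunks; pick it large enough that each chunk stays under whatever size limit the proxy enforces.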