In this post, I describe a method I found for downloading large files through a server that imposes a file-size limit. Intuitively, what do you think I am going to do? Download the file in parts. But how?
First, I tested how the proxy server is able to detect such requests. I found that it merely checks the length field of the incoming packet and returns an error if the size exceeds the configured maximum.
Then I read through the HTTP request protocol and learned that there is a header field called Range. With it, we can request a specific span of a file by giving its start and end byte positions, counted from zero.
For example, suppose the file is 50 bytes long (tiny by today's standards) and you only want the 34th through 43rd bytes. The HTTP request then looks as follows:
    GET file_name.extension HTTP/1.1
    ....
    ....
    Range: bytes=33-42
    ...
    ...

It starts from 33 because the range is zero-based: the first byte is at position zero, and so on. Now I know how to download a file in parts. But another question remains: what is the size of the entire file, and how do I find it out?
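Before turning to that question, the zero-based arithmetic can be made concrete with a minimal sketch in Python 3. The helper name range_header is hypothetical and not part of my script; it simply converts 1-based byte positions (the 34th through 43rd bytes) into the value of the Range header.

```python
def range_header(first_byte, last_byte):
    """Build a Range header value from 1-based, inclusive byte positions."""
    if first_byte < 1 or last_byte < first_byte:
        raise ValueError("expected 1 <= first_byte <= last_byte")
    # HTTP byte ranges are zero-based and inclusive, so shift both ends by one.
    return "bytes=%d-%d" % (first_byte - 1, last_byte - 1)


print(range_header(34, 43))  # bytes=33-42
```

Both ends are inclusive, so this range covers 10 bytes, not 9.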
This problem was solved when I looked at the server's response. For instance, the response to the above request looks as follows:
    HTTP/1.1 206 Partial content
    ....
    ....
    Content-Range: bytes 33-42/50
    Content-length: 10
    ...
    ...

You should be able to figure it out from the above response: the total length of the file is sent in the Content-Range field, after the "/". So I first request only 1 byte of data, which gives me the length of the file, and then proceed to download it in parts.
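The one-byte probe can be sketched roughly like this in Python 3, where urllib.request plays the role urllib2 plays in my script. The function names total_size and probe_size are made up for illustration, and the sketch assumes the server honours Range and answers with a Content-Range header.

```python
import urllib.request


def total_size(content_range):
    """Return the total file size from a value like 'bytes 33-42/50'."""
    _, sep, total = content_range.rpartition("/")
    total = total.strip()
    if not sep or total == "*":
        # "*" is how a server says it does not know the total length.
        raise ValueError("no total size in %r" % content_range)
    return int(total)


def probe_size(url):
    """Ask for the first byte only and read the total size from the reply."""
    req = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    with urllib.request.urlopen(req) as resp:
        content_range = resp.headers.get("Content-Range")
    if content_range is None:
        raise ValueError("no Content-Range header; server may ignore Range")
    return total_size(content_range)


print(total_size("bytes 33-42/50"))  # 50
```

Only total_size is exercised here; probe_size needs a reachable URL, so it is left uncalled.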
With the idea in place, it is just a matter of coding it. I used urllib2 from Python, since I was too lazy to code it in C.
As an extension, I used threads, which increased the speed to a large extent; it felt as if I were downloading from my local area network rather than the Internet.
You may ask, what's so special about this? Many download accelerators employ this method. But wait a minute: none of them use the exact method I have described above. Indeed, well-known programs such as axel, wget and others failed to download the file when such a constraint was introduced. So I guess my idea is a bit better given these circumstances :).
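The splitting and threading can be sketched in Python 3 as below. This is not my script: it uses concurrent.futures instead of the thread module, ceiling division instead of a separate remainder chunk, and a stand-in fetch function (fake_fetch, a hypothetical name) in place of a real ranged HTTP request so that it runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_size, parts):
    """Split total_size bytes into up to `parts` inclusive (start, end) ranges."""
    if total_size <= 0 or parts <= 0:
        raise ValueError("total_size and parts must be positive")
    chunk = -(-total_size // parts)  # ceiling division
    return [(start, min(start + chunk, total_size) - 1)
            for start in range(0, total_size, chunk)]


def download_in_parts(fetch, total_size, parts):
    """Fetch every range in its own worker thread and join them in order."""
    ranges = split_ranges(total_size, parts)
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        # map() yields results in the order of `ranges`, not completion order.
        chunks = list(pool.map(lambda r: fetch(*r), ranges))
    return b"".join(chunks)


# Stand-in for a real ranged HTTP request, so the sketch runs offline.
payload = bytes(range(50))


def fake_fetch(start, end):
    return payload[start:end + 1]


print(split_ranges(50, 4))  # [(0, 12), (13, 25), (26, 38), (39, 49)]
print(download_in_parts(fake_fetch, 50, 4) == payload)  # True
```

Because the results come back in range order, no sorting step is needed before merging.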
Here is the code for my idea in Python:
    import os
    import sys
    import tempfile
    import thread
    import time
    import urllib2

    # Each finished thread appends [chunk_index, temp_file_name] here.
    data = []


    def partial_download(url, st, en, idv):
        """Download bytes st..en (inclusive) of url into a temporary file."""
        req = urllib2.Request(url)
        req.headers["Range"] = 'bytes=' + str(st) + '-' + str(en)
        f = urllib2.urlopen(req)
        fd = tempfile.NamedTemporaryFile(delete=False)
        while 1:
            stt = f.read()
            if not stt:
                break
            fd.write(stt)
        fd.close()
        data.append([idv, fd.name])
        print 'Thread:', str(idv), 'finished getting', str(en - st + 1), 'bytes to', fd.name


    if len(sys.argv) < 3:
        print 'Format:[url] [parallel_download_count]'
        sys.exit()

    parallel_download_count = int(sys.argv[2])

    # Route all requests through the proxy.
    proxy = urllib2.ProxyHandler({'http': 'http://172.30.0.19:3128'})
    opener = urllib2.build_opener(proxy)
    urllib2.install_opener(opener)

    link = sys.argv[1]
    file_name = link.split("/")[-1]
    print file_name

    # First we need to know the total length: request a single byte and
    # read the size after the "/" in Content-Range.
    req = urllib2.Request(link)
    req.headers['Range'] = 'bytes=0-0'
    f = urllib2.urlopen(req)
    meta = f.info()
    content_length = int(meta["Content-Range"].split('/')[1])
    print 'File-size:', content_length

    # Guard against a zero chunk size when there are more threads than bytes.
    chunk_size = max(1, content_length // parallel_download_count)
    curr_count = 0
    idc = 0
    while curr_count + chunk_size <= content_length:
        thread.start_new_thread(partial_download,
                                (link, curr_count, curr_count + chunk_size - 1, idc))
        idc += 1
        curr_count += chunk_size
    # Fetch the leftover bytes, if the size did not divide evenly.
    if curr_count < content_length:
        thread.start_new_thread(partial_download,
                                (link, curr_count, content_length - 1, idc))
        idc += 1

    # Wait until every thread has reported its chunk.
    while len(data) < idc:
        time.sleep(1)

    print 'Merging into single file...'
    data.sort()  # order the chunks by index
    fd = open(file_name, 'wb')  # binary mode, so the bytes are written unchanged
    for chunk in data:
        tmp_fd = open(chunk[1], 'rb')
        tmps = tmp_fd.read()
        fd.write(tmps)
        print 'Wrote', len(tmps), 'bytes!'
        tmp_fd.close()
        os.unlink(chunk[1])
    fd.close()