Tuesday 11 February 2014

How I downloaded large files through a proxy server that imposed a file-size limit

In this post, I will describe a method I found for downloading large files through a proxy server that imposes a file-size limit. So, intuitively, what do you think I am going to do? I will download the file in parts. But how?

First, I tested how the proxy server actually detects such downloads. I found that it merely checks the length field of the incoming response and throws an error if that size exceeds the maximum allowed value.
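
A minimal sketch of that kind of check, only to make the idea concrete. The function name `exceeds_limit`, the header dictionary, and the 20-byte `LIMIT` are all hypothetical, and it assumes the "length field" is the Content-Length header; the proxy's real logic is not known.

```python
def exceeds_limit(response_headers, max_bytes):
    """Return True if the declared body length is larger than max_bytes.

    Hypothetical model of the proxy's check; the real proxy logic is unknown.
    """
    declared = response_headers.get("Content-Length")
    if declared is None:
        return False  # assumption: with no declared length there is nothing to compare
    return int(declared) > max_bytes


LIMIT = 20  # hypothetical limit in bytes, chosen only for illustration

# A response that declares 50 bytes would be rejected ...
print(exceeds_limit({"Content-Length": "50"}, LIMIT))  # True
# ... while a response that declares only 10 bytes would pass.
print(exceeds_limit({"Content-Length": "10"}, LIMIT))  # False
```

If the check really works like this, any response whose declared length stays under the limit gets through, no matter how large the file behind it is.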

Then I read through the HTTP specification and learned that there is a request header called Range. With it we can ask for just a slice of the file by giving the first and last byte positions, counted from zero and both inclusive.

For example, suppose the file is 50 bytes long (far too tiny nowadays) and you only want the 34th through 43rd bytes. The HTTP request then looks as follows:

GET /file_name.extension HTTP/1.1
....
....
Range: bytes=33-42
...
...

The range starts at 33 because byte positions are zero-based: the first byte is byte 0, so the 34th byte is byte 33. Now I had figured out how to download a file in parts. But another question remained: what is the size of the entire file, and how do I find it out?
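
The off-by-one between "the 34th byte" and "byte 33" is easy to get wrong, so here is a small sketch of the conversion. The helper name `range_header` is made up for illustration; it only builds the header value and sends nothing.

```python
def range_header(first_byte, last_byte):
    """Build a Range header value from 1-based, inclusive byte positions."""
    if first_byte < 1 or last_byte < first_byte:
        raise ValueError("need 1 <= first_byte <= last_byte")
    # HTTP byte ranges are zero-based and inclusive at both ends.
    return "bytes=%d-%d" % (first_byte - 1, last_byte - 1)


print(range_header(34, 43))  # bytes=33-42
```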

This problem was solved when I looked at the response from the server. For instance, the response looks as follows when the above request is sent:

HTTP/1.1 206 Partial Content
....
....
Content-Range: bytes 33-42/50
Content-Length: 10
...
...

You should be able to figure it out from the response above: the total length of the file is sent in the Content-Range field, after the "/". So I first request only 1 byte of data, which gives me the length of the file, and then proceed to download it in parts.
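
Pulling the total size out of that header can be sketched as below. `parse_content_range` is a hypothetical helper name, and the sketch assumes the `bytes first-last/total` form shown in the response above; a server may also send `*` in place of the total when it does not know the length, in which case the function returns `None` for it.

```python
def parse_content_range(value):
    """Split 'bytes 33-42/50' into (first, last, total).

    total is None when the server sends '*' for an unknown length.
    """
    unit, _, rest = value.strip().partition(" ")
    if unit != "bytes":
        raise ValueError("unsupported range unit: %r" % unit)
    span, _, total = rest.partition("/")
    first, _, last = span.partition("-")
    return int(first), int(last), (None if total == "*" else int(total))


first, last, total = parse_content_range("bytes 33-42/50")
print(total)  # 50
```

For the 1-byte probe, the header would read `bytes 0-0/50` for the same file, so the third value is the one that matters.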

With the idea in place, it is just a matter of coding it. I used urllib2 from Python, since I was too lazy to write it in C.
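
Once the size is known, the "by parts" step is just cutting the byte positions 0 to size-1 into consecutive ranges. A rough sketch follows; the name `split_ranges` is an assumption for illustration, and like a simple fixed-size split it may produce one extra, shorter part for the leftover bytes.

```python
def split_ranges(total_size, parts):
    """Return a list of (start, end) inclusive byte ranges covering the file."""
    if total_size <= 0 or parts <= 0:
        raise ValueError("total_size and parts must be positive")
    chunk = max(1, total_size // parts)  # at least 1 byte so the loop advances
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + chunk, total_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


print(split_ranges(50, 4))
# [(0, 11), (12, 23), (24, 35), (36, 47), (48, 49)]
```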

As an extension, I used threads to download the parts in parallel. This increased the speed to a large extent; it felt as if I were downloading from my local area network rather than from the Internet.
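
The threading part can be sketched independently of the network code. In the sketch below, `download_in_parallel`, `fake_fetch`, and `fake_file` are hypothetical names: the fetch function is passed in as a parameter, and an in-memory byte string stands in for the remote file so the example runs without a server. It uses the `threading` module rather than the lower-level `thread` module, which is a substitution made here for portability.

```python
import threading


def download_in_parallel(fetch_range, ranges):
    """Fetch every (start, end) range on its own thread and join the bytes in order.

    fetch_range(start, end) must return the bytes for that inclusive range.
    """
    results = [None] * len(ranges)
    errors = []

    def worker(index, start, end):
        try:
            results[index] = fetch_range(start, end)
        except Exception as exc:  # record the failure instead of losing it
            errors.append((index, exc))

    threads = [threading.Thread(target=worker, args=(i, s, e))
               for i, (s, e) in enumerate(ranges)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every part before merging
    if errors:
        raise RuntimeError("failed parts: %r" % sorted(i for i, _ in errors))
    return b"".join(results)


# Stand-in for the network: slice an in-memory "file" instead of sending HTTP requests.
fake_file = bytes(bytearray(range(50)))


def fake_fetch(start, end):
    return fake_file[start:end + 1]


parts = [(0, 11), (12, 23), (24, 35), (36, 47), (48, 49)]
print(download_in_parallel(fake_fetch, parts) == fake_file)  # True
```

Each thread writes into its own slot of the result list, so the parts come out in order regardless of which thread finishes first, and a failed part raises an error instead of leaving the caller waiting.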

You may ask: what's so special about this? Many download accelerators employ this method. But wait a minute: none of them employ the exact method I have described above. Indeed, well-known programs such as axel, wget and others failed to download the file when this constraint was in place. So I guess my idea is a bit better under these circumstances :).

Here is the code for my idea in Python:

import urllib2,sys,thread,time,tempfile,os
data=[]
def partial_download(url, st, en,idv):
 global data
# print 'Thread:',str(idv),' for ',str(en-st+1),'bytes'
 req = urllib2.Request(url)
 req.headers["Range"]='bytes='+str(st)+'-'+str(en)
 f = urllib2.urlopen(req)
 fd = tempfile.NamedTemporaryFile(delete=False)
 # Copy the response to the temp file in 64 KB pieces instead of buffering it all in memory
 while 1:
  stt = f.read(65536)
  if not stt:
   break
  fd.write(stt)
 fd.close()
 data.append([idv,fd])
 print 'Thread:',str(idv),'finished getting ',str(en-st+1),'bytes to',fd.name

if len(sys.argv)<3:
 print 'Format:[url] [parallel_download_count]'
 sys.exit()
# Number of parts to request in parallel
parallel_download_count = int(sys.argv[2])

proxy = urllib2.ProxyHandler({'http': 'http://172.30.0.19:3128'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
link = sys.argv[1]
file_name = link.split("/")[-1]
print file_name
#print link
req = urllib2.Request(link)
#first we need to know the content-length..

req.headers['Range'] = 'bytes=0-0'
f = urllib2.urlopen(req)
meta =f.info()
content_length = int(meta["Content-Range"].split('/')[1])
print 'File-size:',content_length

# Integer division; keep at least 1 byte per part so the loop below always advances
chunk_size = max(1, content_length // parallel_download_count)

curr_count = 0
idc = 0
while curr_count+chunk_size<=content_length:
 thread.start_new_thread(partial_download, (link,curr_count, curr_count+chunk_size-1,idc))
 idc+=1
 curr_count += chunk_size

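# Fetch the leftover bytes, if any. When content_length divides evenly nothing is left,
# and requesting an empty range would fail and leave the wait loop below stuck forever.
if curr_count < content_length:
 thread.start_new_thread(partial_download,(link,curr_count,content_length-1,idc))
 idc+=1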

# Wait until every thread has appended its part to data
while len(data)<idc:
 time.sleep(1)

print 'Merging into single file...'
data.sort()
#file_type = meta['Content-Type'].split('/')[1]
# Binary mode, so the bytes are written unchanged on every platform
fd = open(file_name, 'wb')

for chunk in data:
 tmp_fd = open(chunk[1].name,'rb')
 tmps = tmp_fd.read()
 fd.write(tmps)
 print 'Wrote',len(tmps),'bytes!'
 tmp_fd.close()
 os.unlink(chunk[1].name)
fd.close()

#print 'Length:',meta.getheaders("Content-Length")[0]