## Tuesday, 11 February 2014

### How I downloaded large files through a proxy server that imposed a file-size limit

In this post, I describe a method I discovered for downloading large files through a proxy server that imposes a file-size limit. Intuitively, what do you think I am going to do? I will download the file in parts. But how?

At first, I tested how the proxy server actually detects such downloads. I found that it merely checks the length field of the incoming response (presumably the Content-Length header) and returns an error if the size exceeds the maximum allowed value.

Then I went through the HTTP request protocol. At some stage I came to know that there is a special request header called Range. With it we can ask for a specific range of bytes in the file by giving zero-based start and end offsets, both inclusive.

For example, suppose you want to download from a file of, say, 50 bytes (that's tiny nowadays), and you only want bytes 34 to 43. Then the HTTP request looks as follows:

GET /file_name.extension HTTP/1.1
....
....
Range: bytes=33-42
...
...

It starts from 33 because offsets are zero-based: the first byte is byte 0, so the 34th byte sits at offset 33. Now I know how to download a file in parts. But another question remains: what is the size of the entire file, and how do I find it out?
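The off-by-one is easy to get wrong, so here is a minimal Python 3 sketch of the conversion. The helper name `range_header` is made up for illustration and is not part of any library; it assumes you count bytes starting from 1, as in the example above, and want the zero-based, inclusive offsets that the Range header expects.

```python
def range_header(first_byte, last_byte):
    """Build a Range header value from 1-based, inclusive byte positions."""
    if first_byte < 1 or last_byte < first_byte:
        raise ValueError("need 1 <= first_byte <= last_byte")
    # HTTP byte offsets are zero-based and the end offset is inclusive.
    return "bytes=%d-%d" % (first_byte - 1, last_byte - 1)


print(range_header(34, 43))  # bytes=33-42
```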

This problem was solved when I looked at the response from the server. For instance, the response to the above request looks as follows:

HTTP/1.1 206 Partial Content
....
....
Content-Range: bytes 33-42/50
Content-Length: 10
...
...

I think you should be able to figure it out from the above response. The total length of the file is sent in the Content-Range field after the "/". So I first request only 1 byte of data, which gives me the length of the file, and then proceed to download it in parts.
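Pulling the total size out of that field could look roughly like this in Python 3. The function name `total_size` is hypothetical, and the handling of `*` assumes the server may report an unknown total length, which the HTTP range specification allows.

```python
def total_size(content_range):
    """Return the total file size from a Content-Range value, or None if unknown."""
    unit, _, rest = content_range.strip().partition(" ")
    if unit != "bytes" or "/" not in rest:
        raise ValueError("unexpected Content-Range: %r" % content_range)
    total = rest.rsplit("/", 1)[1].strip()
    # A server may send "*" when it does not know the complete length.
    return None if total == "*" else int(total)


print(total_size("bytes 33-42/50"))  # 50
```

In practice the value would come from the response to a one-byte request such as `Range: bytes=0-0`.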

With the idea ready, it's just a matter of coding it. I used urllib2 from Python, since I was too lazy to code it in C.

As an extension to this, I used threads, which increased the speed to a large extent; it felt as if I were downloading from my Local Area Network rather than the Internet.
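To make the threaded idea concrete, here is a rough Python 3 sketch. It is not the script from this post: it uses `urllib.request` (the Python 3 successor of urllib2) and a thread pool, joins the chunks in memory instead of using temporary files, and relies on whatever proxy settings urllib picks up by default. All function names, the chunk size of 16 and the worker count are assumptions chosen for illustration. The demonstration at the bottom swaps the network call for a stand-in that slices an in-memory "file", so it runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def chunk_ranges(total, chunk_size):
    """Split [0, total) into zero-based, inclusive (start, end) byte ranges."""
    return [(start, min(start + chunk_size, total) - 1)
            for start in range(0, total, chunk_size)]


def fetch_range(url, start, end):
    """Fetch bytes start..end of url with a Range request (needs network access)."""
    req = urllib.request.Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def download(url, total, chunk_size, fetch=fetch_range, workers=4):
    """Download all chunks concurrently and join them in order."""
    ranges = chunk_ranges(total, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order of `ranges`, so no sorting is needed.
        parts = pool.map(lambda r: fetch(url, r[0], r[1]), ranges)
        return b"".join(parts)


# Offline demonstration: a stand-in fetch that slices an in-memory "file".
payload = bytes(range(50))
fake_fetch = lambda url, start, end: payload[start:end + 1]
assert download("http://example.invalid/file", len(payload), 16, fetch=fake_fetch) == payload
print(chunk_ranges(50, 16))  # [(0, 15), (16, 31), (32, 47), (48, 49)]
```

Each chunk stays below the chosen size, so every individual response should pass a proxy that only looks at the length of a single response.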

You may ask what's so special about this, since many download accelerators employ a similar method. But wait a minute: none of them employs exactly the method I have described above. Indeed, well-known programs such as axel, wget and others failed to download the file when such a constraint was in place. So I guess my idea is a bit better in these circumstances :).

Here is the code for my idea in Python:

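import sys
import tempfile
import threading
import time
import urllib2

data = []  # filled by the worker threads with [chunk index, temp file] pairs

# The original `def` line, the Range header and the read call were lost;
# the name download_chunk and its parameters are a reconstruction.
def download_chunk(idv, url, start, end):
    global data
    req = urllib2.Request(url)
    req.add_header('Range', 'bytes=%d-%d' % (start, end))
    f = urllib2.urlopen(req)
    fd = tempfile.NamedTemporaryFile(delete=False)
    resp = ''
    while 1:
        stt = f.read(8192)
        if not stt:
            break
        resp += stt
    fd.write(resp)
    fd.close()
    data.append([idv, fd])

if len(sys.argv) < 3:
    sys.exit()

# Assumed argument order and chunk size; the original assignments were lost.
url = sys.argv[1]
file_name = sys.argv[2]
chunk_size = 1024 * 1024

proxy = urllib2.ProxyHandler({'http': 'http://172.30.0.19:3128'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
print file_name
# First we need to know the content length, so request a single byte.
req = urllib2.Request(url)
req.add_header('Range', 'bytes=0-0')
f = urllib2.urlopen(req)
meta = f.info()
content_length = int(meta["Content-Range"].split('/')[1])
print 'File-size:', content_length

curr_count = 0
idc = 0
# Start one thread per full chunk.
while curr_count + chunk_size <= content_length:
    threading.Thread(target=download_chunk,
                     args=(idc, url, curr_count, curr_count + chunk_size - 1)).start()
    idc += 1
    curr_count += chunk_size

# Fetch the remaining bytes when the size is not a multiple of chunk_size.
if curr_count < content_length:
    threading.Thread(target=download_chunk,
                     args=(idc, url, curr_count, content_length - 1)).start()
    idc += 1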

# Wait until every chunk has been downloaded.
while len(data) < idc:
    time.sleep(1)

print 'Merging into single file...'
data.sort()
#file_type = meta['Content-Type'].split('/')[1]
# Open in binary mode so the bytes are written unchanged on every platform.
fd = open(file_name, 'wb')

for chunk in data:
    tmp_fd = open(chunk[1].name, 'rb')