Administrators often impose limits on file download speed. This reduces load on the network, but it is very annoying for users, especially when you need to download a large file (1 GB or more) and the speed hovers around 1 megabit per second (125 kilobytes per second). At that rate the download will take about 8192 seconds (2 hours 16 minutes 32 seconds), even though your bandwidth allows transfers at up to 16 Mbit/s (2 MB per second), at which the same file would take 512 seconds (8 minutes 32 seconds).
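
The arithmetic above is easy to reproduce. A quick check of my own (not from the original write-up), using binary prefixes (1 GB = 1024 MB) so the numbers match:

```python
def download_seconds(size_gib, rate_mbit):
    """Seconds to transfer size_gib gibibytes at rate_mbit mebibits per second."""
    size_bits = size_gib * 1024 ** 3 * 8  # gibibytes -> bits
    rate_bits = rate_mbit * 1024 ** 2     # mebibits/s -> bits/s
    return size_bits / rate_bits

print(download_seconds(1, 1))   # 8192.0 seconds (2 h 16 min 32 s)
print(download_seconds(1, 16))  # 512.0 seconds  (8 min 32 s)
```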

These figures were not picked at random: at first I used only a 4G internet connection for such downloads.

Use case:

The utility I wrote (listed below) works well only if:

  • You know in advance that your bandwidth is higher than the download speed
  • Even large pages of the site load quickly (the first sign of an artificially lowered speed)
  • You are not using a slow proxy or VPN
  • You have a good ping to the site

What are these limits for?

  • offloading the backend and the serving of static files
  • DDoS protection

How is this throttling implemented?

Nginx

location /static/ {
   ...
   limit_rate 50k;  # 50 kilobytes per second for a single connection
   ...
}

location /videos/ {
   ...
   limit_rate 500k;       # 500 kilobytes per second for a single connection
   limit_rate_after 10m;  # the first 10 megabytes are sent at full speed,
                          # after that the 500 KB/s limit kicks in
   ...
}
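
As a hedged illustration of my own (not part of the original config): with `limit_rate_after`, the first N bytes travel at full link speed and only the remainder is throttled, so the total transfer time can be estimated as:

```python
def transfer_seconds(size, after, full_bps, limited_bps):
    """Estimated seconds to send `size` bytes when the first `after`
    bytes are unthrottled (full_bps) and the rest is capped (limited_bps)."""
    fast = min(size, after)
    slow = max(0, size - after)
    return fast / full_bps + slow / limited_bps

# 60 MiB file, limit_rate_after 10m, a 2 MiB/s link, limit_rate 512k
print(transfer_seconds(60 * 2**20, 10 * 2**20, 2 * 2**20, 512 * 2**10))  # 105.0
```

Note that per-connection limits like these are exactly why splitting one download across several connections helps: each connection gets its own allowance.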

A quirk with ZIP files

An interesting feature turned up when downloading a file with the .zip extension: each individual part lets you partially inspect the files in the archive. Although most archivers will report that the file is broken and invalid, some of the contents and file names are still displayed.

Code walkthrough:

To build this program we need Python and the asyncio, aiohttp and aiofiles libraries. All code is asynchronous to increase performance and minimize overhead in terms of memory and speed. Running on threads or processes is also possible, but when downloading a large file errors may occur when a new thread or process cannot be created.

async def get_content_length(url):
    async with aiohttp.ClientSession() as session:
        async with session.head(url) as request:
            return request.content_length

This function returns the length of the file. The request itself uses HEAD instead of GET, which means we receive only the headers, without the body (the content at the given URL).
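
This technique also relies on the server honoring Range requests, which a HEAD response normally advertises with an Accept-Ranges header. A small helper of my own (not in the original code) to sanity-check the headers before splitting the download; `can_split_download` and the plain-dict lookup are illustrative (aiohttp's `request.headers` is case-insensitive):

```python
def can_split_download(headers):
    """Return True if response headers suggest a ranged download is possible:
    a known Content-Length and declared byte-range support."""
    accepts = headers.get('Accept-Ranges', '').lower()
    has_length = 'Content-Length' in headers
    return has_length and accepts == 'bytes'

print(can_split_download({'Accept-Ranges': 'bytes', 'Content-Length': '1024'}))  # True
print(can_split_download({'Accept-Ranges': 'none', 'Content-Length': '1024'}))   # False
```

Some servers support ranges without sending Accept-Ranges at all, so a failed check is a hint rather than proof.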

def parts_generator(size, start=0, part_size=10 * 1024 ** 2):
    while size - start > part_size:
        yield start, start + part_size
        start += part_size
    yield start, size

This generator yields the byte ranges to download. An important point is to pick a part_size that is a multiple of 1024 so the ranges line up with whole megabytes, although it seems almost any number will do. It does not work correctly with part_size = 1, so I settled on a default of 10 MB per part.
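
For example (the generator is repeated here so the snippet is self-contained), a 25 MiB file splits into two full 10 MiB parts plus a 5 MiB tail:

```python
def parts_generator(size, start=0, part_size=10 * 1024 ** 2):
    # repeated from the article for a self-contained demo
    while size - start > part_size:
        yield start, start + part_size
        start += part_size
    yield start, size

ranges = list(parts_generator(25 * 1024 ** 2))
print(ranges)
# [(0, 10485760), (10485760, 20971520), (20971520, 26214400)]
```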

async def download(url, headers, save_path):
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as request:
            async with aiofiles.open(save_path, 'wb') as file:
                await file.write(await request.content.read())

One of the main functions is the file download itself. It works asynchronously; asynchronous file access is needed here to speed up disk writes by not blocking on input/output operations.
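
One caveat: request.content.read() buffers the whole part in memory. aiohttp can stream the body in chunks instead via request.content.iter_chunked(). A sketch of my own of a chunked writer that accepts any async byte iterator (the fake_body generator stands in for a network stream; a synchronous open is used so the demo has no external dependencies, but aiofiles works the same way):

```python
import asyncio

CHUNK_SIZE = 64 * 1024

async def write_stream(chunks, save_path):
    # `chunks` is any async iterator of bytes; with aiohttp you would pass
    # request.content.iter_chunked(CHUNK_SIZE) here.
    with open(save_path, 'wb') as file:
        async for chunk in chunks:
            file.write(chunk)

async def demo():
    async def fake_body():  # stand-in for a streamed response body
        for piece in (b'hello ', b'world'):
            yield piece
    await write_stream(fake_body(), 'demo.part0')

asyncio.run(demo())
print(open('demo.part0', 'rb').read())  # b'hello world'
```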

async def process(url):
    filename = os.path.basename(urlparse(url).path)
    tmp_dir = TemporaryDirectory(prefix=filename, dir=os.path.abspath('.'))
    size = await get_content_length(url)
    tasks = []
    file_parts = []
    for number, sizes in enumerate(parts_generator(size)):
        part_file_name = os.path.join(tmp_dir.name, f'{filename}.part{number}')
        file_parts.append(part_file_name)
        tasks.append(download(url, {'Range': f'bytes={sizes[0]}-{sizes[1] - 1}'}, part_file_name))
    await asyncio.gather(*tasks)
    with open(filename, 'wb') as wfd:
        for f in file_parts:
            with open(f, 'rb') as fd:
                shutil.copyfileobj(fd, wfd)

The central function: it takes the filename from the URL, turns it into numbered .part files, creates a temporary directory next to the target file, and downloads all parts into it. await asyncio.gather(*tasks) executes all the collected coroutines concurrently, which significantly speeds up the download. After that, the plain synchronous shutil.copyfileobj concatenates all the parts into a single file. (Note that HTTP Range headers are inclusive on both ends, so the upper bound is sizes[1] - 1; otherwise adjacent parts would share a byte.)
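
The merge step can be sketched in isolation, together with the extra cleanup suggested in the improvements below: deleting each part right after it is copied roughly halves peak disk usage. The `merge_parts` name and the demo files are my own, illustrative additions:

```python
import os
import shutil

def merge_parts(file_parts, target):
    with open(target, 'wb') as wfd:
        for part in file_parts:
            with open(part, 'rb') as fd:
                shutil.copyfileobj(fd, wfd)
            os.remove(part)  # free the part's disk space immediately

# demo with two tiny fake parts
for i, data in enumerate((b'AAA', b'BBB')):
    with open(f'demo.part{i}', 'wb') as f:
        f.write(data)

merge_parts(['demo.part0', 'demo.part1'], 'demo.bin')
print(open('demo.bin', 'rb').read())  # b'AAABBB'
```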

async def main():
    if len(sys.argv) <= 1:
        print('Add URLS')
        exit(1)
    urls = sys.argv[1:]
    await asyncio.gather(*[process(url) for url in urls])

The main function receives a list of URLs from the command line and, using the already familiar asyncio.gather, starts downloading several files at once.

Benchmark:

For the benchmark I downloaded a Gentoo Linux image from the site of a university I came across (a slow server):

async: 164.682 seconds
sync: 453.545 seconds

Downloading a DietPi distribution (fast server):

async: 17.106 seconds best time, 20.056 seconds worst time
sync: 15.897 seconds best time, 25.832 seconds worst time

As you can see, the result is almost a 3x speedup. On some files the gain reached 20-30x.

Possible improvements:

  • More reliable downloads. If an error occurs, restart the download of that part.
  • Disk space optimization. One issue is a 2x increase in disk space usage (once all parts are downloaded, they are copied into a new file, but the temporary directory has not yet been deleted). This is easily fixed by deleting each part right after its contents are copied.
  • Some servers track the number of connections and can cut off such a load; this calls for pauses or a much larger part size.
  • Adding a progress bar.
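
The first improvement can be sketched as a generic retry wrapper around any part-download coroutine. This is my own sketch; `with_retries` and the flaky demo task are illustrative names, not part of the original code:

```python
import asyncio

async def with_retries(coro_factory, attempts=3, delay=1.0):
    """Re-run coro_factory() until it succeeds or attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return await coro_factory()
        except Exception:
            if attempt == attempts:
                raise
            await asyncio.sleep(delay)

# demo: a task that fails twice, then succeeds
calls = {'n': 0}

async def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('boom')
    return 'ok'

result = asyncio.run(with_retries(flaky, attempts=5, delay=0))
print(result)  # ok
```

In process() this could wrap each part, e.g. tasks.append(with_retries(lambda: download(url, headers, part_file_name))).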


In conclusion: asynchronous downloading is a way out, but unfortunately not a silver bullet for the file download problem.

import asyncio
import os.path
import shutil

import aiofiles
import aiohttp
from tempfile import TemporaryDirectory
import sys
from urllib.parse import urlparse


async def get_content_length(url):
    async with aiohttp.ClientSession() as session:
        async with session.head(url) as request:
            return request.content_length


def parts_generator(size, start=0, part_size=10 * 1024 ** 2):
    while size - start > part_size:
        yield start, start + part_size
        start += part_size
    yield start, size


async def download(url, headers, save_path):
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as request:
            async with aiofiles.open(save_path, 'wb') as file:
                await file.write(await request.content.read())


async def process(url):
    filename = os.path.basename(urlparse(url).path)
    tmp_dir = TemporaryDirectory(prefix=filename, dir=os.path.abspath('.'))
    size = await get_content_length(url)
    tasks = []
    file_parts = []
    for number, sizes in enumerate(parts_generator(size)):
        part_file_name = os.path.join(tmp_dir.name, f'{filename}.part{number}')
        file_parts.append(part_file_name)
        tasks.append(download(url, {'Range': f'bytes={sizes[0]}-{sizes[1] - 1}'}, part_file_name))
    await asyncio.gather(*tasks)
    with open(filename, 'wb') as wfd:
        for f in file_parts:
            with open(f, 'rb') as fd:
                shutil.copyfileobj(fd, wfd)


async def main():
    if len(sys.argv) <= 1:
        print('Add URLS')
        exit(1)
    urls = sys.argv[1:]
    await asyncio.gather(*[process(url) for url in urls])


if __name__ == '__main__':
    import time

    start_code = time.monotonic()
    asyncio.run(main())
    print(f'{time.monotonic() - start_code} seconds!')

Author

Senior Python Developer.
