Multipart layer fetch #10177
base: main
Conversation
Hi @azr. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 2531c18 to 0748b0f
@azr Thanks for the PR, this looks promising. I wonder if you were able to get any memory usage data from your tests? A previous effort to use the ECR containerd resolver, which has a similar multipart layer download, showed that it can take up a disproportionate amount of memory, especially when we increase the number of parallel chunks (without providing a significant benefit to latency). The high memory utilization was mainly from buffering the in-flight chunks in memory.

Also, can you share some information about your test image? Number of layers? Size of individual layers?
/ok-to-test
Hey @swagatbora90, of course! The theory in my mind is that this should use, in the worst case, roughly `parallelism × chunk_size` bytes of memory.

I think memory usage would be better if we were to write in parallel directly into a file at different positions, with 'holes', and sort of tell our progress to the checksummer with no-op writers that report where we are, etc. (Downloading was actually much faster this way in a test program I wrote, but it was not doing any unpacking, etc.)

I also think it could be nice to have a per-registry parallelism setting, because not all registries are S3-backed, and docker.io seems to throttle things at 60 MB/s.

Topology of images:

~8GB image (dive infos, details collapsed)

~27GB image (dive infos, details collapsed)

Here are memory usages, which I'm recording periodically:

~27GB image pull, max_concurrent_downloads: 2, 0 parallelism (before; details collapsed)

~27GB image pull, max_concurrent_downloads: 2, 110 parallelism, 32mb chunks (details collapsed)

GC traces:

8GB image with `GODEBUG=gctrace=1`, parallelism set to 110 and chunk size set to 32 (details collapsed)

8GB image with `GODEBUG=gctrace=1`, parallelism set to 0 (existing code; details collapsed)

gRPC tracing screenshots from the same run (8GB image with `GODEBUG=gctrace=1`, parallelism set to 110 and chunk size set to 32): (screenshots omitted)

Screenshot from another run for a ~27GB image: after a while, all chunks seem to take the same amount of time, ~22s; we've probably reached the write-speed burst limit, and are slowly taking more time to do things.
Force-pushed from c13969f to 8fc47db
fetch big layers of images using more than one connection

Signed-off-by: Adrien Delorme <azr@users.noreply.github.com>
@azr Thanks for adding the performance numbers. I ran some tests as well using your patch, and the memory usage looks better than what I saw in the htcat implementation, especially with a high parallelism count. However, I do observe that increasing parallelism does not yield better latency and may lead to higher memory usage (there are a number of other factors to consider here, mainly the type of instance used for testing and network bandwidth). I tried to limit the test to a single image with a single layer, fixing the chunk size to 20 MB. A lower parallelism count (3 or 4) may be preferable to setting parallelism upwards of 10. I used a c7.12xlarge instance to pull a 3GB single-layer image from an ECR private repo.

Also, the network download time was much faster (see Network Pull time, ~15sec), while containerd took an additional ~20secs to complete the pull (before it started unpacking). I calculated the network download time by periodically calling …
Thanks @azr, the numbers on this look good. @swagatbora90, super helpful stats. We should continue to respect `max_concurrent_downloads`. For configuration, let's use the transfer service configuration. This won't make it in for 2.0, and CRI will be switching to the transfer service by 2.1.
Force-pushed from 8fc47db to 504bd15
Hey @dmcgowan! Nice, thanks. I have options in mind, and have to think about/test a good way to do this. I might have to introduce a …
TL;DR: this makes pulls of big images ~2x faster, and closes #9922. Questions first, explanation second, metrics third, observations last.
cc: #8160, #4989
Hello containerd people! I have this draft PR I would like to get your eyes on. I have two (and a half) questions:
It basically makes pulls faster while also trying to avoid a big memory impact, by getting consecutive chunks of the layers and immediately pushing them into the pipe (the one that writes to a file plus that signature-checksum thing).
I noticed it made pulls ~2x faster, when using the correct settings.
The settings have a big impact, so I ran a bunch of perf tests with different settings. Here are some results on a ~8GB image using a r6id.4xlarge instance, pulling it from S3. Gains are somewhat similar on a ~27GB and a ~100GB image (with a tiny bit of slowdown).
I also tried on NVMe and EBS drives; they are of course slower, but the gains are still the same.
Metrics on a r6id.4xlarge, timing `crictl pull` of an 8.6GB image.

The first run, with 13 tries, is with 0 parallelism; it's the current code. The rest are tries with different settings.
tmpfs tests:
nvme (885GB) tests:
Observations: I wrote a little Go program to multipart-download big files directly into a file at different positions with different requests, and that was much faster than piping single-threadedly into a file. Containerd pipes into a checksummer and then pipes into a file. I think that this can, under some conditions, create some sort of thrashing, which is why the parameters are very important here.
That simple Go program had pretty bad performance with one connection, but I was able to saturate the network with multiple connections, with performance better than or on par with aws-crt.
I think that for maximum performance, we could try to re-architect things a bit: e.g. concurrently write directly into the temp file, then tell the checksummer our progress so that it can run in parallel, and then carry on as usual.