Unable to get COLMAP MVS above ~25% GPU power usage #2536

Open

Parskatt opened this issue Apr 24, 2024 · 15 comments

@Parskatt

Parskatt commented Apr 24, 2024

Issue

Running patch_match_stereo on an A100 GPU, the reported GPU "util" goes to 100%, but the power usage is extremely low; see the graph below:

[Graph: GPU utilization vs. power usage during patch_match_stereo]

This seems to indicate that there are some CUDA kernels in patch match stereo that take a lot of wall-clock time but do not use the GPU cores effectively.
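(In case anyone wants to reproduce the measurement without cluster monitoring, something like the following should show the same utilization/power gap while patch_match_stereo is running:)

nvidia-smi --query-gpu=utilization.gpu,power.draw,power.limit --format=csv -l 1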

Fixes I've tried so far

  • Running multiple MVS processes in parallel on the same GPU (the graph above is from such a run). However, I think the reported 100% utilization basically hard-limits the performance.
  • Moving the data to fast RAM in case it was an I/O issue; this made basically no difference.

I'm wondering if this is reproducible across systems. Is there any fix?

System

Docker container with: docker://colmap/colmap:20231001.8

@Parskatt
Author

It's fine for me that it's a bit inefficient, but the cluster I'm running on automatically kills jobs that go below 25% power, which is quite frustrating, so I'd like to fix this.

@Parskatt
Author

I got things to run faster by upping THREADS_PER_BLOCK from 32 -> 96 (64 also works fine) on my fork. This gave about a 2x speedup for me. However, it's still only using about 30% power, so it feels like there are major bottlenecks left. What is the likely culprit here, @ahojnnes? I can make changes in my fork and report back if you have any hunches.
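The change itself is just the block-size constant near the top of src/colmap/mvs/patch_match_cuda.cu; roughly this (sketched from memory, the exact declaration form in the file may differ):

// src/colmap/mvs/patch_match_cuda.cu (sketch, not a verbatim diff)
#define THREADS_PER_BLOCK 96  // upstream value is 32; 64 also ran for me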

@ahojnnes
Contributor

The algorithm imposes that each row or column of an image uses one CUDA thread. Depending on the size of your images, there will be many more cores than can be occupied by the image. You may be able to get more out of your GPU by simply listing the same GPU index of your A100 multiple times (though I have not tried this myself yet).
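Concretely, that would be something like the following, with placeholder paths and GPU 0 listed four times:

colmap patch_match_stereo \
    --workspace_path /path/to/dense \
    --PatchMatchStereo.gpu_index 0,0,0,0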

@Parskatt
Author

@ahojnnes would this be similar to what I said about running multiple processes in parallel on the same GPU? I had almost no success speeding things up with that approach. I will try my luck profiling the code to see what is taking the most time.

@ahojnnes
Contributor

If you tried that, it should be mostly the same as my suggestion above. I am not too familiar with the latest GPU architectures. The A100 does have specialized tensor cores for matrix multiplication that cannot be leveraged for patch match stereo, so this may explain the behavior you see.

@Parskatt
Author

Parskatt commented Apr 25, 2024

[Screenshot: profiler timeline showing SweepFromTopToBottom accounting for essentially all of the kernel time]
Since SweepFromTopToBottom takes up basically the entire computation time, it's difficult to tell, haha. Perhaps I can split up the kernel for debugging purposes?

Actually, I used the wrong profiler; apparently the one to use is Nsight Compute (not Nsight Systems). I'll try running the compute version and see if I can get a more detailed report. That seems to be the way to do it.
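Something along these lines, assuming the colmap binary can be run directly under the profiler inside the container (the kernel-name filter and paths are placeholders on my part):

ncu --set full --kernel-name regex:Sweep -o patch_match_report \
    colmap patch_match_stereo --workspace_path /path/to/dense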

EDIT2: I think this thread might reveal how stupid I am; I really don't know how to code.

@Parskatt
Author

The algorithm imposes that each row or column of an image uses one CUDA thread. Depending on the size of your images, there will be many more cores than can be occupied by the image. You may be able to get more out of your GPU by simply listing the same GPU index of your A100 multiple times (though I have not tried this myself yet).

Basically this seems like the culprit. From talking to people who know more about CUDA than I do: even if a process doesn't use all the threads of the GPU, the GPU is unavailable to additional processes (unless you use CUDA streams, which are apparently complex).


@Parskatt
Author

@ahojnnes I'm not sure I understand why THREADS_PER_BLOCK (https://github.com/colmap/colmap/blob/main/src/colmap/mvs/patch_match_cuda.cu#L45) has to be exactly 32. It is used in a lot of different places, but there doesn't seem to be any explicit spot that would break for other values, so why does it break in practice?

@ahojnnes
Contributor

The algorithm imposes that each row or column of an image uses one CUDA thread. Depending on the size of your images, there will be many more cores than can be occupied by the image. You may be able to get more out of your GPU by simply listing the same GPU index of your A100 multiple times (though I have not tried this myself yet).

Basically this seems like the culprit. From talking to people who know more about CUDA than I do: even if a process doesn't use all the threads of the GPU, the GPU is unavailable to additional processes (unless you use CUDA streams, which are apparently complex).


Yes, this is definitely correct for older architectures. As I said, I don't know the latest GPU architectures and whether something has changed in the meantime.

The choice of threads per block is related to the warp size, which you can read up on if you are interested. I'd be surprised if changing this value improved runtime performance.
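As a generic illustration (this is a self-contained toy, not COLMAP's kernel): code written under the assumption that one block is exactly one 32-thread warp, e.g. using warp-level shuffles, still launches and runs with a larger block but silently computes the wrong thing:

// Toy CUDA example, NOT taken from patch_match_cuda.cu: a "block sum"
// written under the assumption that blockDim.x == warpSize == 32.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void BlockSumAssumingOneWarp(const float* data, float* result) {
  float value = data[threadIdx.x];
  // __shfl_down_sync only exchanges values within a single 32-thread warp.
  // With 32 threads per block this reduces the whole block; with 64 or 96
  // threads per block, the warps never communicate, so the value written
  // below is only the first warp's partial sum.
  for (int offset = 16; offset > 0; offset /= 2) {
    value += __shfl_down_sync(0xffffffffu, value, offset);
  }
  if (threadIdx.x == 0) {
    *result = value;
  }
}

int main() {
  const int kThreads = 32;  // change to 64 or 96 to see the silent breakage
  float host_data[96];
  for (int i = 0; i < 96; ++i) host_data[i] = 1.0f;
  float* dev_data = nullptr;
  float* dev_result = nullptr;
  cudaMalloc(&dev_data, sizeof(host_data));
  cudaMalloc(&dev_result, sizeof(float));
  cudaMemcpy(dev_data, host_data, sizeof(host_data), cudaMemcpyHostToDevice);
  BlockSumAssumingOneWarp<<<1, kThreads>>>(dev_data, dev_result);
  float sum = 0.0f;
  cudaMemcpy(&sum, dev_result, sizeof(float), cudaMemcpyDeviceToHost);
  printf("block sum = %.0f (expected %d)\n", sum, kThreads);  // prints 32 even with 64/96 threads
  cudaFree(dev_data);
  cudaFree(dev_result);
  return 0;
}

If the sweep kernels rely on assumptions like this anywhere (implicit warp synchronization, per-warp indexing, shared-memory sizing), raising THREADS_PER_BLOCK changes what they compute rather than just their occupancy.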

@Parskatt
Author

@ahojnnes thanks. I'll report back if I'm able to figure something out.

@jytime

jytime commented Apr 30, 2024

I can verify @Parskatt's observation: changing THREADS_PER_BLOCK (and kMaxPatchMatchWindowRadius) from 32 to 96 speeds it up by ~2.8x on one A100 GPU (although I don't know the root cause or how it affects the output quality).

@Parskatt
Author

I can verify @Parskatt's observation: changing THREADS_PER_BLOCK (and kMaxPatchMatchWindowRadius) from 32 to 96 speeds it up by ~2.8x on one A100 GPU (although I don't know the root cause or how it affects the output quality).

Can you verify that you actually get correct results? I found that the ~3x speedup seemingly comes from the reconstruction failing completely (pure noise output). I'm still trying to understand why.

@jytime

jytime commented Apr 30, 2024

@Parskatt after a double check, it seems the results are incorrect: if I run pycolmap.stereo_fusion, it returns:

W20240430 14:41:27.907872 1927978 fusion.cc:335] Could not fuse any points. This is likely caused by incorrect settings - filtering must be enabled for the last call to patch match stereo.
I20240430 14:41:27.908705 1927978 fusion.cc:341] Number of fused points: 0
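(For reference, the equivalent CLI check, with placeholder paths, would be something like:)

colmap stereo_fusion \
    --workspace_path /path/to/dense \
    --output_path /path/to/dense/fused.ply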

@Parskatt
Author

Yeah, it breaks it, but I don't get why. I started looking at some other stuff in the meantime :D

@jytime

jytime commented Apr 30, 2024

True. I am "encouraging" some people to build something like stereoanything; hope they make it fast enough lol
