only 1 GPU found -- regression 1.32 -> 1.33 #4139
Can you share more of the server log, ideally with OLLAMA_DEBUG=1 set, so we can see the early bootstrapping GPU discovery logic?
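For reference, one way to capture such a log from a manually started server (a sketch; if Ollama runs as a systemd service, the variable has to be set in the service environment instead):

```sh
# Start the server with debug logging enabled and capture the output.
OLLAMA_DEBUG=1 ollama serve 2>&1 | tee ollama-debug.log
```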
These are logs that I store automatically, so they don't have OLLAMA_DEBUG set. It's late here, so if these logs aren't helpful I'll need to rerun with DEBUG tomorrow. Attached: Ollama 1.32 log, Ollama 1.33 log. Thanks for your help!
From the logs I can see that we did discover all 3 GPUs. Unfortunately, without debug set I can't see why the scheduler decided to run on only a single GPU with only 3 layers. If you can re-run just 0.1.33 with OLLAMA_DEBUG=1 and share the log, that will help root-cause the defect.
@dhiltgen it seems the log you referred to is from ollama-1.32.log.
Not sure whether the issue comes from timing :) Enabling debug usually means more logging, and more logging usually means the timing changes. One way to confirm this is to run 1.33 without DEBUG enabled.
Based on your 0.1.33 log with debug enabled: it sees all 3 GPUs, the scheduler determined the requested model could fit in a single GPU for best performance, and we can see the backend loaded all the layers. It is possible we have a scheduling race we haven't found/fixed yet, since the scheduler code is brand new. If you manage to reproduce the failure mode of hitting a single GPU with partial offload, share the logs so we can see what the scheduler was doing.
I had the same issue today; rolling back to 1.31 resolved it. I spent the day in the Discord chatting with other users, trying various things without resolution. I was able to raise num_gpu to the amount required, and it will then find and utilize both GPUs (see the sketch below).
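For illustration, num_gpu (the number of layers to offload to GPU) can be raised per request via the API options. A minimal sketch; the model name and layer count here are placeholders, not values from this thread:

```sh
# Ask the server to offload 65 layers; substitute your model and layer count.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "why is the sky blue?",
  "options": { "num_gpu": 65 }
}'
```

The same parameter can also be baked into a model with a Modelfile line such as `PARAMETER num_gpu 65`.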
@thevisad and @JieChenSimon, from what I can tell the system is behaving as expected in your examples. We try NOT to spread a single model over multiple GPUs now, as that actually makes things run slower, not faster, if the model fits within one GPU. We now only spread a model across multiple GPUs if it won't fit in a single GPU. If that's not the behavior you're seeing, can you clarify?
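One way to check which behavior you're actually getting is to watch per-GPU memory while the model is loaded (a sketch using standard nvidia-smi query flags):

```sh
# A model that fits on one GPU should show memory allocated on a single
# device; a spread model shows allocations on several.
watch -n 1 nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
```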
Only one GPU in use after updating to 1.33.
EDIT: Turned out to be user error. My system's administrator for some reason decided to set the CUDA_VISIBLE_DEVICES environment variable.
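For context, CUDA_VISIBLE_DEVICES masks which devices any CUDA application, including Ollama, can see. A sketch of the effect; the device indices are examples:

```sh
# Only GPU 0 is visible to the server; the rest are hidden by CUDA itself.
CUDA_VISIBLE_DEVICES=0 ollama serve

# All three GPUs are visible again.
CUDA_VISIBLE_DEVICES=0,1,2 ollama serve
```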
I'm working on a change that will expose this setting in the logs during startup so it's easier to spot misconfigurations.
Update: my test was incorrect; CUDA_VISIBLE_DEVICES is still working properly.
I have the same problem in Docker: I have 13 GPUs, but it only finds 1.
Inside the docker container:
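A common cause of this in Docker is not passing all devices through to the container; the Ollama docs start the container with --gpus all. A sketch (the volume and port values are the documented defaults):

```sh
# Expose every NVIDIA GPU on the host to the container.
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama

# Or expose only a subset by device index.
docker run -d --gpus '"device=0,1"' -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama
```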
@ToRvaLDz |
I'm sorry, you are right. Thank you.
I'm going to mark this one closed now, as the visible devices env var seems to be working properly. I am working on some improvements in concurrency memory predictions that help when operating at near-max VRAM allocation, which should land in an upcoming release.
What is the issue?
Hi everyone,
Sorry, I don't have much time to write, but going from 1.32 to 1.33, this:
changed into this:
1.33 hammers my CPU cores, is generally slower, and doesn't even properly utilize the one GPU it does find.
I need the new concurrency features, so I'd really appreciate it if 1.33 worked on my machine.
Please help.
OS
Linux
GPU
Nvidia
CPU
AMD
Ollama version
1.33