Improving the efficiency of using multiple GPU cards. #4198
Environment="OLLAMA_NUM_PARALLEL=5" Three people simultaneously used and loaded llama 3, which was very fast. localhost.localdomain Mon May 6 19:12:05 2024 550.54.15 |
Observing this too, see #4212. Horrible performance regression.
Additionally, it would be awesome if it could also load across networked GPUs, as Petals does. This would allow communities with older GPUs to combine their VRAM, especially where multiple computers share a local network, e.g. schools.
Confirmed that downgrading to 0.1.31 resolves this issue for me, as per @zhqfdn's suggestion. It must be the new GPU detection added after 0.1.31. Should it revert to the cudart GPU detection? Here's the comparison in the logs:
0.1.31
0.1.34rc1
I tested 0.1.33 GA and it had the same issue, as per #4212.
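For anyone else who wants to pin the older build while this is investigated, here is a minimal sketch of downgrading with the official install script. It assumes the host was set up via that script and that `OLLAMA_VERSION` is honoured as a version override; adjust if you installed another way.

```sh
# Stop the running service before swapping the binary
sudo systemctl stop ollama

# Re-run the official installer pinned to the last known-good release.
# OLLAMA_VERSION is assumed to be honoured by install.sh; adjust if you
# installed ollama by another method (Docker, manual binary, etc.).
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.1.31 sh

sudo systemctl start ollama
ollama --version   # should now report 0.1.31
```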
@zhqfdn this was an intentional design change. Based on our performance testing, if a model can fit in one GPU, we saw better performance loading it into one GPU instead of unnecessarily splitting it across multiple. I'll put up a PR to support tuning this behavior, but at least based on my initial testing, it doesn't yield a performance benefit. If you can test and validate that you see a performance benefit, that would help justify merging it.
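Until that PR lands, a rough sketch of how such a tunable could be wired in via a systemd drop-in. The variable name `OLLAMA_SCHED_SPREAD` is an assumption here; substitute whatever name the merged PR actually uses.

```sh
# Sketch: opt back into spreading a model across all visible GPUs via a
# systemd drop-in. OLLAMA_SCHED_SPREAD is an assumed name, pending the PR.
sudo mkdir -p /etc/systemd/system/ollama.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/ollama.service.d/spread.conf
[Service]
Environment="OLLAMA_SCHED_SPREAD=1"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

Whether spreading wins likely depends on the workload: a single request gains little from splitting a model that already fits on one T4, but several concurrent requests may benefit from keeping the other three cards busy.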
Before v0.1.32, a loaded model was evenly distributed across all GPU cards, which made better use of them. In v0.1.32 and v0.1.33, loading a model automatically uses only one card.
With multiple users working simultaneously, this is slower. Distributing the model evenly across multiple GPU cards would improve their utilization and overall efficiency.
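To put numbers on the multi-user case, here is a rough sketch that fires several concurrent generations at the server and times the batch. The model name and prompt are placeholders, and it assumes `OLLAMA_NUM_PARALLEL` is at least as large as the request count.

```sh
#!/usr/bin/env bash
# Rough concurrency benchmark: N simultaneous generations against one host.
# Run it on 0.1.31 (model split across GPUs) and on 0.1.33 (single GPU) and
# compare the wall-clock time.
HOST=${HOST:-http://localhost:11434}
N=${N:-4}               # number of simultaneous "users"
MODEL=${MODEL:-llama3}  # placeholder model name

start=$(date +%s)
for _ in $(seq "$N"); do
  curl -s "$HOST/api/generate" \
       -d "{\"model\": \"$MODEL\", \"prompt\": \"Write 200 words about GPUs.\", \"stream\": false}" \
       -o /dev/null &
done
wait
echo "$N concurrent requests finished in $(( $(date +%s) - start ))s"
```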
Tesla T4 GPU list
```
localhost.localdomain  Mon May 6 18:41:30 2024  550.54.15
[0] Tesla T4 | 54°C, 93 % | 12238 / 15360 MB | ollama(12236M)
[1] Tesla T4 | 36°C,  0 % |     2 / 15360 MB |
[2] Tesla T4 | 30°C,  0 % |     2 / 15360 MB |
[3] Tesla T4 | 33°C,  0 % |     2 / 15360 MB |
```
ollama.service
```ini
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS='*'"
Environment="OLLAMA_MODELS=/ollama/ollama/models"
Environment="OLLAMA_KEEP_ALIVE=10m"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
Environment="PATH=/root/.local/bin:/root/bin:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3

[Install]
WantedBy=default.target
```
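After editing the unit file (or adding a drop-in), reload systemd and restart the service, then confirm the environment actually reached it and check which GPUs were detected:

```sh
sudo systemctl daemon-reload
sudo systemctl restart ollama
systemctl show ollama --property=Environment   # variables the server will see
journalctl -u ollama -n 50 --no-pager          # startup log lists detected GPUs
```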